grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

Unify and standardize behavior and exit codes when k6 is stopped #2804

Open na-- opened 1 year ago

na-- commented 1 year ago

k6 has multiple different ways to stop a test:

  1. Single Ctrl+C for semi-graceful shutdown
  2. Second Ctrl+C for immediate abort
  3. The test.abort() JS API from k6/execution
  4. JS script exception
  5. Via the REST API, e.g. with curl -i -X PATCH -d '{"data":{"type":"status","id":"default","attributes":{"stopped":true}}}' 'http://localhost:6565/v1/status'
  6. A threshold with abortOnFail: true

And, for reference, we currently have these exit codes predefined: https://github.com/grafana/k6/blob/1b9c5fa55598363501a3db7c44cccddc8430af8f/errext/exitcodes/codes.go#L10-L21

The problem is that stopping the test in different ways and at different times behaves very inconsistently and unpredictably, even when you account for the unavoidable peculiarities of k6... :sob: Here is what I've determined:

This is connected to https://github.com/grafana/k6/issues/2790 and https://github.com/grafana/k6/issues/1889, but goes way beyond them... All of these behaviors should be standardized before test suites (https://github.com/grafana/k6/issues/1342) can be implemented, for example.

In general, it makes sense for all external and internal test aborts to make k6 exit with a non-zero exit code. We could maybe have different exit codes, depending on the cause, so users can filter out expected ones. Maybe we can even make some of these configurable (https://github.com/grafana/k6/issues/870, https://github.com/grafana/k6/issues/680) but, by default, it makes sense to me that the default behavior should be a non-zero exit code when the test was prematurely stopped in any way.

The only way a k6 process should exit with a 0 exit code is if the test finished normally and no thresholds failed.
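A CI wrapper consuming these exit codes could be sketched like this (a hypothetical helper; code 105 `ExternalAbort` is cited later in this thread, while 99 `ThresholdsHaveFailed` is assumed from `errext/exitcodes/codes.go`):

```javascript
// Sketch: map k6 exit codes to human-readable outcomes for a CI pipeline.
// Only 0 means "finished normally and no thresholds failed".
function interpretExitCode(code) {
  switch (code) {
    case 0: return 'test finished and all thresholds passed';
    case 99: return 'thresholds failed';       // ThresholdsHaveFailed (assumed)
    case 105: return 'externally aborted';     // ExternalAbort (cited below)
    default: return `aborted or failed (exit code ${code})`;
  }
}

console.log(interpretExitCode(105)); // prints "externally aborted"
```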

na-- commented 1 year ago

There is actually a 7th way to stop a test - timeouts :sweat_smile:

Specifically, both setup() and teardown() have timeout values (60s by default and configurable by the setupTimeout and teardownTimeout options respectively). handleSummary() also has a fixed 2-minute timeout that isn't configurable at the moment.

These timeouts have their own exit and run_status codes and everything, so they are not just a normal script error or something like that.
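For reference, the configurable timeouts map to script options like this (a config sketch; the values shown are the defaults mentioned above):

```javascript
// k6 script options (sketch): both timeouts accept duration strings.
export const options = {
  setupTimeout: '60s',     // default; applies to setup()
  teardownTimeout: '60s',  // default; applies to teardown()
  // handleSummary()'s 2-minute timeout is fixed and has no option (yet).
};
```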

na-- commented 1 year ago

It's also worth considering and testing that there is probably a difference between script exceptions (e.g. throw new Error('foo');) and script errors (e.g. wrong JS syntax or things like using await in a non-async function). For example:
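The distinction can be demonstrated in plain JavaScript (outside k6; the `SyntaxError` case is the kind of thing k6 would reject during init, before any code runs):

```javascript
// A script *exception* is thrown at runtime and is observable from JS:
let caughtMessage = null;
try {
  throw new Error('foo');
} catch (e) {
  caughtMessage = e.message; // 'foo'
}

// A script *error* never executes at all -- e.g. `await` in a non-async
// function is rejected when the code is compiled:
let parseErrorName = null;
try {
  new Function('return await g();');
} catch (e) {
  parseErrorName = e.name; // 'SyntaxError'
}

console.log(caughtMessage, parseErrorName);
```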

na-- commented 1 year ago

@imiric raised a very good suggestion in https://github.com/grafana/k6/pull/2810#discussion_r1052129443 - now that https://github.com/grafana/k6/pull/2810 will add an errext.AbortReason type, which will track the internal k6 test run error, we can stop manually assigning exit codes to errors deep in the codebase :tada: We should be able to have an AbortReason -> ExitCode mapping the same way that PR adds an AbortReason -> cloudapi.RunStatus mapping!

na-- commented 1 year ago

https://github.com/grafana/k6/pull/2885 was a reminder that k6 run --paused is kind of its own weird in-between state, somewhere after initializing the VUs and before setup() execution... :disappointed: So many cases... :sob:

na-- commented 1 year ago

As https://github.com/grafana/k6/pull/2885 and https://github.com/grafana/k6/pull/2893 have proven, --linger is a significant complication in the k6 run logic...

In general, if a script error or test.abort() occurs during the VU init phase, --linger should not apply and k6 should exit immediately with the appropriate exit code for whatever aborted the init. But if test.abort() is called during the test run itself, or the test was stopped in some other way besides Ctrl+C (e.g. REST API, thresholds with abortOnFail), --linger means that k6 should not exit immediately after stopping the test.
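The rule above can be modeled as a small decision function (a sketch of the behavior as described, not k6's actual implementation; the phase and cause names are made up for illustration):

```javascript
// Model of the described --linger semantics:
// - linger never applies to aborts during the VU init phase;
// - Ctrl+C (a signal) always interrupts lingering;
// - other stop causes (REST API, abortOnFail thresholds, test.abort())
//   leave k6 lingering after the test is stopped.
function shouldLinger(lingerFlag, phase, cause) {
  if (!lingerFlag) return false;
  if (phase === 'init') return false;
  if (cause === 'signal') return false;
  return true;
}

console.log(shouldLinger(true, 'run', 'rest-api')); // true
```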

na-- commented 1 year ago

Another way a k6 test (the 8th? :scream:) can be stopped is when a specific output tells it to :weary: See https://github.com/grafana/k6/blob/b85d09d53d4373b48e08eb33ca0e011bcebdffdc/output/types.go#L69-L76

na-- commented 1 year ago

Connected to https://github.com/grafana/k6/issues/2804#issuecomment-1414094937, the exit code of a k6 run --out cloud test with K6_CLOUD_STOP_ON_ERROR=true that is stopped from the k6 cloud app is currently wrong: it's -1 (i.e. 255) when it should probably be 105 (ExternalAbort).

urugator commented 11 months ago

Would it be possible to pass custom exit code to test.abort() and also include the message and the code in the summary?

We have tests that are not concerned with metrics, but with specific functional failures that occur only under load. Typically we need to know what happened and provide some contextual data, e.g.:

  // assuming: import execution from 'k6/execution';
  if (res.status === 400 && res.json('error_code') === 666) {
    const context = JSON.stringify({
      reqBody,
      resStatus: res.status,
      resBody: res.body,
    });
    execution.test.abort(`error_code must not be 666: ${context}`);
  }

I would probably like to have something like:

execution.test.abort({
  exitCode: 1666,
  message: `error_code must not be 666`,
  context: whatever, // I assume this has to be serializable
});
/*
The summary would then contain:
type SummaryData = {
  result: {
    outcome: 'passed' | 'aborted' | ...
    exitCode: number,
    context: unknown,
  }
  // ...other props ...
}
Btw summary data type is not provided by types/k6
*/

We also have some post-test scripts that depend on what went wrong, so it's quite useful to reflect this in the exit code, without the need to inspect summary/output.

There are workarounds, like a counter + a count<1 threshold + a check with a dynamic name (contextual data serialized into the check name), or looking up a specific check in the output (contextual data in tags), but it's quite a hassle for such a simple thing.
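The counter-based workaround might look like this k6 script fragment (a sketch with hypothetical metric and check names; the dynamic check name carries the contextual data):

```javascript
import { check } from 'k6';
import { Counter } from 'k6/metrics';

// Bumped whenever the functional failure occurs; the abortOnFail
// threshold then stops the whole test.
const functionalFailures = new Counter('functional_failures');

export const options = {
  thresholds: {
    functional_failures: [{ threshold: 'count<1', abortOnFail: true }],
  },
};

export default function () {
  // stand-in for an actual http response object
  const res = { status: 400, body: '{"error_code":666}' };
  if (res.status === 400 && res.body.includes('666')) {
    functionalFailures.add(1);
    // contextual data smuggled into a dynamic check name
    check(res, { [`error_code 666: ${res.body}`]: () => false });
  }
}
```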

I noticed discussions about "angry checks" (checks interrupting iteration), maybe there should be a "furious check" (check aborting a test) :D

Please excuse me if this is not a good place to discuss this.

joanlopez commented 3 months ago

After #3876 and #3923, Cloud runs with thresholds crossed are consistently being reported as ThresholdsHaveFailed, which is what happens with local executions under the same conditions.

Before, those were being reported as CloudTestRunFailed, which wasn't accurate enough: k6 users couldn't tell why the test run failed in the cloud, even though the reason was actually known.

Both will be shipped in v0.54 🚀