Make CLI exit code useful for detecting `tmt` errors

comps commented 4 months ago

Please change the reserved exit codes (documented at https://tmt.readthedocs.io/en/stable/code/autodocs/tmt.html#tmt.cli.TmtExitCode) to use values 2 or greater for test results, leaving 1 for internal TMT-specific errors, which should always be investigated by the infrastructure owner.

These would be, e.g.:

TMT bug causing a python traceback (which naturally exits with 1)
invalid command line arguments
- or CLI arguments with invalid values, e.g. an already existing tmt run -i directory
missing python module (causing NameError, see first point)
inability to create a writable temporary directory
etc.

and potentially even

discover from a git URL returning HTTP errors
other "setup-style" tasks

LecrisUT commented 4 months ago

This would require some coordination with downstream, maybe gating it behind an environment variable such as TMT_NEW_EXIT_CODES. Some more reference ^1.

It does seem more in line to have:

1 catch-all internal error, maybe including Python module import errors, etc.
2 syntax error either in cli or fmf files
other codes still to be discussed

Pytest does not seem to use such a convention, but it's also only half aligned with tmt's. Aligning more with other standards might make sense because tmt can be used outside of test wrapping, e.g. tmt show, tmt run discover, etc.
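A minimal sketch of how the opt-in gating might look. Everything here is hypothetical: the enum and function names, the set of accepted values for TMT_NEW_EXIT_CODES, and the assumption that internal errors currently end up as 2. Only the meanings of 1 and 2 come from the list above; this is not tmt's actual code.

```python
import enum
import os


class ProposedExitCode(enum.IntEnum):
    """Hypothetical codes from the proposal above; not tmt's real enum."""

    INTERNAL_ERROR = 1  # catch-all internal error
    SYNTAX_ERROR = 2    # syntax error in CLI arguments or fmf files


# Assumption: under the current scheme, unexpected errors end up as 2.
LEGACY_ERROR = 2


def new_exit_codes_enabled(environ=None):
    """Return True when the hypothetical opt-in variable is set."""
    environ = os.environ if environ is None else environ
    value = environ.get("TMT_NEW_EXIT_CODES", "")
    return value.strip().lower() in {"1", "true", "yes"}


def internal_error_code(environ=None):
    """Pick the exit code for an internal error under either scheme."""
    if new_exit_codes_enabled(environ):
        return int(ProposedExitCode.INTERNAL_ERROR)
    return LEGACY_ERROR
```

With a gate like this, downstream consumers would keep seeing the old values until they set the variable themselves.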

psss commented 4 months ago

Yeah, I would also lean towards keeping the current exit code definition and rather catching & redirecting all errors to a proper exit code. Unless really necessary let's not break the backward compatibility.

happz commented 4 months ago

> Yeah, I would also lean towards keeping the current exit code definition and rather catching & redirecting all errors to a proper exit code. Unless really necessary let's not break the backward compatibility.

Isn't that already happening in https://github.com/teemtee/tmt/blob/main/tmt/__main__.py#L15? Everything except for SystemExit gets mapped to 2, "Errors occured during test execution.", which might be incorrect as many errors may happen before executing any test (e.g. the invalid URL used in discover).

We could add a new exit code, to be used for tmt's internal errors, but at least some of them are outside of our current control: Click will raise click.exceptions.NoSuchOption and turn it into SystemExit(2), which we do not intercept...
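To illustrate the limitation, here is a stdlib-only sketch of such a wrapper. It is not the code in tmt/__main__.py; the function name `run` and the value chosen for INTERNAL_ERROR are assumptions made purely for the example.

```python
# Hypothetical value; the thread does not settle on a number.
INTERNAL_ERROR = 5
# Current meaning of 2: "Errors occured during test execution."
TEST_ERROR = 2


def run(entry_point):
    """Call entry_point() and return the exit code the process would use."""
    try:
        entry_point()
    except SystemExit as exc:
        # Raised e.g. by the argument parser. The code is passed through
        # untouched, so a usage error exiting with 2 still looks like
        # TEST_ERROR to the caller.
        code = exc.code
        if code is None:
            return 0
        # A non-integer code (e.g. a message string) makes Python exit with 1.
        return code if isinstance(code, int) else 1
    except Exception:
        # Anything else is treated as an internal error.
        return INTERNAL_ERROR
    return 0
```

A dedicated code only helps for exceptions that reach the wrapper; anything already converted to SystemExit(2) further down would have to be intercepted before that conversion happens.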

comps commented 4 months ago

To clarify: My use case, essentially, is to catch any errors from an infrastructure PoV. I don't care if the tests reported (using custom format or not) pass/fail/error/skip/whatever.

I just want to know (somehow, ideally exit code, but doesn't have to be) that TMT failed to execute tests or retrieve their results.

Currently, I have to grep TMT's output for things like

^Error:
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$
^plan failed$
etc.
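For reference, the grep workaround amounts to something like the sketch below. The function and variable names are made up for illustration, and the tilde line is matched loosely (any run of 20 or more tildes), which is an assumption about its exact length.

```python
import re

# Patterns taken from the list above; the tilde pattern is an approximation.
INFRA_PATTERNS = [
    re.compile(r"^Error:"),
    re.compile(r"^\s*~{20,}$"),
    re.compile(r"^plan failed$"),
]


def find_infra_errors(output):
    """Return the lines of captured tmt output that match any pattern."""
    return [
        line
        for line in output.splitlines()
        if any(pattern.search(line) for pattern in INFRA_PATTERNS)
    ]
```

This is fragile by design: any change in tmt's output wording silently breaks the detection, which is why an exit code would be preferable.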

LecrisUT commented 4 months ago

Does using exit code 2 not work in this case? Any non-test errors should be mapped to it

comps commented 4 months ago

> Does using exit code 2 not work in this case? Any non-test errors should be mapped to it

The problem is that any tests which report error (due to their internal logic, or just because they want to via result:custom) will also result in exit code 2.

If this was moved to some other exit code, then I could indeed use 2.

LecrisUT commented 4 months ago

Well, yes, but isn't that intended to be the case? error could mean that e.g. beakerlib was not installed, or anything else indicating that the testing infrastructure is not well prepared.

comps commented 4 months ago

The point is that error is a completely valid result status, and we use it (in our python non-Beakerlib suite) for several cases, e.g.

a failure-waiving rule matched a pass
some timeouts inside a test were exceeded
installing a nested VM failed on Anaconda
a service (e.g. osbuild-composer) failed to restart after the test made configuration changes to it
etc.

We define, essentially,

pass = test managed to run the expected scenarios and verify that they work
fail = test managed to run the expected scenarios, verify that its results were valid (nothing test-related was broken) and the tested functionality failed
error = a problem happened that prevented the test from testing the functionality

This means if we see a fail, it's almost always a new bug.

You can't rely on a test using error to mean an infrastructure error.

teemtee / tmt

Make CLI exit code useful for detecting `tmt` errors #2756