feature(argus): New Failure Status

k0machi commented 11 months ago

This commit adds one new failure status (to be set manually for now) - TEST-ERROR. The old FAILED status is now considered a run failure, while the new one allows tracking framework or infrastructure issues (e.g. spot instance not available) or when the job fails due to testing framework bug.

Fixes #316

k0machi commented 11 months ago

Demo:

mykaul commented 11 months ago

Thanks - this is what we need - we'll be able to sort out where we are with infra (test, framework, cloud env, Jenkins) issues vs real failures.

roydahan commented 11 months ago

I don't want this status

mykaul commented 11 months ago

I don't want this status

@roydahan - please provide constructive feedback. We have today a critical issue, where we can't easily understand if failures were caused by infra or real issues. This has two implications:

We don't know the version's quality
We don't know if we improve the infra

We need a good solution here. This was an easy one.

roydahan commented 11 months ago

I already commented on the issue. There is no such a thing infra issue, if there is one, we fix and/or rerun.

We didn't accept this status long ago because we didn't want people to dismiss runs as infra issue and leave them as such. We leave them as failed "in progress", fix them and re run them. If you care about the current status for statistics, you can ask to mark these as Abort for now.

mykaul commented 11 months ago

I already commented on the issue. There is no such a thing infra issue, if there is one, we fix and/or rerun.

We didn't accept this status long ago because we didn't want people to dismiss runs as infra issue and leave them as such. We leave them as failed "in progress", fix them and re run them. If you care about the current status for statistics, you can ask to mark these as Abort for now.

So we'll abuse 'abort' for this? How would we know we've improved our infra (resiliency to failures, for example) over time?

roydahan commented 11 months ago

So we'll abuse 'abort' for this? How would we know we've improved our infra (resiliency to failures, for example) over time?

Tell me what do you call infra failure and I'll tell you if we can improve that.

spot termination isn't infra failure, it's abort. And we cannot improve that.
Jenkins failure isn't infra failure, it's abort and we improved that to the maximum.

What else failures do you see that you call them infra failures?

mykaul commented 11 months ago

So we'll abuse 'abort' for this? How would we know we've improved our infra (resiliency to failures, for example) over time?

Tell me what do you call infra failure and I'll tell you if we can improve that.

spot termination isn't infra failure, it's abort. And we cannot improve that.

Jenkins failure isn't infra failure, it's abort and we improved that to the maximum.

What else failures do you see that you call them infra failures?

SCT issue - random example from today - https://github.com/scylladb/scylla-cluster-tests/pull/6985 ) Another - dtest this time - https://github.com/scylladb/scylla-dtest/pull/3791 Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

roydahan commented 11 months ago

SCT issue - random example from today - scylladb/scylla-cluster-tests#6985 )

This is exactly the reason why we don't need "Failed - infra". It's not an infra failure, it's a change in scylla that now doesn't support something, one should fix it and re-run.

Another - dtest this time - scylladb/scylla-dtest#3791

I don't see how this one is related to the discussion here. This is an improvement to dtest hoping to workaround environmental issue we don't have control with. It's not part of Argus anyway.

Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

You cannot count all these as "Failed - Infra" and hope to get improve, since we expect everyone to contribute tests to SCT, and once we do it we may have more and more SCT issues, so you won't improve, you'll degrade. Also, not every issue in SCT should mark the test as "Failed - Infra", some are just issue in one of the nemesis or something like that, but the test continues and should be investigated. Hece, we use the "linked" issues, to specify also SCT issues. but the responsibility of the "job assignee" is to get the job to completion and till then, it's "Failed" (In progress").

mykaul commented 11 months ago

SCT issue - random example from today - scylladb/scylla-cluster-tests#6985 )

This is exactly the reason why we don't need "Failed - infra". It's not an infra failure, it's a change in scylla that now doesn't support something, one should fix it and re-run.

It's a change that literally took place months ago - and most tests were adjusted already.

Another - dtest this time - scylladb/scylla-dtest#3791

I don't see how this one is related to the discussion here. This is an improvement to dtest hoping to workaround environmental issue we don't have control with. It's not part of Argus anyway.

It's not part of argus, it's part of the infra we have to reliably run tests and get results for the quality of ScyllaDB.

Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

You cannot count all these as "Failed - Infra" and hope to get improve, since we expect everyone to contribute tests to SCT, and once we do it we may have more and more SCT issues, so you won't improve, you'll degrade. Also, not every issue in SCT should mark the test as "Failed - Infra", some are just issue in one of the nemesis or something like that, but the test continues and should be investigated. Hece, we use the "linked" issues, to specify also SCT issues. but the responsibility of the "job assignee" is to get the job to completion and till then, it's "Failed" (In progress").

Linking SCT issues to test that we'll make as 'Fail-Infra' doesn't make anything go away - we would like to fix them just as much and reduce them. This is why they are marked as a failure, and we should have a GH issue for root cause and fix.

roydahan commented 11 months ago

SCT issue - random example from today - scylladb/scylla-cluster-tests#6985 )

This is exactly the reason why we don't need "Failed - infra". It's not an infra failure, it's a change in scylla that now doesn't support something, one should fix it and re-run.

It's a change that literally took place months ago - and most tests were adjusted already.

Right, and these tests like many others haven't run since. And once someone bump into it, he shouldn't mark it as "Fail - infra" which is an easy way out for him. One should understand the failure, send the fix and re-run. till then, it's "Failed". We as release maintainers shouldn't care why it failed until there is a scylla issue attached to it. Hence, I insisted in the past and still insist to not allow "easy way out statuses" like "Failed - Infra".

Another - dtest this time - scylladb/scylla-dtest#3791

I don't see how this one is related to the discussion here. This is an improvement to dtest hoping to workaround environmental issue we don't have control with. It's not part of Argus anyway.

It's not part of argus, it's part of the infra we have to reliably run tests and get results for the quality of ScyllaDB.

Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

You cannot count all these as "Failed - Infra" and hope to get improve, since we expect everyone to contribute tests to SCT, and once we do it we may have more and more SCT issues, so you won't improve, you'll degrade. Also, not every issue in SCT should mark the test as "Failed - Infra", some are just issue in one of the nemesis or something like that, but the test continues and should be investigated. Hece, we use the "linked" issues, to specify also SCT issues. but the responsibility of the "job assignee" is to get the job to completion and till then, it's "Failed" (In progress").

Linking SCT issues to test that we'll make as 'Fail-Infra' doesn't make anything go away - we would like to fix them just as much and reduce them. This is why they are marked as a failure, and we should have a GH issue for root cause and fix.

Right, it doesn't make things go away, hence we want it to be "Failed" and want people to still hold the responsiblity to make it "Pass" or have a known scylladb issue on it.

mykaul commented 11 months ago

SCT issue - random example from today - scylladb/scylla-cluster-tests#6985 )

This is exactly the reason why we don't need "Failed - infra". It's not an infra failure, it's a change in scylla that now doesn't support something, one should fix it and re-run.

It's a change that literally took place months ago - and most tests were adjusted already.

Right, and these tests like many others haven't run since. And once someone bump into it, he shouldn't mark it as "Fail - infra" which is an easy way out for him. One should understand the failure, send the fix and re-run. till then, it's "Failed". We as release maintainers shouldn't care why it failed until there is a scylla issue attached to it. Hence, I insisted in the past and still insist to not allow "easy way out statuses" like "Failed - Infra".

Why is it any easier than marking it as failed as it is today? Why wouldn't the engineer file an issue, fix the issue, re-run the test?

Another - dtest this time - scylladb/scylla-dtest#3791

I don't see how this one is related to the discussion here. This is an improvement to dtest hoping to workaround environmental issue we don't have control with. It's not part of Argus anyway.

It's not part of argus, it's part of the infra we have to reliably run tests and get results for the quality of ScyllaDB.

Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

You cannot count all these as "Failed - Infra" and hope to get improve, since we expect everyone to contribute tests to SCT, and once we do it we may have more and more SCT issues, so you won't improve, you'll degrade. Also, not every issue in SCT should mark the test as "Failed - Infra", some are just issue in one of the nemesis or something like that, but the test continues and should be investigated. Hece, we use the "linked" issues, to specify also SCT issues. but the responsibility of the "job assignee" is to get the job to completion and till then, it's "Failed" (In progress").

Linking SCT issues to test that we'll make as 'Fail-Infra' doesn't make anything go away - we would like to fix them just as much and reduce them. This is why they are marked as a failure, and we should have a GH issue for root cause and fix.

Right, it doesn't make things go away, hence we want it to be "Failed" and want people to still hold the responsiblity to make it "Pass" or have a known scylladb issue on it.

It completely hides the quality of Scylla. If those were rare issues - fine - but they aren't. The signal to noise in our tests isn't good.

roydahan commented 11 months ago

SCT issue - random example from today - scylladb/scylla-cluster-tests#6985 )

This is exactly the reason why we don't need "Failed - infra". It's not an infra failure, it's a change in scylla that now doesn't support something, one should fix it and re-run.

It's a change that literally took place months ago - and most tests were adjusted already.

Right, and these tests like many others haven't run since. And once someone bump into it, he shouldn't mark it as "Fail - infra" which is an easy way out for him. One should understand the failure, send the fix and re-run. till then, it's "Failed". We as release maintainers shouldn't care why it failed until there is a scylla issue attached to it. Hence, I insisted in the past and still insist to not allow "easy way out statuses" like "Failed - Infra".

Why is it any easier than marking it as failed as it is today? Why wouldn't the engineer file an issue, fix the issue, re-run the test?

Because when doing so, it kinds of remove the responsibility from the "reporter" / the "assignee" which now thinks, it's an infra problem, not my problem.

Another - dtest this time - scylladb/scylla-dtest#3791

I don't see how this one is related to the discussion here. This is an improvement to dtest hoping to workaround environmental issue we don't have control with. It's not part of Argus anyway.

It's not part of argus, it's part of the infra we have to reliably run tests and get results for the quality of ScyllaDB.

Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

You cannot count all these as "Failed - Infra" and hope to get improve, since we expect everyone to contribute tests to SCT, and once we do it we may have more and more SCT issues, so you won't improve, you'll degrade. Also, not every issue in SCT should mark the test as "Failed - Infra", some are just issue in one of the nemesis or something like that, but the test continues and should be investigated. Hece, we use the "linked" issues, to specify also SCT issues. but the responsibility of the "job assignee" is to get the job to completion and till then, it's "Failed" (In progress").

Linking SCT issues to test that we'll make as 'Fail-Infra' doesn't make anything go away - we would like to fix them just as much and reduce them. This is why they are marked as a failure, and we should have a GH issue for root cause and fix.

Right, it doesn't make things go away, hence we want it to be "Failed" and want people to still hold the responsiblity to make it "Pass" or have a known scylladb issue on it.

It completely hides the quality of Scylla. If those were rare issues - fine - but they aren't. The signal to noise in our tests isn't good.

I disagree, I think that you misinterpret some of the failures considering them as "Infra" issue. For now in 2024.1, there are 7 failures that are marked with SCT issue. 3 of them are related to the kMS change one is (6982 & 6893 & 6959) 2 of either test issue (not infra) or not clear enough if it's a real scylla issue or not (6842 & 5825) 1 of them is people running a job without a scylla version or scylla image (6893) 1 is a bad configuration of monitor instance type that scylla stopped supporting.

mykaul commented 11 months ago

SCT issue - random example from today - scylladb/scylla-cluster-tests#6985 )

This is exactly the reason why we don't need "Failed - infra". It's not an infra failure, it's a change in scylla that now doesn't support something, one should fix it and re-run.

It's a change that literally took place months ago - and most tests were adjusted already.

Right, and these tests like many others haven't run since. And once someone bump into it, he shouldn't mark it as "Fail - infra" which is an easy way out for him. One should understand the failure, send the fix and re-run. till then, it's "Failed". We as release maintainers shouldn't care why it failed until there is a scylla issue attached to it. Hence, I insisted in the past and still insist to not allow "easy way out statuses" like "Failed - Infra".

Why is it any easier than marking it as failed as it is today? Why wouldn't the engineer file an issue, fix the issue, re-run the test?

Because when doing so, it kinds of remove the responsibility from the "reporter" / the "assignee" which now thinks, it's an infra problem, not my problem.

This assumes we don't trust our engineers to do their job. It's just as re-triggering CI without looking at why it failed of filing an issue on the failure. It's something that should be handled elsewhere.

Another - dtest this time - scylladb/scylla-dtest#3791

I don't see how this one is related to the discussion here. This is an improvement to dtest hoping to workaround environmental issue we don't have control with. It's not part of Argus anyway.

It's not part of argus, it's part of the infra we have to reliably run tests and get results for the quality of ScyllaDB.

Any failure that is not either a new or a known bug, I classify as infra - cloud, test framework, tests themselves.

You cannot count all these as "Failed - Infra" and hope to get improve, since we expect everyone to contribute tests to SCT, and once we do it we may have more and more SCT issues, so you won't improve, you'll degrade. Also, not every issue in SCT should mark the test as "Failed - Infra", some are just issue in one of the nemesis or something like that, but the test continues and should be investigated. Hece, we use the "linked" issues, to specify also SCT issues. but the responsibility of the "job assignee" is to get the job to completion and till then, it's "Failed" (In progress").

Linking SCT issues to test that we'll make as 'Fail-Infra' doesn't make anything go away - we would like to fix them just as much and reduce them. This is why they are marked as a failure, and we should have a GH issue for root cause and fix.

Right, it doesn't make things go away, hence we want it to be "Failed" and want people to still hold the responsiblity to make it "Pass" or have a known scylladb issue on it.

It completely hides the quality of Scylla. If those were rare issues - fine - but they aren't. The signal to noise in our tests isn't good.

I disagree, I think that you misinterpret some of the failures considering them as "Infra" issue. For now in 2024.1, there are 7 failures that are marked with SCT issue. 3 of them are related to the kMS change one is (6982 & 6893 & 6959) 2 of either test issue (not infra) or not clear enough if it's a real scylla issue or not (6842 & 5825) 1 of them is people running a job without a scylla version or scylla image (6893) 1 is a bad configuration of monitor instance type that scylla stopped supporting.

There were many more in 5.4. We just did not investigate (yet) enough on 2024.1

Anyway, my experience so far has been that we simply have too many infra issues.

The example of running a job without a scylla version or image - the Jenkins job doesn't have the description that a field is mandatory, for example. Even the documentation of the variables could have helped here. Not to mention we could have added a validator.

k0machi commented 9 months ago

@mykaul @roydahan Reworked, now includes only one new status - Test Error

Demo:

Darker shade of red for it.

roydahan commented 9 months ago

I think that the coloring will be confusing. Yellow / Orange (that is different from running) should be better IMO.

k0machi commented 9 months ago

@roydahan

How about this one then?

roydahan commented 9 months ago

@k0machi I see that this was merged, but was it deployed? I don't see this option in the list of available statuses...

scylladb / argus

feature(argus): New Failure Status #322