scylladb / argus

Apache License 2.0

Need to add FAILED-INFRA vs. FAILED-BUG (and perhaps FAILED-TEST) failures #316

Closed mykaul closed 9 months ago

mykaul commented 11 months ago

Right now, it's nearly impossible to measure how stable our infra is, how solid our tests are, and how many bugs we actually encounter. The dashboard today does not allow us to easily distinguish between failure modes. If we look at 5.4 right now, it looks catastrophic, and I don't believe that's the case: [image]

44% failure is terrible! But we know it's inaccurate, since a failure could have any of these causes. We need more granularity.

k0machi commented 10 months ago

Some SCT related changes would be required (and other infras still need to have full API implemented for them, like #289)

mykaul commented 10 months ago

> Some SCT related changes would be required (and other infras still need to have full API implemented for them, like #289)

Unsure why - the decision on the failure type is up to the engineer investigating the run, no? I don't mind if the initial status is FAILED-BUG, as long as I can manually move it to FAILED-INFRA.

k0machi commented 10 months ago

> > Some SCT related changes would be required (and other infras still need to have full API implemented for them, like #289)

> Unsure why - the decision on the failure type is up to the engineer investigating the run, no? I don't mind if the initial status is FAILED-BUG, as long as I can manually move it to FAILED-INFRA.

Ah, in that case it would be simpler. I was thinking we could actually catch an infra failure (for example, a Spot Termination Error) vs. a test failure (the current "Failed" indicator logic).

mykaul commented 10 months ago

In the future, sure - we can catch spot termination, for example. But that's for later.

fruch commented 10 months ago

I'd rather suggest it be based on the issues attached to the run.

The engineer can make the call and say whether the run is considered passed or not.

If you want stats, one job can have both a coredump and an infra issue.

Adding such a status is not gonna give you a clear picture if people are not gonna update it.
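As a rough illustration of the issue-based approach, the sketch below tallies failure causes from the issues attached to each run, so a single run with both a coredump and an infra problem is counted under both causes. The `Run` and `AttachedIssue` types, the cause labels, and the sample data are hypothetical and are not taken from the Argus data model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AttachedIssue:
    title: str
    cause: str  # hypothetical label set by the engineer: "bug", "infra", "test"


@dataclass
class Run:
    run_id: str
    issues: list[AttachedIssue] = field(default_factory=list)
    considered_pass: bool = False  # the engineer's call on the run


def cause_breakdown(runs):
    """Count failed runs per cause; a run with several causes counts once per cause."""
    counts = Counter()
    for run in runs:
        if run.considered_pass:
            continue
        causes = {issue.cause for issue in run.issues}
        if not causes:
            # Nobody attached an issue yet, so the cause is unknown.
            counts["unclassified"] += 1
        for cause in causes:
            counts[cause] += 1
    return counts


# Made-up sample data for illustration only.
runs = [
    Run("r1", [AttachedIssue("coredump", "bug"),
               AttachedIssue("spot instance terminated", "infra")]),
    Run("r2", [AttachedIssue("spot instance terminated", "infra")]),
    Run("r3"),
    Run("r4", [AttachedIssue("known flaky check", "test")], considered_pass=True),
]
print(cause_breakdown(runs))
```

The "unclassified" bucket makes the concern above visible: if engineers do not attach issues, the breakdown says so instead of silently looking clean.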

fruch commented 10 months ago

Anyhow, I'm asking for @roydahan's opinion on it as well.

roydahan commented 10 months ago

There is no such thing as Failed - infra. If something failed due to an infra issue, it needs to be solved and rerun. Hence there is no point in holding such a state.

mykaul commented 10 months ago

> There is no such thing as Failed - infra. If something failed due to an infra issue, it needs to be solved and rerun. Hence there is no point in holding such a state.

How do you classify spot termination then?

roydahan commented 10 months ago

I classify them as failed, and one needs to rerun them. They don't enter the statistics anyway; only the last run is part of the statistics you see in the top bar.
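The "only the last run counts" rule can be sketched as follows, assuming each result carries a test name, a sortable start time, and a status. The `RunResult` type, the test names, and the status strings are hypothetical and do not reflect how Argus actually stores or aggregates runs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunResult:
    test_name: str
    started_at: int  # hypothetical: any sortable timestamp
    status: str      # hypothetical values: "passed", "failed"


def latest_run_stats(results):
    """Keep only the most recent run of each test, then count statuses."""
    latest = {}
    for result in results:
        current = latest.get(result.test_name)
        if current is None or result.started_at > current.started_at:
            latest[result.test_name] = result
    return Counter(r.status for r in latest.values())


# Made-up sample data: the first test failed once and passed on rerun.
results = [
    RunResult("test-a", 1, "failed"),
    RunResult("test-a", 2, "passed"),
    RunResult("test-b", 1, "failed"),
]
stats = latest_run_stats(results)
failure_rate = stats["failed"] / sum(stats.values())
print(stats, f"{failure_rate:.0%}")
```

Under this rule an infra failure that has been rerun successfully drops out of the top-bar numbers, while one that has not been rerun still counts as a plain failure.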

fruch commented 9 months ago

@roydahan

Again, we do want this state, but let's first agree on its name (i.e. you didn't like failed-infra). We are open to suggestions.

roydahan commented 9 months ago

Instead of "Failed-Infra", we would like to have a status called "Test Error" that is marked with a different color (orange?). This can be introduced now; later I would like it to be set automatically for failures we will define, such as "SpotTermination".
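A minimal sketch of what this proposal could look like, assuming a simple status enum. The names `RunStatus`, `STATUS_COLORS`, `KNOWN_TEST_ERRORS`, and `resolve_status` are hypothetical and not part of Argus; only the "Test Error" name, the orange color, and the "SpotTermination" example come from the comment above.

```python
from enum import Enum


class RunStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    TEST_ERROR = "test_error"  # proposed in place of "Failed-Infra"


# Hypothetical display colors; only orange for TEST_ERROR is suggested in the thread.
STATUS_COLORS = {
    RunStatus.PASSED: "green",
    RunStatus.FAILED: "red",
    RunStatus.TEST_ERROR: "orange",
}

# Failure names to auto-classify later; "SpotTermination" is the only one named so far.
KNOWN_TEST_ERRORS = {"SpotTermination"}


def resolve_status(reported, failure_names=(), manual_override=None):
    """Pick the status to display for a run.

    A manual choice by the investigating engineer always wins. Otherwise a
    failed run is reclassified as TEST_ERROR when a known failure name matches.
    """
    if manual_override is not None:
        return manual_override
    if reported is RunStatus.FAILED and KNOWN_TEST_ERRORS & set(failure_names):
        return RunStatus.TEST_ERROR
    return reported


status = resolve_status(RunStatus.FAILED, ["SpotTermination"])
print(status.name, STATUS_COLORS[status])
```

Keeping the manual override separate from the automatic rule matches the two phases described above: engineers can set "Test Error" by hand now, and the set of known failures can grow later without changing how overrides work.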