kubernetes / sig-release

Repo for SIG release

Implement release blocking job criteria #347

Closed. spiffxp closed this issue 4 years ago.

spiffxp commented 5 years ago

This is an umbrella issue for followup work to https://github.com/kubernetes/sig-release/pull/346

That PR describes aspirational release blocking job criteria. This issue is intended to track the followup work, including:

/sig release (this is sig-release policy)
/sig testing (this will be assisted by sig-testing tooling)

EDIT 2019-07-23: AFAIK metrics is the only thing that remains to close this out

jberkus commented 5 years ago

Are we consolidating all non-blocking dashboards into -informing?

I'd be in favor of that.

BenTheElder commented 5 years ago

/cc

spiffxp commented 5 years ago

/milestone v1.14

I would like for us to implement this for the v1.14 release

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

justaugustus commented 5 years ago

/remove-lifecycle stale

justaugustus commented 5 years ago

/help
/milestone v1.15

spiffxp commented 5 years ago
  • assignment of owners to all jobs
  • descriptions added to all release-master-blocking jobs
  • propose the creation of sig-foo-alerts@googlegroups.com or reuse of sig-foo-test-failures for all sigs that need to be responsive to test failures

/assign I'm handling this in https://github.com/kubernetes/sig-release/issues/441

  • the creation of the release-informing dashboard, and moving jobs out of release-master-blocking to that dashboard

this is done

  • a bigquery run that generates metrics for jobs currently on the release-master-blocking dashboard

I would recommend someone look at https://github.com/kubernetes/test-infra/tree/master/metrics for this
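For a concrete picture of what such metrics might cover, here is a minimal sketch (not the actual test-infra metrics code) that evaluates a job's recent runs against the release-blocking criteria discussed in this issue. The `Run` record format is hypothetical; in practice the data would come from the BigQuery build tables that the kubernetes/test-infra metrics tooling queries.

```python
# Rough sketch: summarize one job's runs from the past week against the
# release-blocking criteria (pass rate, max runtime, run frequency).
# The Run record below is hypothetical; real data would be pulled from the
# BigQuery build tables used by kubernetes/test-infra/metrics.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Run:
    job: str
    started: datetime
    duration_minutes: float
    passed: bool


def job_health(runs: List[Run], now: datetime) -> Dict[str, object]:
    """Check one job's past-week runs against the blocking criteria."""
    week = sorted(
        (r for r in runs if now - r.started <= timedelta(days=7)),
        key=lambda r: r.started,
    )
    pass_rate = sum(r.passed for r in week) / len(week) if week else 0.0
    max_duration = max((r.duration_minutes for r in week), default=0.0)
    gaps_hours = [
        (later.started - earlier.started).total_seconds() / 3600
        for earlier, later in zip(week, week[1:])
    ]
    max_gap_hours = max(gaps_hours, default=float("inf"))
    return {
        "pass_rate": round(pass_rate, 2),
        "pass_rate_ok": pass_rate >= 0.75,   # passes 75% of runs in the past week
        "runtime_ok": max_duration <= 120,   # finishes within 120m
        "frequency_ok": max_gap_hours <= 3,  # runs at least once per 3 hours
    }
```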

wojtek-t commented 5 years ago

Wrt https://github.com/kubernetes/test-infra/pull/13252#issuecomment-507142109, I wanted to share a couple of thoughts on the policy for making a test job release-blocking (please redirect me to a better place if this is not the one I should be commenting on):

The motivation is that 1.15 was actually blocked for 2 days due to a scalability regression, even though the gce-master-scale-performance job wasn't marked as release-blocking. So either we need to officially say we can't block on scalability, or we need to adjust the policy so that it will work.

It finishes within 120m and runs at least once per 3 hours.

Those two things are what I believe we should adjust. Even though we're already working on speeding up our tests, it's not realistic to expect that they will finish within 2h (I can easily say that they for sure will NOT finish within 2h even when we do everything that is planned, both shorter and longer term). And running every 3h will not be possible due to the huge cost of those tests. What we're targeting (https://groups.google.com/forum/#!topic/kubernetes-dev/9BhE8Pd0oAk) is to have those tests finish within 4h and run them twice a day.

What is important to say is that we're not running those tests on release branches at all (due to the cost it would generate), and I actually believe that is a very important point. I really believe the policy should differentiate between what is allowed when cutting a new minor release vs. what is allowed when cutting just a patch release on an existing minor release. The requirements, in my opinion, can be different.

It passes 75% of all of its runs in the past week

I'm not sure I agree with this one. I think there is a significant difference between regressions and flakes - while I agree with the goal of reducing flakiness, if a regression happens (and is hard to debug/fix) we may end up being red for quite some time.

Scalability tests are one example, but I think it can be somewhat similar for upgrade tests. I would also really like to see "soak tests" become release-blocking at some point, and those may have exactly the same problems as scalability tests.

So to summarize, my personal thoughts are that we should try to:

We should probably meet and discuss all of the above deeper too.

alejandrox1 commented 5 years ago

/cc

BenTheElder commented 5 years ago

puts on release and testing hats

It finishes within 120m and runs at least once per 3 hours.

This is because release blocking tests are expected to pass multiple times in a row before the release happens so we know the passes aren't themselves a flake.

3 hours means that 3 passes take 9 hours (!); making this longer is going to make that a difficult proposition.

I'm not sure I agree with this one. I think there is a significant difference between regressions and flakes - while I agree with the goal of reducing flakiness, if a regression happens (and is hard to debug/fix) we may end up being red for quite some time.

I believe this requirement is about proving that the job functions reliably and is not flaky prior to moving it into blocking status.

Obviously if it turns red while blocking then it is giving us signal to block.

I would also really like to see "soak tests" become release-blocking at some point, and those may have exactly the same problems as scalability tests.

They did, but they're entirely too unreliable, which is why the criteria above exist. If they can be made reliable then they can block the release.


Note that "informing" exists for a reason, it isn't used for this "hard gate until N times green across the board" but it is used to "inform" the release team that they probably shouldn't release...

jberkus commented 5 years ago

Wrt kubernetes/test-infra#13252 (comment), I wanted to share a couple of thoughts on the policy for making a test job release-blocking (please redirect me to a better place if this is not the one I should be commenting on):

This is the right place.

The motivation is that 1.15 was actually blocked for 2 days due to a scalability regression, even though the gce-master-scale-performance job wasn't marked as release-blocking. So either we need to officially say we can't block on scalability, or we need to adjust the policy so that it will work.

When the job was moved, it was a specific statement that we wouldn't block on the scalability jobs. I think that there is a misunderstanding here about the role of "informing". The tests in Informing are still supposed to be green, and CI signal and the SIGs should be actively working to make them green whenever they're not. Informing does not mean "ignore these tests". The difference between Informing and Blocking is "it's 24 hours until the release, do we automatically stop it because this job just turned red"?

Those two things are what I believe we should adjust. Even though we're already working on speeding up our tests, it's not realistic to expect that they will finish within 2h (I can easily say that they for sure will NOT finish within 2h even when we do everything that is planned, both shorter and longer term). And running every 3h will not be possible due to the huge cost of those tests. What we're targeting (https://groups.google.com/forum/#!topic/kubernetes-dev/9BhE8Pd0oAk) is to have those tests finish within 4h and run them twice a day.

What is important to say is that we're not running those tests on release branches at all (due to the cost it would generate), and I actually believe that is a very important point.

Wait, what? Can you explain?

I really believe the policy should differentiate between what is allowed when cutting a new minor release vs. what is allowed when cutting just a patch release on an existing minor release. The requirements, in my opinion, can be different.

s/minor/major, but yes, I agree.

It passes 75% of all of its runs in the past week

I'm not sure I agree with this one. I think there is a significant difference between regressions and flakes - while I agree with the goal of reducing flakiness, if a regression happens (and is hard to debug/fix) we may end up being red for quite some time.

Scalability tests are one example, but I think it can be somewhat similar for upgrade tests. I would also really like to see "soak tests" become release-blocking at some point, and those may have exactly the same problems as scalability tests.

Please understand the meaning of the time/success metrics for blocking. The goal here is easy to spell out:

"We need to be able to tell within a 24-hour period whether the release is good or not, and we'd prefer to be able to tell within 8 hours."

Your adjustment suggestions fail this criterion. If a job is only running twice a day, and it is allowed to be "flaky", generating spurious failures due to race conditions 1/3 of the time, then it will take (on average) two days to determine if we have clean signal or not, and if there's a real problem, then each fix-and-test cycle takes 2 days. That's simply not acceptable if we're holding a release back.
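To put rough numbers on that, here is a small illustrative calculation (the pass rates and cadences are assumptions chosen to match the criteria under discussion, not measurements), using the standard expected waiting time for r consecutive successes of independent trials.

```python
# Back-of-the-envelope time-to-clean-signal, assuming independent runs.
# Expected number of runs until r consecutive passes, each passing with
# probability p: E = (1 - p^r) / ((1 - p) * p^r).

def expected_runs_to_consecutive_greens(p: float, r: int = 3) -> float:
    return (1 - p**r) / ((1 - p) * p**r)

for pass_rate in (0.75, 0.95):  # assumed per-run pass rates
    runs_needed = expected_runs_to_consecutive_greens(pass_rate)
    for label, runs_per_day in (("every 3h", 8), ("twice a day", 2)):
        days = runs_needed / runs_per_day
        print(f"p={pass_rate:.2f}, {label}: ~{runs_needed:.1f} runs, ~{days:.1f} days to 3 greens in a row")
```

With a 75% per-run pass rate, 3 consecutive greens take roughly 5.5 runs on average: under a day at 8 runs/day, but close to 3 days at 2 runs/day, which is the gap the blocking criteria are trying to close.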

The answer here is for the test to either not be flaky, or to run 6 times a day. Currently the blocking criteria require both; we can probably be flexible about one of the criteria, but not both at the same time. We could even set the criteria in either-or format, although I'd rather just have tests not be flaky.

This is particularly true of the scalability jobs, which as you point out are very expensive to run. I think the answer here is to look at why these tests are flaky, and try to fix that, given that each job flake is a big waste of money.

Summary: you seem to be arguing that we should change the Blocking criteria in order to suit the behavior of the current test jobs, no matter how that affects the release timeline. I'm saying no: the test jobs need to implement the blocking criteria.

alejandrox1 commented 5 years ago

/milestone v1.16

wojtek-t commented 5 years ago

To answer the above:

3 hours means that 3 passes take 9 hours (!); making this longer is going to make that a difficult proposition.

That's why I'm saying the criteria may be different for patch vs. minor releases - while 9 hours is a long time for patch releases, I actually don't agree it's that much when we talk about minor releases that happen once per quarter.

I believe this requirement is about proving that the job functions reliably and is not flaky prior to moving it into blocking status.

Obviously if it turns red while blocking then it is giving us signal to block.

I completely agree with the argument about flakiness - it has to be 0% or very close to it. But in the case of scalability I'm not aware of flakes in the past couple of months - in fact, for most of the green runs we just got lucky and those should have been red. All of the failures were actually the result of some regression.

I would also really like to see "soak tests" become release-blocking at some point, and those may have exactly the same problems as scalability tests.

They did, but they're entirely too unreliable, which is why the criteria above exist. If they can be made reliable then they can block the release.

But the criteria you have exclude soak tests forever - you can't have a soak test that runs for just a couple of hours; it just doesn't make any sense, because it's not a soak test at that point.

When the job was moved, it was a specific statement that we wouldn't block on the scalability jobs. I think that there is a misunderstanding here about the role of "informing".

That should have been communicated to SIG scalability and it wasn't - we didn't even have a chance to discuss and/or escalate.

The tests in Informing are still supposed to be green, and CI signal and the SIGs should be actively working to make them green whenever they're not. Informing does not mean "ignore these tests". The difference between Informing and Blocking is "it's 24 hours until the release, do we automatically stop it because this job just turned red"?

I actually believe we should stop. Maybe the required action is not necessarily "wait for the next 3 runs to be green", but rather "wait for the people responsible for those tests to give a red/green light" - but there should be a signal to block. Otherwise, having those tests doesn't make any sense - we can still release with a regression/bug.

What is important to say is that we're not running those tests on release branches at all (due to the cost it would generate), and I actually believe that is a very important point.

Wait, what? Can you explain?

A single 5k-node scalability test run costs something like $3000 - we are already spending too much money on that and we can't afford to run it on all release branches. Given that I've never heard of any regression on a release branch (since almost all PRs are first merged to master and only then backported to release branches), the costs are hard to justify.

I really believe the policy should differentiate between what is allowed when cutting a new minor release vs. what is allowed when cutting just a patch release on an existing minor release. The requirements, in my opinion, can be different.

s/minor/major, but yes, I agree.

Actually minor - we have major.minor.patch (major has been 1 for the last 4 years).

Please understand the meaning of the time/success metrics for blocking. The goal here is easy to spell out:

"We need to be able to tell within a 24-hour period whether the release is good or not, and we'd prefer to be able to tell within 8 hours."

Your adjustment suggestions fail this criterion. If a job is only running twice a day, and it is allowed to be "flaky", generating spurious failures due to race conditions 1/3 of the time, then it will take (on average) two days to determine if we have clean signal or not, and if there's a real problem, then each fix-and-test cycle takes 2 days. That's simply not acceptable if we're holding a release back.

But we're doing it anyway. So what's the point in lying to ourselves that we don't do that? We just did it 2 weeks ago. Maybe the process should be different - maybe after a flake it's the SIG's decision to say whether it's fixed or not. Maybe we can figure out something better. But we shouldn't try to pretend that we have cool criteria while we are holding the release anyway. Or we should escalate by saying "we don't care about scalability" - but that's a higher-level decision and I would like it to be brought in front of sig-architecture and/or the steering committee.

The answer here is for the test to either not be flaky, or to run 6 times a day.

I can buy the argument about "not being flaky". The problem that we have with scalability is slightly different: we have "false positives", meaning that some green runs shouldn't really be green - we just got lucky. It's a bit like with races - the fact that 90% of runs are green doesn't mean the race doesn't exist. As I mentioned above, I'm not aware of any recent flake where we got "unlucky" and failed tests we didn't have a regression in.

Currently the blocking criteria require both; we can probably be flexible about one of the criteria, but not both at the same time. We could even set the criteria in either-or format, although I'd rather just have tests not be flaky.

The question is how you define flakiness here. If tests turn flaky after a regression, that in my opinion doesn't mean "flakes" - they just got broken. Which is the case we're talking about with scalability.

This is particularly true of the scalability jobs, which as you point out are very expensive to run. I think the answer here is to look at why these tests are flaky, and try to fix that, given that each job flake is a big waste of money.

This is what I don't agree with - see above. I claim they have not been flaky for the last 2 releases or so.

Summary: you seem to be arguing that we should change the Blocking criteria in order to suit the behavior of the current test jobs, no matter how that affects the release timeline. I'm saying no: the test jobs need to implement the blocking criteria.

First we need to answer a more fundamental question:

What is more important for us: quality or predictability of releases?

If there is common agreement that it's the latter, I will give up. But if we agree that the quality of our releases is more important, then I actually have a very strong opinion that we should adjust the rules.

jberkus commented 5 years ago

The problem with 1.15 wasn't the criteria. It was that the scalability test jobs were failing for weeks without being fixed. They didn't start failing unexpectedly 2 days before the release. And they're failing now. I didn't actually know about the false positives; that makes things worse, not better.

BenTheElder commented 5 years ago

responding to just a few of these that I think need more discussion in particular...

That should have been communicated to SIG scalability and it wasn't - we didn't even have a chance to discuss and/or escalate.

https://github.com/kubernetes/test-infra/pull/10914#event-2093100984 you were a reviewer on the PR FWIW, along with another SIG-scalability member, with no comments ...

I don't remember who all was in the discussion, but I would also point out that perhaps the charters need tweaking; the scopes would suggest that this is a SIG-Release issue (not claiming it is or isn't, merely pointing to the existing documented scope...)

https://github.com/kubernetes/community/blob/master/sig-release/charter.md#in-scope https://github.com/kubernetes/community/blob/master/sig-scalability/charter.md#in-scope

I actually believe we should stop. Maybe the required action is not necessarily "wait for the next 3 runs to be green", but rather "wait for the people responsible for those tests to give a red/green light" - but there should be a signal to block. Otherwise, having those tests doesn't make any sense - we can still release with a regression/bug.

The release-blocking dashboard is the "wait for 3 runs to be green" dashboard, so this should be a different one (which is currently the informing dashboard; perhaps it should be split further).

But the criteria you have exclude soak tests forever - you can't have a soak test that runs for just a couple of hours; it just doesn't make any sense, because it's not a soak test at that point.

This is a moot point given that we have no functioning soak tests to begin with. If we had them, I'm sure we'd figure out how to monitor them.

wojtek-t commented 5 years ago

The problem with 1.15 wasn't the criteria. It was that the scalability test jobs were failing for weeks without being fixed.

There were 10+ regressions in this release cycle, including very hard ones - as you may imagine, it's not easy to debug and fix a problem where memory allocations in Golang slowed down: https://github.com/kubernetes/kubernetes/issues/75833. To be clear: I'm not trying to say we are in great shape and no work is needed on our side - we're not, and we are already working on making the situation better - see my retrospective email on kubernetes-dev a couple of days ago. But there were cases like this one where there wasn't much more we could do given the regression had already happened - we couldn't revert Go, because a number of things relied on it. [The action item from this particular case is that we are now working with the Golang folks to validate the next release before it happens, and we have already discovered a regression there: https://github.com/golang/go/issues/32828]

kubernetes/test-infra#10914 (comment) you were a reviewer on the PR FWIW, along with another SIG-scalability member, with no comments ...

Come on - I'm not able to follow all PRs I'm a reviewer on - I bet I didn't see 75% of those, or even more. If something touches scalability, it should really have been communicated clearly to the scalability SIG.

but I would also point out that perhaps the charters need tweaking; the scopes would suggest that this is a SIG-Release issue (not claiming it is or isn't, merely pointing to the existing documented scope...)

I agree we should have this discussion.

The release-blocking dashboard is the "wait for 3 runs to be green" dashboard, so this should be a different one (which is currently the informing dashboard; perhaps it should be split further).

I can buy the argument that it should be a different one. But I don't agree with those tests being part of the "informing" tab - it doesn't have clear semantics; depending on the person making the decision, they can either block the release or not - that's the worst situation to be in. The situation should be clear: either they block or they don't.

As I mentioned, we probably need a tab with a bit different criteria (maybe something like manual-release-blocking - if there is a red run in the last N(=3?) runs, the SIG responsible for those tests provides the information on whether this is a release blocker or not). I'm happy to brainstorm more here, but that discussion has to happen.

But the criteria you have exclude soak tests forever - you can't have a soak test that runs for just a couple of hours; it just doesn't make any sense, because it's not a soak test at that point.

This is a moot point given that we have no functioning soak tests to begin with. If we had them, I'm sure we'd figure out how to monitor them.

This is an argument I completely disagree with. The point of having the policy is not to have to adjust it whenever some new test needs to be added. The point of having it is to make it universal enough that it fits different kinds of tests. TBH, even if I had time, I wouldn't start working on making soak tests pass now given the policy, because I have no guarantee that we could reasonably make them release-blocking. Even though I really believe they are super important.

wojtek-t commented 5 years ago

And I would like to get back to this point from @jberkus for a moment:

Summary: you seem to be arguing that we should change the Blocking criteria in order to suit the behavior of the current test jobs, no matter how that affects the release timeline. I'm saying no: the test jobs need to implement the blocking criteria.

As I mentioned previously, I completely don't buy it, and we should first answer the question: what is more important for us, quality or predictability of releases?

Imagine you go to a restaurant and order dinner for 6pm. At exactly 6pm you get your dinner, but the meat is completely raw (it needed 5 more minutes of cooking). And the waiter comes to you and says: "It's exactly 6pm and your dinner is here. We did our job." Is that what you really expect? Wouldn't you prefer to wait those 5 extra minutes and get the meal you really wanted?

I claim it's exactly the situation here - if we ignore the existence of things like scalability tests, we risk releasing a version that is unusable at scale. I bet that given our quarterly release cadence, people would rather wait 3 or 5 days if needed and get a higher-quality release than get it 3 days earlier.

Now to the point - I'm NOT trying to say that "scalability tests should be part of the release-blocking tab, period". It seems they belong to a different category. But that category doesn't exist now. And we need to work together to make something like that happen, rather than silently ignoring the problem.

jberkus commented 5 years ago
The problem with 1.15 wasn't the criteria. It was that the scalability test jobs were failing for weeks without being fixed.

There were 10+ regressions in this release cycle, including very hard ones - as you may imagine, it's not easy to debug and fix a problem where memory allocations in Golang slowed down: kubernetes/kubernetes#75833. To be clear: I'm not trying to say we are in great shape and no work is needed on our side - we're not, and we are already working on making the situation better - see my retrospective email on kubernetes-dev a couple of days ago.

I'm not saying that there weren't reasons for the jobs to be red. I'm saying that putting the jobs on master-blocking wouldn't have made a difference. People already knew that the jobs were red.

I was not on the CI signal team this cycle, so I'm not clear on why folks on the Release Team didn't know that the red jobs related to stuff that was still seriously broken. For that matter, I don't know why your team wasn't clear on the release date. This was a serious communications problem, and it had nothing to do with what dashboard anything is on.

Now to the point - I'm NOT trying to say that "scalability tests should be part of the release-blocking tab, period". It seems they belong to a different category. But that category doesn't exist now. And we need to work together to make something like that happen, rather than silently ignoring the problem.

That category is "master-informing". That's why that category exists.

"master-blocking/release-blocking" means that "if these jobs are red, we do not release". I realize we don't always fulfill that goal, but that's what we have to be moving towards, not away from. The goal is to eventually have automation that blocks release if a release-blocking test is red.

Materially, the longer-running scalability jobs will never fit those criteria.

And factually, we released 1.15 with those scalability tests red. So clearly we have made a decision that they are not blocking.

The scalability jobs aren't the only jobs we care about that are like this. Upgrade jobs, soak tests (if they ever get revived), etc. Many conformance tests will be like this if they ever get reliable enough to be on a sig-release dashboard at all. That's why we have master-informing and release-informing. It's not "it's OK for these to be red"; it's "these jobs may be red if we know exactly why and decide it's OK".

wojtek-t commented 5 years ago

I was not on the CI signal team this cycle, so I'm not clear on why folks on the Release Team didn't know that the red jobs related to stuff that was still seriously broken. For that matter, I don't know why your team wasn't clear on the release date. This was a serious communications problem, and it had nothing to do with what dashboard anything is on.

Our team was aware of the release date - there wasn't much more we could do (given that some people, including myself, were also on vacation at that time). TBH, I'm not sure there was a communication issue here - the problem seems to be (or at least that is my understanding of the situation) that the release team wasn't treating the scalability tests as something that is supposed to pass before we release, or that requires a confirmation from the SIG that it's safe to release with an issue.

That's why we have master-informing and release-informing. It's not "it's OK for these to be red"; it's "these jobs may be red if we know exactly why and decide it's OK".

I think this is crucial here - is it documented somewhere? What I'm saying is that the release team wasn't interpreting those tests like that. If that's the expected semantics, we should make it clear and introduce a process to ensure that the OK decision is explicit (e.g. an issue that has to be responded to with "ok-to-release" or something like that by the SIG owning a given set of tests). Once we have a clear process for how to proceed with failing "master-informing" tests, then it would sound more reasonable.

jberkus commented 5 years ago

I think this is crucial here - is it documented somewhere?

Adding issue now.

What I'm saying is that the release team wasn't interpreting those tests like that. If that's the expected semantics, we should make it clear and introduce a process to ensure that the OK decision is explicit (e.g. an issue that has to be responded to with "ok-to-release" or something like that by the SIG owning a given set of tests).

That can't be a hard requirement, because some tests are in master-informing specifically because there isn't a responsive team behind them (unlike the performance tests). However, I think it's worth an addition to the CI signal handbook specifically about the performance tests.

spiffxp commented 5 years ago

https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md#release-informing-criteria currently says

Jobs that are considered useful or necessary to inform whether a commit on master is ready for release, but that clearly do not meet release-blocking criteria, may be placed on the sig-release-master-informing dashboard. These are often jobs that are some combination of slow, flaky, infrequent, or difficult to maintain.

These jobs may still block the release, but because they require a lot of manual, human interpretation, we choose to move them to a separate dashboard.

Where else do we need this information documented?

BenTheElder commented 5 years ago

Come on - I'm not able to follow all PRs I'm a reviewer on - I bet I didn't see 75% of those, or even more.

This is sort of the mechanism by which people review PRs ... I really hope you are not missing 75% or more of these 😕

If something touches scalability, it should really have been communicated clearly to the scalability SIG.

What is the concrete ask here?

This is an argument I completely disagree with. The point of having the policy is not to have to adjust it whenever some new test needs to be added. The point of having it is to make it universal enough that it fits different kinds of tests.

That's not really reasonable; policies should be expected to evolve, not be set in stone, and tomorrow I may introduce some unforeseen form of testing that we need to account for.

I also don't think we should be trying to handle soak testing at all right now. There's no useful soak testing on which to inform policy.


... have a confirmation from the sig that it's safe to release with an issue.

... the SIG responsible for those tests provides the information on whether this is a release blocker or not.

  • they run against master and not the release branch, so sometimes they're only red on master

  • there are known and hard-to-solve issues with the tests themselves, which means that sometimes a red run isn't actually a problem with production kubernetes

  • there are even false positives, per you

Semi-serious:

I'm not seeing an actual resolution in this thread, but it sounds like, to actually know the status of scalability, the release team is generally being asked to consult with SIG Scalability anyhow, so why not skip the bikeshed over dashboards and go straight to that?

If scalability staffed part of the CI signal work, we could just read the normal release dashboard status straight from testgrid and get the scale status from consulting with the scalability staff. In the comments above I see a list of cases where inspecting testgrid results for scale testing is claimed to not be sufficient anyhow.

jberkus commented 5 years ago

If scalability staffed part of the CI signal work, we could just read the normal release dashboard status straight from testgrid and get the scale status from consulting with the scalability staff. In the comments above I see a list of cases where inspecting testgrid results for scale testing is claimed to not be sufficient anyhow.

Yep, and this is exactly why the large scalability tests are back in Informing. I just need to update some documentation and we can close this issue.

jeefy commented 5 years ago

/assign @jberkus

wojtek-t commented 5 years ago

This is sort of the mechanism by which people review PRs ... I really hope you are not missing 75% or more of these 😕

I actually bet I do if people don't ping me directly. I just opened a list of the PRs where I was a reviewer, and I didn't see the majority of those. If something is critical and really requires my attention, I should at the very least be assigned as an approver (I'm trying to follow those, but I'm also missing some percentage of them).

That's not really reasonable; policies should be expected to evolve, not be set in stone, and tomorrow I may introduce some unforeseen form of testing that we need to account for. I also don't think we should be trying to handle soak testing at all right now. There's no useful soak testing on which to inform policy.

I disagree with that. We shouldn't create a policy that we know won't work as soon as we have soak tests. Those are important enough that I really believe we will prioritize them soon.

I'm not seeing an actual resolution in this thread,

There was an hour-long discussion during the Zoom meeting, where I think we roughly converged. @jberkus is actually documenting the outcome of it in his PR.

BenTheElder commented 5 years ago

I actually bet I do if people don't ping me directly. I just opened a list of the PRs where I was a reviewer, and I didn't see the majority of those. If something is critical and really requires my attention, I should at the very least be assigned as an approver (I'm trying to follow those, but I'm also missing some percentage of them).

This is a separate discussion but ... this kinda defeats the purpose of having assigned reviewers. Please consider dropping a note to contribex about why / how this doesn't work so we can fix this. :/

I disagree with that. We shouldn't create a policy that we know won't work as soon as we have soak tests. Those are important enough that I really believe we will prioritize them soon.

There has been no indication that we will have soak tests.

What we call soak tests now are easily among the longest-failing tests we have and are likely to be removed in the near future... These have been relatively unmaintained for on the order of year(s). Who is prioritizing them? I've heard zero discussion of this in SIG-Testing or SIG-Release ...

There was an hour-long discussion during the Zoom meeting, where I think we roughly converged. @jberkus is actually documenting the outcome of it in his PR.

Excellent. :+1:

spiffxp commented 5 years ago

I've got an initial attempt at a dashboard that displays metrics relevant to release-blocking criteria: https://github.com/kubernetes/test-infra/issues/13879#issuecomment-521044404

http://velodrome.k8s.io/dashboard/db/job-health-release-blocking

spiffxp commented 5 years ago

I caught that the bazel jobs were postsubmits that didn't meet the "scheduled at least 3 hours" criterion, so I swapped them with periodics that did: https://github.com/kubernetes/test-infra/pull/13907

spiffxp commented 5 years ago

The serial job takes way too long, and is failing due to timeout. I think we should kick out egregious offenders, and encourage a pattern of adding a feature-specific job if the feature is truly necessary to be release-blocking.

Opened issues to start this for

tpepper commented 5 years ago

/assign @tpepper

guineveresaenger commented 5 years ago

/assign @guineveresaenger

spiffxp commented 5 years ago

@msau42 is looking to split out serial storage tests into another release-blocking job as well https://github.com/kubernetes/test-infra/pull/13936

I feel like splitting tests into more parallel blocking jobs is a sound approach for now. But it's only going to get us so far before we run into new limits:

At some point it's worth questioning why these tests need to be release-blocking, and if there is some sort of bar they should be held to. We presumably do this for Conformance tests, though IMO it's not as rigorously measured as it could be, and relies on extensive human review.

jberkus commented 5 years ago

Not completed despite the merge of #752, mostly because there are still some unanswered issues:

alejandrox1 commented 5 years ago

Will start with #775.
/assign

alejandrox1 commented 4 years ago

/remove-help
/milestone v1.17

guineveresaenger commented 4 years ago

Closing in favor of #773, #774, #775.

Thanks everyone!