Open smg247 opened 2 months ago
/cc @stbenjam
Thanks, this is great!
The only question I have is whether, if the job fails, we should unconditionally run it 3 more times or only rerun it until it succeeds. The former would help offset the bad signal and give us more confidence in the job's reliability.
Maybe configurable?
retrigger-failed-run:
  strategy: until_success | run_all
  attempts: 3
  interval: 6h
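To make the difference between the two strategies concrete, here is a rough sketch of the decision each one implies. This is not Prow code: `RetriggerConfig` and `should_retrigger` are hypothetical names, and the semantics (the check runs after the initial failure and again after every retry) are an assumption about how the proposal would behave.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RetriggerConfig:
    """Hypothetical in-memory form of the proposed retrigger-failed-run stanza."""
    strategy: str        # "until_success" or "run_all"
    attempts: int        # maximum number of retries after the initial failure
    interval: timedelta  # spacing between retries


def should_retrigger(cfg: RetriggerConfig, retries_done: int, last_run_passed: bool) -> bool:
    """Decide whether to schedule another retry.

    Assumed to be called once after the initial failed run (retries_done == 0)
    and once after each completed retry.
    """
    if retries_done >= cfg.attempts:
        return False  # retry budget exhausted
    if cfg.strategy == "until_success":
        return not last_run_passed  # stop at the first pass
    if cfg.strategy == "run_all":
        return True  # always use the whole budget
    raise ValueError(f"unknown strategy: {cfg.strategy!r}")


until_success = RetriggerConfig("until_success", attempts=3, interval=timedelta(hours=6))
run_all = RetriggerConfig("run_all", attempts=3, interval=timedelta(hours=6))
```

With `until_success`, a passing first retry ends the series. With `run_all`, the same pass still leaves two more runs to schedule.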
/cc @deads2k
Configurable would be good.
Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.
> I think that it may be possible for plank to handle the retriggers
Almost certainly not plank. Plank consumes Prowjobs and should not create them (unless you'd do the re-runs as additional Pods for a single Prowjob, for which we would need to rethink big parts of e.g. artifact reporting). I believe this belongs to horologium, especially with the interval: 6h config. We'd probably need some horologium-specific annotations on Prowjobs to recognize their position in a retest series and prevent each subsequent failure from causing a new round of retests.
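As a rough illustration of the annotation idea, the sketch below tracks a job's position in a retest series. It is a sketch only: the annotation keys and the `next_retry_annotations` helper are hypothetical, not existing Prow annotations or APIs. A failed retry continues its existing series instead of starting a new one.

```python
from typing import Dict, Optional

# Hypothetical annotation keys, made up for this sketch; not real Prow annotations.
SERIES_KEY = "example.invalid/retest-series"
ATTEMPT_KEY = "example.invalid/retest-attempt"


def next_retry_annotations(
    failed_job_name: str, annotations: Dict[str, str], max_attempts: int
) -> Optional[Dict[str, str]]:
    """Return the annotations for the next retry, or None if the series is exhausted.

    A job without series annotations is treated as the original scheduled run,
    so its failure starts a new series named after that job.
    """
    series = annotations.get(SERIES_KEY, failed_job_name)
    attempt = int(annotations.get(ATTEMPT_KEY, "0"))
    if attempt >= max_attempts:
        return None  # last retry of the series failed; do not start another round
    return {SERIES_KEY: series, ATTEMPT_KEY: str(attempt + 1)}


first = next_retry_annotations("periodic-abc", {}, max_attempts=3)
second = next_retry_annotations("periodic-def", first, max_attempts=3)
third = next_retry_annotations("periodic-ghi", second, max_attempts=3)
exhausted = next_retry_annotations("periodic-jkl", third, max_attempts=3)
```

Because the attempt counter travels with the job, the failure of the third retry yields `None` rather than another three retries.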
There are more fun interactions to resolve, like how do the retests interact with standard interval-triggered periodics? Would they delay them?
> Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.
For infrequently run jobs, we still want to be able to detect subtler regressions. If a developer makes an existing test go from a 99% pass rate to 50%, we'll eventually get a failure on weekly runs -- that's our first hint there's something wrong, but we need more than 1 additional attempt to confirm. We'd get it eventually but it could take a month+. The unconditional attempts are a signal booster.
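A quick back-of-the-envelope check of the signal-booster argument, assuming runs are independent and each passes with a fixed probability (a simplification; real flakes are often correlated). The function name is made up for this sketch.

```python
def p_additional_failure(pass_rate: float, extra_runs: int) -> float:
    """Probability that at least one of `extra_runs` independent runs fails."""
    return 1.0 - pass_rate ** extra_runs


for pass_rate in (0.99, 0.50):
    for extra_runs in (1, 3):
        p = p_additional_failure(pass_rate, extra_runs)
        print(f"pass rate {pass_rate:.2f}, {extra_runs} extra run(s): "
              f"P(at least one more failure) = {p:.3f}")
```

Under these assumptions, three unconditional reruns of a job that has regressed to 50% show at least one more failure with probability 0.875, versus roughly 0.03 for a healthy 99% job. That gap is what makes the reruns useful as confirmation.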
Okay, that makes sense, I see the value now :+1: It helps to amplify subtle decreases in reliability while saving resources because jobs that we think are solid may not need to run as often.
OpenShift has certain infra-related periodics that run on a daily (or similar) frequency. They run this often only because the jobs are sometimes flaky, and the subsequent run will pass. The frequency could be reduced to weekly if there was a guarantee that the job would be retried a number of times if it fails.
A new config could be added to support automatically re-triggering a periodic ProwJob only in the case that it fails. It would accept the number of times to retry, and the interval at which to trigger the re-run. Something like the following to retrigger a failed job 3 times, 6 hours apart:
Implementation details: I think that it may be possible for plank to handle the retriggers.