kubernetes-sigs / prow

Prow is a Kubernetes-based CI/CD system developed to serve the Kubernetes community. This repository contains the Prow source code and the Hugo sources for the Prow documentation site.
https://docs.prow.k8s.io
Apache License 2.0

Config to automatically Re-trigger failed periodics #268

Open smg247 opened 1 month ago

smg247 commented 1 month ago

OpenShift has certain infra-related periodics that run daily (or at a similar frequency). They run this often only because the jobs are sometimes flaky and the next run will pass. The frequency could be reduced to weekly if there were a guarantee that a failed job would be retried a number of times.

A new config could be added to support automatically re-triggering a periodic ProwJob only when it fails. It would accept the number of times to retry and the interval between re-runs. Something like the following would retrigger a failed job 3 times, 6 hours apart:

retrigger-failed-run:
  attempts: 3
  interval: 6h

Implementation details: I think that it may be possible for plank to handle the retriggers.

smg247 commented 1 month ago

/cc @stbenjam

stbenjam commented 1 month ago

Thanks, this is great!

The only question I have is whether, when the job fails, we should unconditionally run it 3 more times or only re-run it until it succeeds. The former would help offset the bad signal and give us more confidence in the job's reliability.

Maybe configurable?

retrigger-failed-run:
  strategy: until_success  # or run_all
  attempts: 3
  interval: 6h

/cc @deads2k
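
The difference between the `until_success` and `run_all` strategies can be sketched as a single decision function. This is only an illustration under assumptions: the `Strategy` type and `ShouldRetrigger` function are hypothetical, and the sketch assumes the series has already been started by a failure of the original run.

```go
package main

import "fmt"

// Strategy is a hypothetical type for the proposed strategy field.
type Strategy string

const (
	UntilSuccess Strategy = "until_success" // stop at the first passing re-run
	RunAll       Strategy = "run_all"       // always use every attempt
)

// ShouldRetrigger reports whether another re-run should be scheduled.
// attemptsDone counts re-runs already completed (not the original run);
// lastPassed is the result of the most recent run in the series.
func ShouldRetrigger(s Strategy, maxAttempts, attemptsDone int, lastPassed bool) bool {
	if attemptsDone >= maxAttempts {
		return false // attempts exhausted under either strategy
	}
	switch s {
	case UntilSuccess:
		return !lastPassed
	case RunAll:
		return true
	default:
		return false // unknown strategy: do nothing
	}
}

func main() {
	// The first re-run passed, and up to 3 attempts are allowed.
	fmt.Println(ShouldRetrigger(UntilSuccess, 3, 1, true)) // false
	fmt.Println(ShouldRetrigger(RunAll, 3, 1, true))       // true
}
```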

deads2k commented 1 month ago

Configurable would be good.

  1. Sometimes we definitely want three more runs.
  2. Sometimes we want to run up to three more times until one succeeds.

petr-muller commented 1 month ago

Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

> I think that it may be possible for plank to handle the retriggers

Almost certainly not plank. Plank consumes Prowjobs and should not create them (unless you'd do the re-runs as additional Pods for a single Prowjob, for which we would need to rethink big parts of e.g. artifact reporting). I believe this belongs in horologium, especially with the interval: 6h config. We'd probably need some horologium-specific annotations on Prowjobs to recognize their position in a retest series and to prevent each subsequent failure from causing a new round of retests.
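
One way the annotation idea could work is sketched below. This is not existing Prow behavior: the annotation key and the `NextAttempt` function are hypothetical, and the sketch assumes a job without the annotation is a regular interval-triggered run while a retest carries its 1-based position in the series.

```go
package main

import (
	"fmt"
	"strconv"
)

// attemptAnnotation is a hypothetical key; it is not defined by Prow.
const attemptAnnotation = "example.invalid/retrigger-attempt"

// NextAttempt inspects the annotations of a failed periodic and returns
// the attempt number the next retest should carry, plus whether a
// retest should be created at all.
func NextAttempt(annotations map[string]string, maxAttempts int) (int, bool) {
	current := 0 // no annotation: a regular interval-triggered run
	if v, ok := annotations[attemptAnnotation]; ok {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return 0, false // malformed annotation: do not extend the series
		}
		current = n
	}
	if current >= maxAttempts {
		// The series is exhausted. A failing retest must not start a
		// new round of retests.
		return 0, false
	}
	return current + 1, true
}

func main() {
	fmt.Println(NextAttempt(nil, 3))                                     // 1 true
	fmt.Println(NextAttempt(map[string]string{attemptAnnotation: "3"}, 3)) // 0 false
}
```

Because the position travels with each Prowjob, a failure of the last retest ends the series instead of restarting it.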

There are more fun interactions to resolve, like how do the retests interact with standard interval-triggered periodics? Would they delay them?

stbenjam commented 1 month ago

> Retest until success seems universally useful but I'm not a fan of having logic where a single passing job is good but a single failing job results in multiple passing jobs later.

For infrequently run jobs, we still want to be able to detect subtler regressions. If a developer makes an existing test go from a 99% pass rate to 50%, we'll eventually get a failure on weekly runs -- that's our first hint that something is wrong, but we need more than 1 additional attempt to confirm it. We'd get that confirmation eventually, but it could take a month or more. The unconditional attempts act as a signal booster.

petr-muller commented 1 month ago

Okay, that makes sense, I see the value now :+1: It helps to amplify subtle decreases in reliability while saving resources because jobs that we think are solid may not need to run as often.