ocurrent / opam-repo-ci

An OCurrent pipeline for testing submissions to opam-repository
Apache License 2.0

Distinguish failing CI runs from CI errors #328

Open shonfeder opened 3 months ago

shonfeder commented 3 months ago

A CI pipeline can fail because a build or test step returns a negative result, or because of some error in the CI pipeline logic. We currently don't distinguish these outcomes in most cases. Ignoring this distinction has the following known downsides:

  1. Users see job failures, and don't learn that the CI is suffering from internal errors until they inspect the logs.
  2. While we collect metrics on the number of failed CI jobs, we cannot differentiate these from errors that would indicate sporadic or pervasive failures in the infrastructure and services.
  3. We have no way of restarting jobs that failed due to an infrastructure error after that error has been repaired.

(2) bit us last week, when we failed to detect https://github.com/ocaml/infrastructure/issues/128 until it was so widespread that users were noticing the failures. If we had monitoring that alerted us to errors in the pipeline, we could have seen this coming much earlier. Incorporating an error status into the metrics sent to Grafana would make these failures clearly visible.

To address (1), we can send an error status to GitHub. The current API supports the following statuses:

error, failure, pending, success

But, iiuc, we don't use the error status in our reporting:

https://github.com/ocurrent/opam-repo-ci/blob/97d42b7675b1de2400f167f9b13f5b04116cb541/service/github.ml#L21-L23
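As a rough illustration of the mapping proposed for (1), here is a minimal sketch. The `outcome` type and the `github_state` function are hypothetical names made up for this example and are not taken from `service/github.ml`; only the four state strings come from the GitHub commit status API.

```ocaml
(* Hypothetical outcome type; the names are illustrative and are not
   taken from service/github.ml. *)
type outcome =
  | Passed
  | Failed of string          (* a build or test step returned a negative result *)
  | Internal_error of string  (* the CI pipeline or infrastructure broke *)
  | Running

(* Map an outcome to one of the four commit status states GitHub accepts. *)
let github_state = function
  | Passed -> "success"
  | Failed _ -> "failure"
  | Internal_error _ -> "error"
  | Running -> "pending"

let () =
  [ Passed; Failed "tests failed"; Internal_error "worker lost"; Running ]
  |> List.iter (fun o -> print_endline (github_state o))
```

Keeping the failure/error split in the type means the compiler flags any reporting code that forgets to handle internal errors separately.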

To address (2), we should send error results to Grafana:

https://github.com/ocurrent/opam-repo-ci/blob/97d42b7675b1de2400f167f9b13f5b04116cb541/service/metrics.ml#L34-L37
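To make the idea for (2) concrete, here is a small sketch of counting errors under their own label. It assumes nothing about the existing `service/metrics.ml`: the `Hashtbl` is only a stand-in for whatever labelled counter the metrics library provides, and the label names are made up.

```ocaml
(* Stand-in for a labelled counter; a real implementation would use the
   project's metrics library instead of a Hashtbl. *)
let counts : (string, int) Hashtbl.t = Hashtbl.create 4

(* Current value of the counter for [label], 0 if never incremented. *)
let count label = Option.value (Hashtbl.find_opt counts label) ~default:0

(* Increment the counter for [label]. *)
let incr_status label = Hashtbl.replace counts label (count label + 1)

let () =
  (* Hypothetical labels: "errored" is tracked separately from "failed". *)
  List.iter incr_status [ "failed"; "errored"; "passed"; "errored" ];
  Printf.printf "failed=%d errored=%d\n" (count "failed") (count "errored")
```

With a separate "errored" series, an alert can fire on a spike in internal errors without being drowned out by ordinary build failures.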

To address (3), we can start by recording errors separately from failures in the job index, which doesn't currently differentiate the two:

https://github.com/ocurrent/opam-repo-ci/blob/97d42b7675b1de2400f167f9b13f5b04116cb541/lib/index.ml#L114

hannesm commented 3 months ago

I propose to rename "error" to "internal error" or "internal CI error". Once a job reaches this status, maybe an automatic restart of the job should be scheduled (with an exponential backoff, to keep the CI from going crazy)? The same could be done for "ocaml-ci".
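A possible shape for the restart policy suggested above, as a sketch only: the function names, the 60-second base delay, the one-hour cap, and the attempt limit are all assumptions chosen for illustration, not values from opam-repo-ci.

```ocaml
(* Delay in seconds before retry number [attempt] (0-based):
   base * 2^attempt, capped at [max_delay]. All values are illustrative. *)
let backoff_delay ?(base = 60.) ?(max_delay = 3600.) attempt =
  Float.min max_delay (base *. (2. ** float_of_int attempt))

(* Only internal errors are retried, and only while fewer than
   [max_attempts] retries have been made. Ordinary failures are never
   retried. *)
let next_retry ~max_attempts ~attempt ~is_internal_error =
  if is_internal_error && attempt < max_attempts
  then Some (backoff_delay attempt)
  else None

let () =
  List.iter
    (fun attempt ->
      match next_retry ~max_attempts:5 ~attempt ~is_internal_error:true with
      | Some d -> Printf.printf "attempt %d: retry in %.0fs\n" attempt d
      | None -> Printf.printf "attempt %d: give up\n" attempt)
    [ 0; 1; 2; 5 ]
```

Capping both the delay and the number of attempts keeps a persistent infrastructure outage from generating an unbounded stream of restarts.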