Open · shonfeder opened 4 months ago
I propose renaming "error" to "internal error" or "internal CI error". And once a job reaches this build status, maybe an automatic restart should be scheduled (with an exponential backoff, of course, to avoid CI going crazy)? The same could be done for "ocaml-ci".
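A minimal sketch of what such a backoff schedule could look like (the `delay_for_attempt` helper and the constants are hypothetical, not anything that exists in opam-repo-ci today):

```ocaml
(* Hypothetical sketch: compute the delay before the nth automatic restart
   of a job that ended in an internal CI error. Exponential backoff with a
   cap and a retry limit keeps the CI from "going crazy" if the error
   persists. *)
let base_delay = 60.            (* seconds before the first retry *)
let max_delay = 6. *. 3600.     (* never wait more than six hours *)
let max_retries = 5             (* give up after this many attempts *)

let delay_for_attempt attempt =
  if attempt >= max_retries then None
  else Some (min max_delay (base_delay *. (2. ** float_of_int attempt)))

(* delay_for_attempt 0 = Some 60.; 1 = Some 120.; 2 = Some 240.; ... *)
```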
A CI pipeline can fail because a build or test step returns a negative result, or because of an error in the CI pipeline logic itself. We currently don't distinguish these outcomes in most cases. Ignoring this distinction has the following known downsides:

(1) Users see a plain failure reported on their PR even when the problem is an error in the CI itself, not in their change.
(2) We have no monitoring that surfaces errors in the pipeline, so infrastructure problems can go undetected.
(3) The job index records errors as ordinary failures, so we can't query or analyze them separately.
(2) bit us last week, when we failed to detect https://github.com/ocaml/infrastructure/issues/128 until it was so widespread that users were noticing the failures. If we had monitoring that alerted us to errors in the pipeline, we could have seen this coming much earlier. Incorporating an `error` status into the metrics sent to Grafana would make these failures clearly visible.

To address (1), we can send an error status to GitHub. The current API supports the following statuses: `error`, `failure`, `pending`, and `success`.
But, iiuc, we don't use the `error` status in our reporting:

https://github.com/ocurrent/opam-repo-ci/blob/97d42b7675b1de2400f167f9b13f5b04116cb541/service/github.ml#L21-L23
To address (2), we should send error results to Grafana:
https://github.com/ocurrent/opam-repo-ci/blob/97d42b7675b1de2400f167f9b13f5b04116cb541/service/metrics.ml#L34-L37
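A rough sketch of how an error state might be added to the per-state gauge that feeds Grafana, assuming the `prometheus` OCaml library used by ocurrent services; the metric name, labels, and `record_pipeline_states` helper below are hypothetical, not the code in `service/metrics.ml`:

```ocaml
(* Hypothetical sketch using the prometheus library: one gauge labelled by
   pipeline state, with "error" reported as its own series so Grafana can
   alert on it instead of lumping it in with ordinary failures. *)
open Prometheus

let state_total =
  Gauge.v_label
    ~label_name:"state"
    ~help:"Number of jobs per state"
    ~namespace:"opamrepoci" ~subsystem:"pipeline"
    "state_total"

let record_pipeline_states ~ok ~failed ~error =
  Gauge.set (state_total "ok") (float_of_int ok);
  Gauge.set (state_total "failed") (float_of_int failed);
  Gauge.set (state_total "error") (float_of_int error)
```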
To address (3), we can start by recording errors in the job index, which doesn't currently differentiate failures from errors:
https://github.com/ocurrent/opam-repo-ci/blob/97d42b7675b1de2400f167f9b13f5b04116cb541/lib/index.ml#L114
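To make that concrete, a hypothetical sketch of what a distinct error outcome in the index might look like; the `outcome` type and string round-trip below are illustrative only, the real record/query code is in the `lib/index.ml` line linked above.

```ocaml
(* Hypothetical sketch: give errors their own outcome in the job index
   instead of collapsing them into "failed". *)
type outcome =
  | NotStarted
  | Passed
  | Failed
  | Error   (* internal CI error, not a negative build/test result *)

(* Stable strings for persisting the outcome in the index database. *)
let string_of_outcome = function
  | NotStarted -> "not_started"
  | Passed -> "passed"
  | Failed -> "failed"
  | Error -> "error"

let outcome_of_string = function
  | "not_started" -> NotStarted
  | "passed" -> Passed
  | "failed" -> Failed
  | "error" -> Error
  | s -> invalid_arg ("unknown job outcome: " ^ s)
```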