ocurrent / ocaml-ci

A CI for OCaml projects
https://ocaml.ci.dev

Local Opam Vars Job Causing Outages #947

Open mtelvers opened 1 month ago

mtelvers commented 1 month ago

OCaml-CI runs some jobs locally before it begins submitting jobs to the cluster. First, each base image is pulled (~60 images), and then the opam-vars and opam-vars (lower-bound) jobs are run on each image. On a new installation, all of the pulls and the subsequent jobs run concurrently. The data gathered by opam-vars is relatively static and could be, and in some cases already has been, hard-coded into OCaml-CI. As the jobs all run locally, they only run on the OCaml-CI host architecture (currently AMD64), and other platforms are assumed to give the same results.

https://github.com/ocurrent/ocaml-ci/blob/047de926d5b3697b218655b52b58406475f633c8/lib/platform.ml#L153-L166

The jobs are rebuilt every 30 days.

https://github.com/ocurrent/ocaml-ci/blob/047de926d5b3697b218655b52b58406475f633c8/service/conf.ml#L254

Disk space on the OCaml-CI machine is finite. A cron job runs every hour, deleting the oldest log data and maintaining the volume at 90% capacity. Cron also runs a docker system prune to clear old images. However, when a large number of jobs rebuild simultaneously, the machine can run out of space (see https://github.com/ocurrent/ocaml-ci/issues/946).
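To make the failure mode concrete, here is a rough sketch of the kind of "delete the oldest logs until the volume is back under 90%" loop described above. It is not the actual cron script: the function names, the flat log directory and the injectable `used_fraction` callback are all assumptions made for illustration.

```python
import os
import shutil


def disk_used_fraction(path):
    """Fraction of the volume containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def prune_oldest(log_dir, used_fraction, threshold=0.90):
    """Delete the oldest files in log_dir until used_fraction() <= threshold.

    `used_fraction` is a zero-argument callable (hypothetical, so the check
    can be swapped out); returns the names of the files that were removed.
    """
    files = sorted(
        (entry for entry in os.scandir(log_dir) if entry.is_file()),
        key=lambda entry: entry.stat().st_mtime,  # oldest first
    )
    removed = []
    for entry in files:
        if used_fraction() <= threshold:
            break
        os.remove(entry.path)
        removed.append(entry.name)
    return removed
```

A call such as `prune_oldest(log_dir, lambda: disk_used_fraction(log_dir))` only frees space when it is invoked. With an hourly schedule, a burst of concurrent pulls and rebuilds between two runs can fill the remaining 10% before the next prune happens.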

Options:

  1. Hardcode these data as they change very slowly and could already be up to 30 days out of date;
  2. Submit the jobs to OCluster, thus avoiding the space and capacity issues; or
  3. Update OCaml-CI to manage the number of concurrent jobs.
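As a rough illustration of option 3, a semaphore can cap how many local jobs run at once. This is a language-neutral sketch written in Python, not OCaml-CI code: `run_bounded`, `fake_opam_vars`, the image names and the limit of 4 are all hypothetical.

```python
import asyncio

MAX_CONCURRENT = 4  # hypothetical limit; not a value taken from OCaml-CI


async def run_bounded(images, job, limit=MAX_CONCURRENT):
    """Run job(image) for every image, with at most `limit` running at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(image):
        # Each job waits here until one of the `limit` slots is free.
        async with sem:
            return await job(image)

    # gather returns results in the same order as the input images.
    return await asyncio.gather(*(guarded(image) for image in images))


async def fake_opam_vars(image):
    # Stand-in for pulling the image and running opam-vars on it.
    await asyncio.sleep(0.01)
    return f"vars for {image}"


if __name__ == "__main__":
    images = [f"image-{n}" for n in range(60)]
    results = asyncio.run(run_bounded(images, fake_opam_vars))
    print(len(results))
```

All 60 jobs are still scheduled up front, but only `limit` of them hold a slot at any moment, so peak disk and CPU use is bounded by the limit rather than by the number of base images.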

shonfeder commented 1 month ago

Were you able to get a sense for what these local jobs do? What is the result of the runs used for?

tmcgilchrist commented 1 week ago

Ideally you want to do option 2, which could remove the hardcoding of the other platforms. That needs some changes to OCluster so that you can return the results of running the commands on a cluster worker.