oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

worker agent on linux might be getting oomkilled when it's the job's fault instead #64

Open iliana opened 1 week ago

iliana commented 1 week ago

I think I'm seeing a "worker agent experienced a fatal error; aborting job" error on this run because the worker agent is getting oomkilled on Linux:

https://buildomat.eng.oxide.computer/wg/0/details/01JA5Z0YAABH97EZSWA21ZYMM7/dOyWW4nBzXdj1VMvNjVicWxjHQqHQlHElpGIaVzF4AojHD2t/01JA5Z1D0C5BGA3FKYBRVB60Q9

This is occurring after TestLint/TestErrCheck, which runs a command that completely exhausts the 32 GB of RAM on a machine I'm debugging this on. I wonder if the entire cgroup is getting axed and not just the underlying job process.

Does the agent set its oomkiller priority at all? (I think there's like three ways to do this now on Linux because of course there is.)

jclulow commented 1 week ago

It does not, but I would be happy to make use of whatever cgroup/oomkiller APIs make sense for a control agent that should absolutely not die!

jclulow commented 1 week ago

@iliana I think what I would like to have is:

Is there a good API for listening to OOM kill events or am I going to have to tail a log or have a journalctl child or something 😅
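One option that avoids tailing logs, assuming the job runs in its own cgroup v2 directory: the kernel keeps an `oom_kill` counter in that cgroup's `memory.events` file, and changes to the file produce a file-modified notification that can be watched with poll or inotify. The following is a minimal sketch of reading that counter; the function names are hypothetical and none of this is existing buildomat code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Extract the `oom_kill` counter from the text of a cgroup v2
/// `memory.events` file. Each line has the form "<key> <value>".
fn parse_oom_kill_count(events: &str) -> Option<u64> {
    events.lines().find_map(|line| {
        let mut fields = line.split_whitespace();
        match (fields.next(), fields.next()) {
            // Match the key exactly so that "oom" and "oom_group_kill"
            // are not mistaken for "oom_kill".
            (Some("oom_kill"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}

/// Read the counter for a given cgroup directory (hypothetical helper).
/// Returns Ok(None) if the file has no `oom_kill` line.
fn read_oom_kill_count(cgroup_dir: &Path) -> io::Result<Option<u64>> {
    let text = fs::read_to_string(cgroup_dir.join("memory.events"))?;
    Ok(parse_oom_kill_count(&text))
}
```

The agent could compare the counter before and after a job exits, and report "killed by the OOM killer" if it went up. Polling the file on job exit is probably enough for attribution; watching it live is only needed for real-time reporting.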

iliana commented 1 week ago

I have never looked into this beyond the small bit of log message that lives in my head where OpenSSH tells you it's setting its oom_score_adj at startup. (In the case of OpenSSH, obviously it's doing something to make sure the shells it spawns as users that are logging in are not inheriting that oom_score_adj.)
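For illustration, the oom_score_adj approach could look roughly like the sketch below. The knob lives at `/proc/<pid>/oom_score_adj`, accepts values from -1000 (never pick this process) to 1000 (pick it first), and is inherited by children, which is why the job needs its own value. The helper names are hypothetical, and the sketch assumes the agent has the privilege to lower its own score (CAP_SYS_RESOURCE, typically root).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

const OOM_SCORE_ADJ_MIN: i32 = -1000;
const OOM_SCORE_ADJ_MAX: i32 = 1000;

/// Path of the oom_score_adj knob for a given process ID.
fn oom_score_adj_path(pid: u32) -> PathBuf {
    PathBuf::from(format!("/proc/{pid}/oom_score_adj"))
}

/// Write an adjustment to the given file, clamped to the range the
/// kernel accepts. The path is a parameter so this can be exercised
/// against an ordinary file.
fn write_oom_score_adj(path: &Path, value: i32) -> io::Result<()> {
    let clamped = value.clamp(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX);
    fs::write(path, format!("{clamped}\n"))
}

/// Hypothetical agent start-up step: exempt the agent itself.
fn protect_agent() -> io::Result<()> {
    write_oom_score_adj(&oom_score_adj_path(std::process::id()), OOM_SCORE_ADJ_MIN)
}

/// Hypothetical step right after spawning the job: make it the
/// preferred victim. Note the small race here: until this write lands,
/// the job (and anything it forks) still carries the agent's value.
fn expose_job(job_pid: u32) -> io::Result<()> {
    write_oom_score_adj(&oom_score_adj_path(job_pid), OOM_SCORE_ADJ_MAX)
}
```

Raising the value never needs privilege, so resetting it for the job works even if the job runs as an unprivileged user. Closing the race would mean setting the value in the child between fork and exec instead.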

I assume the "right" way to do this would be for the agent to create a new cgroup and run the program inside the cgroup, configuring it to be the first to go when the RAM runs out. What is actually done beyond that point to understand why the process was killed is not something I'm immediately aware of.
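As a rough sketch of the cgroup idea, assuming cgroup v2 with the memory controller delegated to a parent cgroup the agent can write to: create a child cgroup for the job, cap it with `memory.max`, set `memory.oom.group` so the whole job is killed as a unit, and move the job's PID in through `cgroup.procs`. Once the job hits the cap, the OOM killer picks victims inside that cgroup rather than the agent. The function names, the parent path, and the limit are all hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create a child cgroup under `parent` with a hard memory limit.
/// On a real system `parent` would be somewhere under /sys/fs/cgroup
/// with the memory controller enabled in its cgroup.subtree_control.
fn create_job_cgroup(
    parent: &Path,
    name: &str,
    memory_max_bytes: u64,
) -> io::Result<PathBuf> {
    let dir = parent.join(name);
    fs::create_dir(&dir)?;
    // Hard limit: exceeding it triggers the OOM killer in this cgroup.
    fs::write(dir.join("memory.max"), format!("{memory_max_bytes}\n"))?;
    // Kill every process in the cgroup together rather than one at a time.
    fs::write(dir.join("memory.oom.group"), "1\n")?;
    Ok(dir)
}

/// Move a process into the cgroup. Children it forks afterwards
/// stay in the same cgroup.
fn move_into_cgroup(cgroup_dir: &Path, pid: u32) -> io::Result<()> {
    fs::write(cgroup_dir.join("cgroup.procs"), format!("{pid}\n"))
}
```

The limit would need to sit somewhat below the machine's total RAM so the agent keeps some headroom. How much is a tuning question this sketch does not answer.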

There's also systemd-oomd. I'm not sure how recent it is (relevant for the Ubuntu images older than 24.04), but it apparently exists because the Linux kernel's oomkiller logic leaves a lot to be desired and can't really dynamically take things into account beyond the one knob of oom_score_adj; one knob does not a policy make.