milahu opened 2 years ago
bug: tokens are lost when workers crash
when workers crash without proper cleanup code, they will not release their tokens, and other jobs can hang forever waiting for tokens
workaround: add proxy jobservers
┌───────────┐
│ make -j10 │
│ fds 3,4 │
└──────┬────┘
│ ▲
▼ │
┌────────┴──┐
│ fds 3,4 │
│ child_1 │
│ fds 5,6 │
└──────┬────┘
│ ▲
▼ │
┌────────┴──┐
│ fds 5,6 │
│ child_2 │
└───────────┘
child_2 acquires tokens through child_1 from make. child_2 crashes and fails to release its tokens back to child_1. child_1 sees that child_2 is done, so child_1 will release the tokens back to make
fds 3,4 vs fds 5,6 is just an example
the connection between child_1 and child_2 can also be over fds 3,4
but the actual fd pipes are different from the pipes between make and child_1
we see the "actual fd pipes" in
ls -lv /proc/self/fd/
lrwx------. 1 1026 100 64 9 nov 18.46 0 -> /dev/pts/1
lrwx------. 1 1026 100 64 9 nov 18.46 1 -> /dev/pts/1
lrwx------. 1 1026 100 64 9 nov 18.46 2 -> /dev/pts/1
lr-x------. 1 1026 100 64 9 nov 18.46 3 -> 'pipe:[1206762]'
l-wx------. 1 1026 100 64 9 nov 18.46 4 -> 'pipe:[1206762]'
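a rough sketch in python of such a monitoring proxy, under several simplifying assumptions: pipe-style --jobserver-auth / --jobserver-fds in MAKEFLAGS, a single child, a greedy transfer policy that may briefly hold one token away from sibling jobs, and races between the pump thread and the final drain are ignored

```python
#!/usr/bin/env python3
# sketch of a token-accounting proxy, run as: proxy.py <command> [args...]
import os, re, select, subprocess, sys, threading, time

# upstream jobserver fds, advertised by make in MAKEFLAGS
# (--jobserver-auth=R,W on make >= 4.2, --jobserver-fds=R,W before)
makeflags = os.environ.get("MAKEFLAGS", "")
m = re.search(r"--jobserver-(?:auth|fds)=(\d+),(\d+)", makeflags)
if not m:
    sys.exit("no jobserver advertised in MAKEFLAGS")
up_r, up_w = int(m.group(1)), int(m.group(2))

down_r, down_w = os.pipe()            # fresh pipe pair for the child
os.set_inheritable(down_r, True)
os.set_inheritable(down_w, True)

transferred = 0                       # tokens moved downstream so far

def pump():
    global transferred
    while True:
        # keep at most one unconsumed token downstream, so we do not
        # hoard tokens away from sibling jobs more than necessary
        while select.select([down_r], [], [], 0)[0]:
            time.sleep(0.05)
        token = os.read(up_r, 1)      # blocks until upstream has a token
        os.write(down_w, token)
        transferred += 1

threading.Thread(target=pump, daemon=True).start()

env = dict(os.environ)
env["MAKEFLAGS"] = re.sub(r"--jobserver-(?:auth|fds)=\d+,\d+",
                          "--jobserver-auth=%d,%d" % (down_r, down_w),
                          makeflags)
child = subprocess.Popen(sys.argv[1:], env=env, close_fds=False)
child.wait()

# drain tokens still sitting in the downstream pipe, then give back
# upstream exactly as many tokens as were ever transferred down --
# tokens a crashed child took with it are regenerated, not lost
while select.select([down_r], [], [], 0)[0]:
    os.read(down_r, 1)
os.write(up_w, b"+" * transferred)    # '+' is make's default token byte
sys.exit(child.returncode)
```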
see also
https://github.com/ocaml/opam/wiki/Spec-for-GNU-make-jobserver-support
The GNU jobserver protocol has a serious shortcoming: tokens must be returned by child processes, but the jobserver does not know which processes have tokens at any given point. Tokens (and parallelism) can therefore be lost to a malfunctioning client or - worse - if a sub-process is killed. The benefit of the approach used, however, is its simplicity both in terms of implementation and resources.
This spec proposes ... an extended protocol for the jobserver which aims to deal with the GNU make shortcoming.
The jobserver itself is a named semaphore on Windows and a pair of pipes on Unix.
The Unix implementation is most easily done with a separate thread which simply selects for returned tokens and immediately writes them to the other pipe to be available for other processes.
When opam is a client of another jobserver, care is required to ensure that opam returns all tokens it obtains (even on error).
Extended jobserver
For compatibility, MAKEFLAGS should be set as normal, but for a command tagged {jobserver}, opam will also communicate via OPAMJOBSERVER the fds (or HANDLEs, on Windows) of another pair of pipes. These pipes are specific to the process.
When the child process wants a token, it writes a single + to the write fd and attempts to read a character from the read fd. The jobserver responds to this by acquiring a token from the main job server (via either WaitForSingleObject or by reading the jobserver read fd) and then writes a dummy token back to the client.
When the client wishes to return the token, it writes a single - to the write fd which the jobserver uses as a signal to return one of the tokens it is holding on behalf of that client to the main jobserver.
This approach increases the complexity of the jobserver in opam, since a pair of fds is required for each process. There is a risk of fd exhaustion, but only on extremely high core count machines - we can hope that the maximum number of fds will increase as these systems become more commonplace, rather than having to increase the complexity of the jobserver further to have sub-processes managing the token pools!
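the client side of that extended protocol is tiny. a sketch, with the fd numbers hard-coded because the exact encoding of OPAMJOBSERVER is not specified above; because the jobserver holds the real tokens on each client's behalf, it always knows how many a dead client was holding and can return them

```python
import os

WFD, RFD = 5, 6   # hypothetical per-process pipe fds from OPAMJOBSERVER

def acquire():
    os.write(WFD, b"+")      # ask the jobserver for a token
    return os.read(RFD, 1)   # block until it grants a dummy token

def release():
    os.write(WFD, b"-")      # jobserver returns one held token upstream
```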
I still believe it's more elegant to solve this with a central TCP tokenpool server, as TCP works on all systems, and we don't need "a million fds" ("a pair of fds is required for each process")
simple solution: stop playing a "zero sum game" (expecting all acquired tokens to be released); instead, allow tokens to go missing, and when cpu load drops, add new tokens to the "game"
long discussion of problems with the jobserver protocol https://github.com/ocaml/dune/pull/4331
We created our own build jobserver implementation at my day job, using Unix datagram sockets. Unfortunately Windows doesn't support those - that doesn't affect us (we don't compile on Windows), but it would be nice if the mechanism worked the same way on all platforms. Unfortunately using TCP connections is more complexity and overhead.
And we didn't do it as just a "token-grabbing" mechanism. Instead, the client requests N slots, and the server responds with how many it can actually use. And the client's request has more info than just the N - it also specifies the type of job it is and the target name, so the jobserver can make better scheduling decisions, because in practice not all jobs are equal. And we specified how OOM works as well, so we can kill+delay jobs during OOM state without failing the overall build. I think that part is key, because memory is just as much a limited resource as CPU or I/O.
BTW, we do also scan for dead processes that didn't release their jobs so we can reclaim their allocated resources, but to be honest I'm not sure it really matters... if a job failed in such a way that it didn't return the resources, then the overall build will fail anyway.
in my case (chromium, qtwebengine) the main build process (ninja) hangs when workers crash
ninja tolerates crashing workers and fails later when output files are missing. but when ninja runs out of tokens, it cannot start any further workers, and the build hangs forever at zero cpu load
hence ...
when cpu load drops, add new tokens to the "game"
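a sketch of that idea, assuming unix (os.getloadavg) and write access to the jobserver pipe. the threshold, poll interval, and cap are arbitrary, and on I/O-bound builds low cpu load does not necessarily mean tokens were lost

```python
import os, time

def replenish(jobserver_wfd, low_load=1.0, max_extra=8):
    added = 0
    while added < max_extra:                # cap runaway over-subscription
        time.sleep(10)
        if os.getloadavg()[0] < low_load:   # build looks starved
            os.write(jobserver_wfd, b"+")   # inject a replacement token
            added += 1
```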
gnumake jobserver version 1
https://www.gnu.org/software/make/manual/html_node/Job-Slots.html https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html https://www.gnu.org/software/make/manual/html_node/Windows-Jobserver.html
jobserver + jobclient implementations
gnumake https://github.com/stefanb2/ninja/tree/topic-tokenpool-master
jobclient implementations
https://github.com/olsner/jobclient https://github.com/milahu/gnumake-jobclient-js https://github.com/milahu/gnumake-jobclient-py
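for reference, a minimal version-1 jobclient fits in a few lines of python (a sketch; error handling is omitted):

```python
import os, re

def jobserver_fds():
    # make >= 4.2 advertises --jobserver-auth=R,W in MAKEFLAGS;
    # older versions use --jobserver-fds=R,W
    m = re.search(r"--jobserver-(?:auth|fds)=(\d+),(\d+)",
                  os.environ.get("MAKEFLAGS", ""))
    if not m:
        raise RuntimeError("no jobserver advertised in MAKEFLAGS")
    return int(m.group(1)), int(m.group(2))

rfd, wfd = jobserver_fds()
token = os.read(rfd, 1)      # acquire: blocks until a token is free
try:
    pass                     # ... run one parallel job here ...
finally:
    os.write(wfd, token)     # release: write the same byte back
# the finally clause only runs on clean exits; a crashed worker never
# reaches it -- which is exactly the bug this issue is about
```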
history
the jobserver feature was added to gnumake in 1999 in this commit
when workers crash without proper cleanup code, they will not release their tokens, and other jobs can hang forever waiting for tokens
writing reliable cleanup code is hard. it's easier to write a jobserver that actively monitors its child processes: when a worker process crashes without releasing its tokens, the jobserver assumes those tokens are free
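a sketch of the reclaim step, assuming (unlike the v1 protocol) the jobserver records which pid holds how many tokens when it grants them; psutil.pid_exists is a cheap liveness check

```python
import psutil

def reclaim_lost_tokens(holders, free_tokens):
    # holders: pid -> number of tokens handed to that worker; the v1
    # protocol provides no such bookkeeping, so this assumes a
    # jobserver that tracks it itself
    for pid in list(holders):
        if not psutil.pid_exists(pid):          # worker died
            free_tokens += holders.pop(pid)     # reclaim its tokens
    return free_tokens
```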
limitation: child processes must inherit file descriptors
for example, this fails with python's subprocess.Popen, which by default (close_fds=True) does not let the child inherit fds above 2
the only requirement should be: inherit environment variables of the parent process (env vars like MAKEFLAGS)
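for illustration: on the python side the fix is a one-liner, explicitly whitelisting the jobserver fds (here assumed to be 3 and 4)

```python
import subprocess

# Popen defaults to close_fds=True, closing every fd above 2 in the
# child; pass_fds keeps the listed fds open and inheritable
subprocess.Popen(["ninja"], pass_fds=(3, 4))
# environment variables like MAKEFLAGS are inherited by default
```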
other job servers
aka job queues
https://github.com/gearman/gearmand
https://github.com/celery/celery
https://github.com/beanstalkd/beanstalkd
see also:
message passing
aka message queues
https://stackoverflow.com/questions/731233/activemq-or-rabbitmq-or-zeromq-or
prototype implementation
pymake: python implementation of gnu make http://benjamin.smedbergs.us/pymake/ https://github.com/securesystemslab/pkru-safe-mozjs/tree/main/mozjs/build/pymake
https://github.com/linuxlizard/pymake Parse GNU Makefiles with Python
psutil: get process tree https://psutil.readthedocs.io/en/latest/#processes https://github.com/milahu/nix-build-profiler (example use of psutil)
requirements
work on windows → no named pipes. must support TCP (or a more abstract message-passing system like rabbitmq)
authentication. the jobserver can require workers to auth with a session cookie. the session cookie is passed via env-var to the workers
sandboxing. work in limited environments, for example in a nix-build sandbox
active monitoring of the process tree. libraries like psutil allow this at almost-zero cost. needed to monitor "bad" clients: crashed clients who don't release their tokens, and clients who ignore the jobserver and produce 100% cpu load (see the sketch after this list)
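a minimal sketch of such a central TCP tokenpool covering the first three requirements, with hypothetical JOBSERVER_COOKIE / JOBSERVER_JOBS env vars and naive single-packet framing. tying each token to a connection makes crash reclamation automatic: when a worker dies, the kernel closes its socket, the server's recv returns, and the token goes back into the pool, with no process-tree scanning needed for the crash case

```python
import os, socket, socketserver, threading

COOKIE = os.environ.get("JOBSERVER_COOKIE", "secret")
TOKENS = threading.Semaphore(int(os.environ.get("JOBSERVER_JOBS", "8")))

class TokenHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # first message must be the session cookie
        if self.request.recv(64).strip() != COOKIE.encode():
            return                      # reject unauthenticated clients
        TOKENS.acquire()                # one token per connection
        try:
            self.request.sendall(b"+")  # grant the token
            self.request.recv(1)        # block until the peer closes
        finally:
            TOKENS.release()            # clean exit or crash: reclaimed

def acquire(addr=("127.0.0.1", 7777)):
    # client side: the token lives exactly as long as this socket
    s = socket.create_connection(addr)
    s.sendall(COOKIE.encode() + b"\n")
    s.recv(1)                           # blocks until a token is granted
    return s                            # close s (or crash) to release

class Server(socketserver.ThreadingTCPServer):
    allow_reuse_address = True

if __name__ == "__main__":
    Server(("127.0.0.1", 7777), TokenHandler).serve_forever()
```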
targets
build systems: gnumake, cmake, ninja, cargo, meson, bazel ...
related: Build Systems à la Carte