milahu opened 2 years ago
bug: tokens are lost when workers crash
when workers crash without proper cleanup code, they will not release their tokens, and other jobs can hang forever waiting for tokens
workaround: add proxy jobservers
┌───────────┐
│ make -j10 │
│ fds 3,4 │
└──────┬────┘
│ ▲
▼ │
┌────────┴──┐
│ fds 3,4 │
│ child_1 │
│ fds 5,6 │
└──────┬────┘
│ ▲
▼ │
┌────────┴──┐
│ fds 5,6 │
│ child_2 │
└───────────┘
child_2 acquires tokens through child_1 from make. child_2 crashes and fails to release its tokens back to child_1. child_1 sees that child_2 is done, so child_1 will release the tokens back to make
fds 3,4 vs fds 5,6 is just an example
the connection between child_1 and child_2 can also be over fds 3,4
but the actual fd pipes are different from the pipes between make and child_1
we see the "actual fd pipes" in
ls -lv /proc/self/fd/
lrwx------. 1 1026 100 64 9 nov 18.46 0 -> /dev/pts/1
lrwx------. 1 1026 100 64 9 nov 18.46 1 -> /dev/pts/1
lrwx------. 1 1026 100 64 9 nov 18.46 2 -> /dev/pts/1
lr-x------. 1 1026 100 64 9 nov 18.46 3 -> 'pipe:[1206762]'
l-wx------. 1 1026 100 64 9 nov 18.46 4 -> 'pipe:[1206762]'
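a rough sketch in python of such a monitoring proxy, under several simplifying assumptions: pipe-style --jobserver-auth / --jobserver-fds in MAKEFLAGS, a single child, a greedy transfer policy that may briefly hold one token away from sibling jobs, and races between the pump thread and the final drain are ignored

```python
#!/usr/bin/env python3
# sketch of a token-accounting proxy, run as: proxy.py <command> [args...]
import os, re, select, subprocess, sys, threading, time

# upstream jobserver fds, advertised by make in MAKEFLAGS
# (--jobserver-auth=R,W on make >= 4.2, --jobserver-fds=R,W before)
makeflags = os.environ.get("MAKEFLAGS", "")
m = re.search(r"--jobserver-(?:auth|fds)=(\d+),(\d+)", makeflags)
if not m:
    sys.exit("no jobserver advertised in MAKEFLAGS")
up_r, up_w = int(m.group(1)), int(m.group(2))

down_r, down_w = os.pipe()            # fresh pipe pair for the child
os.set_inheritable(down_r, True)
os.set_inheritable(down_w, True)

transferred = 0                       # tokens moved downstream so far

def pump():
    global transferred
    while True:
        # keep at most one unconsumed token downstream, so we do not
        # hoard tokens away from sibling jobs more than necessary
        while select.select([down_r], [], [], 0)[0]:
            time.sleep(0.05)
        token = os.read(up_r, 1)      # blocks until upstream has a token
        os.write(down_w, token)
        transferred += 1

threading.Thread(target=pump, daemon=True).start()

env = dict(os.environ)
env["MAKEFLAGS"] = re.sub(r"--jobserver-(?:auth|fds)=\d+,\d+",
                          "--jobserver-auth=%d,%d" % (down_r, down_w),
                          makeflags)
child = subprocess.Popen(sys.argv[1:], env=env, close_fds=False)
child.wait()

# drain tokens still sitting in the downstream pipe, then give back
# upstream exactly as many tokens as were ever transferred down --
# tokens a crashed child took with it are regenerated, not lost
while select.select([down_r], [], [], 0)[0]:
    os.read(down_r, 1)
os.write(up_w, b"+" * transferred)    # '+' is make's default token byte
sys.exit(child.returncode)
```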
see also
https://github.com/ocaml/opam/wiki/Spec-for-GNU-make-jobserver-support
The GNU jobserver protocol has a serious shortcoming: tokens must be returned by child processes, but the jobserver does not know which processes have tokens at any given point. Tokens (and parallelism) can therefore be lost to a malfunctioning client or - worse - if a sub-process is killed. The benefit of the approach used, however, is its simplicity both in terms of implementation and resources.
This spec proposes ... an extended protocol for the jobserver which aims to deal with the GNU make shortcoming.
The jobserver itself is a named semaphore on Windows and a pair of pipes on Unix.
The Unix implementation is most easily done with a separate thread which simply selects for returned tokens and immediately writes them to the other pipe to be available for other processes.
When opam is a client of another jobserver, care is required to ensure that opam returns all tokens it obtains (even on error).
Extended jobserver
For compatibility, MAKEFLAGS should be set as normal, but for a command tagged {jobserver}, opam will also communicate via OPAMJOBSERVER the fds (or HANDLEs, on Windows) of another pair of pipes. These pipes are specific to the process.
When the child process wants a token, it writes a single + to the write fd and attempts to read a character from the read fd. The jobserver responds to this by acquiring a token from the main job server (via either WaitForSingleObject or by reading the jobserver read fd) and then writes a dummy token back to the client.
When the client wishes to return the token, it writes a single - to the write fd which the jobserver uses as a signal to return one of the tokens it is holding on behalf of that client to the main jobserver.
This approach increases the complexity of the jobserver in opam, since a pair of fds is required for each process. There is a risk of fd exhaustion, but only on extremely high core count machines - we can hope that the maximum number of fds will increase as these systems become more commonplace, rather than having to increase the complexity of the jobserver further to have sub-processes managing the token pools!
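the client side of that extended protocol is tiny. a sketch, with the fd numbers hard-coded because the exact encoding of OPAMJOBSERVER is not specified above; because the jobserver holds the real tokens on each client's behalf, it always knows how many a dead client was holding and can return them

```python
import os

WFD, RFD = 5, 6   # hypothetical per-process pipe fds from OPAMJOBSERVER

def acquire():
    os.write(WFD, b"+")      # ask the jobserver for a token
    return os.read(RFD, 1)   # block until it grants a dummy token

def release():
    os.write(WFD, b"-")      # jobserver returns one held token upstream
```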
I still believe it's more elegant to solve this with a central TCP tokenpool server, as TCP works on all systems, and we don't need "a million fds" ("a pair of fds is required for each process")
simple solution: stop playing a "zero sum game" (expecting all acquired tokens to be released); instead, allow tokens to go missing, and when cpu load drops, add new tokens to the "game"
long discussion of problems with the jobserver protocol https://github.com/ocaml/dune/pull/4331
We created our own build jobserver implementation at my day job, using Unix datagram sockets. Unfortunately Windows doesn't support those - that doesn't affect us (we don't compile on Windows), but it would be nice if the mechanism worked the same way on all platforms. Unfortunately using TCP connections is more complexity and overhead.
And we didn't do it as just a "token-grabbing" mechanism. Instead, the client requests N slots, and the server responds with how many it can actually use. And the client's request has more info than just the N - it also specifies the type of job it is and the target name, so the jobserver can make better scheduling decisions, because in practice not all jobs are equal. And we specified how OOM works as well, so we can kill+delay jobs during OOM state without failing the overall build. I think that part is key, because memory is just as much a limited resource as CPU or I/O.
BTW, we do also scan for dead processes that didn't release their jobs so we can reclaim their allocated resources, but to be honest I'm not sure it really matters... if a job failed in such a way that it didn't return the resources, then the overall build will fail anyway.
in my case (chromium, qtwebengine) the main build process (ninja) hangs when workers crash
ninja tolerates crashing workers and fails later when output files are missing. but when ninja runs out of tokens, it cannot start any further workers, and the build hangs forever at zero cpu load
hence ...
when cpu load drops, add new tokens to the "game"
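a sketch of that idea, assuming unix (os.getloadavg) and write access to the jobserver pipe. the threshold, poll interval, and cap are arbitrary, and on I/O-bound builds low cpu load does not necessarily mean tokens were lost

```python
import os, time

def replenish(jobserver_wfd, low_load=1.0, max_extra=8):
    added = 0
    while added < max_extra:                # cap runaway over-subscription
        time.sleep(10)
        if os.getloadavg()[0] < low_load:   # build looks starved
            os.write(jobserver_wfd, b"+")   # inject a replacement token
            added += 1
```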
gnumake jobserver version 1
https://www.gnu.org/software/make/manual/html_node/Job-Slots.html https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html https://www.gnu.org/software/make/manual/html_node/Windows-Jobserver.html
jobserver + jobclient implementations
gnumake https://github.com/stefanb2/ninja/tree/topic-tokenpool-master
jobclient implementations
https://github.com/olsner/jobclient https://github.com/milahu/gnumake-jobclient-js https://github.com/milahu/gnumake-jobclient-py
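for reference, a minimal version-1 jobclient fits in a few lines of python (a sketch; error handling is omitted):

```python
import os, re

def jobserver_fds():
    # make >= 4.2 advertises --jobserver-auth=R,W in MAKEFLAGS;
    # older versions use --jobserver-fds=R,W
    m = re.search(r"--jobserver-(?:auth|fds)=(\d+),(\d+)",
                  os.environ.get("MAKEFLAGS", ""))
    if not m:
        raise RuntimeError("no jobserver advertised in MAKEFLAGS")
    return int(m.group(1)), int(m.group(2))

rfd, wfd = jobserver_fds()
token = os.read(rfd, 1)      # acquire: blocks until a token is free
try:
    pass                     # ... run one parallel job here ...
finally:
    os.write(wfd, token)     # release: write the same byte back
# the finally clause only runs on clean exits; a crashed worker never
# reaches it -- which is exactly the bug this issue is about
```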
history
the jobserver feature was added to gnumake in 1999 in this commit
when workers crash without proper cleanup code, they will not release their tokens, and other jobs can hang forever waiting for tokens
writing reliable cleanup code is hard. it's easier to write a jobserver that actively monitors its child processes: when a worker process crashes without releasing its tokens, the jobserver assumes those tokens are free
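a sketch of the reclaim step, assuming (unlike the v1 protocol) the jobserver records which pid holds how many tokens when it grants them; psutil.pid_exists is a cheap liveness check

```python
import psutil

def reclaim_lost_tokens(holders, free_tokens):
    # holders: pid -> number of tokens handed to that worker; the v1
    # protocol provides no such bookkeeping, so this assumes a
    # jobserver that tracks it itself
    for pid in list(holders):
        if not psutil.pid_exists(pid):          # worker died
            free_tokens += holders.pop(pid)     # reclaim its tokens
    return free_tokens
```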
limitation: child processes must inherit file descriptors
for example, this fails with python's subprocess.Popen, which by default (close_fds=True) does not let the child inherit fds above 2
the only requirement should be: inherit environment variables of the parent process (env vars like MAKEFLAGS)
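for illustration: on the python side the fix is a one-liner, explicitly whitelisting the jobserver fds (here assumed to be 3 and 4)

```python
import subprocess

# Popen defaults to close_fds=True, closing every fd above 2 in the
# child; pass_fds keeps the listed fds open and inheritable
subprocess.Popen(["ninja"], pass_fds=(3, 4))
# environment variables like MAKEFLAGS are inherited by default
```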
other job servers
aka job queues
https://github.com/gearman/gearmand
https://github.com/celery/celery
https://github.com/beanstalkd/beanstalkd
see also:
message passing
aka message queues
https://stackoverflow.com/questions/731233/activemq-or-rabbitmq-or-zeromq-or
prototype implementation
pymake: python implementation of gnu make http://benjamin.smedbergs.us/pymake/ https://github.com/securesystemslab/pkru-safe-mozjs/tree/main/mozjs/build/pymake
https://github.com/linuxlizard/pymake Parse GNU Makefiles with Python
psutil: get process tree https://psutil.readthedocs.io/en/latest/#processes https://github.com/milahu/nix-build-profiler (example use of psutil)
requirements
work on windows → no named pipes. must support TCP (or a more abstract message-passing system like rabbitmq)
authentication. the jobserver can require workers to auth with a session cookie. the session cookie is passed via env-var to the workers
sandboxing. work in limited environments, for example in a nix-build sandbox
active monitoring of the process tree. libraries like psutil allow this at almost-zero cost. needed to monitor "bad" clients: crashed clients who don't release their tokens, and clients who ignore the jobserver and produce 100% cpu load (see the sketch after this list)
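a minimal sketch of such a central TCP tokenpool covering the first three requirements, with hypothetical JOBSERVER_COOKIE / JOBSERVER_JOBS env vars and naive single-packet framing. tying each token to a connection makes crash reclamation automatic: when a worker dies, the kernel closes its socket, the server's recv returns, and the token goes back into the pool, with no process-tree scanning needed for the crash case

```python
import os, socket, socketserver, threading

COOKIE = os.environ.get("JOBSERVER_COOKIE", "secret")
TOKENS = threading.Semaphore(int(os.environ.get("JOBSERVER_JOBS", "8")))

class TokenHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # first message must be the session cookie
        if self.request.recv(64).strip() != COOKIE.encode():
            return                      # reject unauthenticated clients
        TOKENS.acquire()                # one token per connection
        try:
            self.request.sendall(b"+")  # grant the token
            self.request.recv(1)        # block until the peer closes
        finally:
            TOKENS.release()            # clean exit or crash: reclaimed

def acquire(addr=("127.0.0.1", 7777)):
    # client side: the token lives exactly as long as this socket
    s = socket.create_connection(addr)
    s.sendall(COOKIE.encode() + b"\n")
    s.recv(1)                           # blocks until a token is granted
    return s                            # close s (or crash) to release

class Server(socketserver.ThreadingTCPServer):
    allow_reuse_address = True

if __name__ == "__main__":
    Server(("127.0.0.1", 7777), TokenHandler).serve_forever()
```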
targets
build systems: gnumake, cmake, ninja, cargo, meson, bazel ...
related: Build Systems à la Carte