rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

support and eventually upgrade to an alternate jobserver protocol standard that does not leak threads, in collaboration with zig #14102

Open andrewrk opened 2 weeks ago

andrewrk commented 2 weeks ago

Problem

The GNU make jobserver protocol leaks threads if a process dies while holding a token. For example:

  1. Parent process creates the pipe and puts 2 tokens into the pipe, then spawns children A and B.
  2. Child Process A reads a token and starts doing work, but then crashes before putting the token back.
  3. Child Process B has 100 tasks to do, but only proceeds to do them using 1 logical core, leaving an entire core idle because it has been "leaked" by Child Process A.

It could get even worse - if another leak occurred in this scenario, the entire process tree would deadlock.

The fundamental problem is that the protocol relies on a process manually releasing resources before it dies, rather than relying on kernel-managed resources that are automatically released when the process dies.

Proposed Solution

Let's work together on a standard protocol that does not have this flaw, and then pressure GNU make, Ninja, and other third party projects to support it in future versions.

Here is an initial proposal (copied from https://github.com/ziglang/zig/issues/20274). This is something I thought of today and have not evaluated yet. I'm happy to workshop it with the Cargo team.

Initially proposed scheme based on advisory record locks

The root process opens a temporary file (perhaps with `shm_open`) and writes N bytes to the file, where N is the thread pool size. This file descriptor is passed down the process tree. Each byte in the file represents an available logical core. Each thread in each process's thread pool holds an advisory record lock (`fcntl(F_SETLKW)`) on the corresponding byte while working. Advisory record locks are automatically released by the kernel when a process dies. In theory this has exactly the same overhead as reading and writing byte-sized tokens to a pipe (1 syscall each for lock and unlock), the difference being that the resource is automatically released by the kernel on process death.
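For concreteness, here is a minimal sketch of what the root-process setup could look like under this scheme; the `JOBSERVER_FD` variable name and the shared-memory object name are made up for illustration, and error handling is minimal:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);   // thread pool size N = logical core count
    int fd = shm_open("/jobserver-demo", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    shm_unlink("/jobserver-demo");            // keep only the fd; the name is not needed
    if (ftruncate(fd, n) < 0) { perror("ftruncate"); return 1; }  // one byte per logical core
    char buf[32];
    snprintf(buf, sizeof buf, "%d", fd);
    setenv("JOBSERVER_FD", buf, 1);           // hypothetical: advertise the inherited fd number
    // ... spawn child processes here; thread i in each participating process holds an
    // fcntl(F_SETLKW) write lock on byte i of `fd` while it is doing work ...
    return 0;
}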

Alternate scheme based on the root process listening on a unix domain socket instead: https://github.com/rust-lang/cargo/issues/14102#issuecomment-2177815756

The idea would be to deprecate the legacy GNU make jobserver protocol and try to standardize various open source projects on the better protocol. In the beginning, it could be merely supported when the parent process implements this new standard, and have a flag for choosing it when Cargo is the root process in the tree. Later on, if the new standard achieves widespread adoption, then the default could change.

Notes

If there is interest, I will follow up here with performance data points and other observations as I evaluate this strategy in the Zig project.

Happy hacking!

weihanglo commented 2 weeks ago

cc @NobodyXu you might be interested.

NobodyXu commented 2 weeks ago

Interesting proposal, but I have a few questions:

  • how would you find a byte to lock?
  • how would you block a process until a token is available?
  • how to implement this on windows?

Sorry if I sound discouraging, but I am just reading your proposals and thinking about how to implement it.

andrewrk commented 2 weeks ago
  • how would you find a byte to lock?

Each process that wants to participate spawns a thread pool of size N where N is the byte size of the file. The threads are numbered sequentially from 0..N, each corresponding to a byte in the file.

  • how would you block a process until a token is available?

Lock:

struct flock options = {
    .l_type = F_WRLCK,        // exclusive lock held while this thread works
    .l_whence = SEEK_SET,
    .l_start = thread_index,  // this thread's byte in the shared file
    .l_len = 1,
};
fcntl(fd, F_SETLKW, &options); // F_SETLKW blocks until the byte becomes free

Unlock:

struct flock options = {
    .l_type = F_UNLCK,        // release the byte once the work item is done
    .l_whence = SEEK_SET,
    .l_start = thread_index,
    .l_len = 1,
};
fcntl(fd, F_SETLKW, &options); // unlocking never blocks; the kernel also drops the lock if the process dies
  • how to implement this on windows?

Out of scope. A completely different strategy probably.

Hello71 commented 2 weeks ago

This scheme seems suspiciously quadratic in memory usage, with a high constant factor. From https://unix.stackexchange.com/questions/708104/what-is-the-minimum-amount-of-memory-required-for-starting-a-process-on-a-linux, each thread costs at least one 10 KB task_struct, plus at least 4 KiB stack, so 128 jobs × 128 threads/job × ~16 KiB/thread ≈ 262 MiB. Even if memory is cheap, that's way too much to spend on jobserver synchronization.

Also, the kernel is probably not optimized for very large numbers of threads waiting to lock different bytes in the same file. I wouldn't be surprised if there's a linked list of all waiters on each file, because it would use less memory than a tree in normal usage.

andrewrk commented 2 weeks ago

Hmm, I'm not following your math. Is this example a computer that has 128 logical cores? In this case it would be 128 threads per process. So, 16 KiB/thread × 128 threads = 2 MiB per process. In a more typical case of 32 logical cores it would be 16 KiB/thread × 32 threads = 512 KiB per process.

Edit: I think I understand now - your 128 jobs number is the number of processes alive in the tree at the same time. In this case the math checks out.

I think I reach a different conclusion than you, however. I think 128 simultaneously active processes would realistically only occur on a system that has a high number of logical cores in the first place. A system with 128 logical cores will have a large amount of memory to match, making the 262 MiB a tiny fraction of the whole. Meanwhile, a machine with fewer cores and less RAM will probably spawn fewer simultaneous processes and therefore need less memory for this.

Real world examples, using your numbers for simultaneous process count:

Either way, it's a good point to note - the status quo allows threads to be spawned lazily after receiving a token, whereas the proposed protocol requires a preallocated thread pool.

Your other point is important as well. I'll do some experiments and post the results.

andrewrk commented 2 weeks ago

Also, the kernel is probably not optimized for very large numbers of threads waiting to lock different bytes in the same file. I wouldn't be surprised if there's a linked list of all waiters on each file, because it would use less memory than a tree in normal usage.

It appears to me this hunch is correct. In the Linux source code, for example, in fs/locks.c, posix_lock_inode, I can see that it uses list_for_each_entry to iterate over locks, and based on a cursory glance it seems like that list is per-file.

It's disappointing. I haven't found any other primitives that might satisfy this use case. 24 million lines of C code in this source tree, 1.3 million commits, 23 years of heavy development, and yet we as users can't even batch work across a process tree correctly.

andrewrk commented 2 weeks ago

Alright, never mind, there was a solution right in front of my face. Here's an alternate scheme:

Root process listens on a unix domain socket. The address is passed down the process tree via environment variable. To get a thread token a child process connects to the socket. To return the token, disconnect. The root process only accepts N concurrent connections and stops accepting when it's maxed out.

When a child process dies, the OS causes the connection to end, allowing the root process to accept a new connection.
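To make the shape of this concrete, here is a rough sketch of what the root-process side might look like under this scheme (error handling omitted; the fixed-size table and the backlog value are just for illustration):

#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

void serve(const char *path, int n_tokens) {
    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
    unlink(path);                   // remove any stale socket file
    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 128);

    struct pollfd pfds[1 + 1024];   // slot 0 is the listener, the rest are held tokens
    int held = 0;
    for (;;) {
        pfds[0].fd = listener;
        pfds[0].events = held < n_tokens ? POLLIN : 0;  // stop accepting when maxed out
        poll(pfds, 1 + held, -1);
        if (held < n_tokens && (pfds[0].revents & POLLIN)) {
            pfds[1 + held].fd = accept(listener, NULL, NULL);  // hand out a token
            pfds[1 + held].events = POLLIN;
            pfds[1 + held].revents = 0;
            held++;
        }
        for (int i = held; i >= 1; i--) {
            if (pfds[i].revents & (POLLHUP | POLLIN)) {  // client disconnected or died
                close(pfds[i].fd);
                pfds[i] = pfds[held];                    // token goes back to the pool
                held--;
            }
        }
    }
}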

NobodyXu commented 2 weeks ago

Yeah, the unix socket sounds plausible; it's also cross-platform and doable on Windows and other OSes.

It adds a bit more work to the "root process", whereas the make jobserver doesn't need a server, but I think this is a reasonable design, and in practice you always need a "root process" to keep it alive anyway.

The socket connection can be either blocking or non-blocking, so it is usable in sync or async functions.

Compared to the jobserver, it can't get the available tokens, though I guess that could be implemented via an extension by sending a message to the server.

Also, I think the address of the unix socket would need to be passed as an environment variable? E.g. JOBSERVER=unix-socket:...

mlugg commented 2 weeks ago

Compared to the jobserver, it can't get the available tokens, though I guess that could be implemented via an extension by sending a message to the server.

What's the actual use case for such a feature? I'm not claiming there isn't one, I just don't immediately see it. AFAICT, any process spawning jobs can just use a thread pool capped at the number of logical cores; perhaps there will be extra threads blocked at all times, but that shouldn't be a problem.

Also, I think the address of the unix socket would need to be passed as an environment variable? E.g. JOBSERVER=unix-socket:...

Yep, agreed.

NobodyXu commented 2 weeks ago

What's the actual use case for such a feature?

Hmmm, I can't think of one, so yeah, it probably isn't that important.

NobodyXu commented 2 weeks ago

Another interesting aspect of the new jobserver protocol would be dealing with the implicit global token.

Suppose that you spawn a new task after getting a token from the jobserver protocol; that task has an implicit token for its main thread.

If that task goes on to create more threads, which could happen in multi-threaded rustc, or in a build script using cc which spawns compiler processes, then when spawning new threads it needs to take that implicit token into consideration.

Otherwise it may never be able to spawn a new thread/process, if the jobserver only allows one thread/process to execute at a time (which is a rare but possible use case).

At the very least it could be blocked for a long time. cc uses an atomic to record the implicit global token:

https://github.com/rust-lang/cc-rs/blob/76c377ae360ccd75a00d38b484c0b08e5a429462/src/parallel/job_token.rs#L102
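The core of that bookkeeping, sketched in C here purely for illustration (the actual cc code is the Rust linked above; `jobserver_acquire()` stands in for whatever the real protocol's acquire call would be):

#include <stdatomic.h>
#include <stdbool.h>

int jobserver_acquire(void);   // hypothetical: blocks until the jobserver grants a token

static atomic_bool implicit_token_used;

int acquire_token(void) {
    bool expected = false;
    // The first acquisition is satisfied by the implicit token this process was
    // started with; only subsequent ones have to ask the jobserver.
    if (atomic_compare_exchange_strong(&implicit_token_used, &expected, true))
        return -1;                 // sentinel meaning "the implicit token"
    return jobserver_acquire();
}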

I wonder if there's any better design than this that can take care of this scenario without having to manually handle it.

mlugg commented 2 weeks ago

Suppose that you spawn a new task after getting a token from the jobserver protocol; that task has an implicit token for its main thread.

When the parent spawns its child, could the child directly inherit the FD for the connection acquired by the parent? Then, the parent could close their FD after the fork, and the child now "owns" that socket connection, and it will be automatically released when the child exits. We still need logic for this (to ensure the child inherits the FD and the parent immediately closes it), but I think having it in child spawning is mildly better than requiring every child to handle this.

EDIT: oh, the child would also need to be able to release this FD when it wants to; that's kind of your point. The FD index could be passed in an env var.
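A rough sketch of that hand-off, assuming fork/exec (the `JOBSERVER_TOKEN_FD` name is made up; `token_fd` is the connected socket holding the parent's acquired token):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

pid_t spawn_with_token(int token_fd, char *const argv[]) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: the connection (and therefore the token) is now ours. Advertise
        // the fd number so the child can release the token early by closing it.
        char buf[16];
        snprintf(buf, sizeof buf, "%d", token_fd);
        setenv("JOBSERVER_TOKEN_FD", buf, 1);   // hypothetical variable name
        execvp(argv[0], argv);
        _exit(127);
    }
    // Parent: give up ownership immediately; the kernel closes the connection
    // (returning the token) whenever the child exits, even if it crashes.
    close(token_fd);
    return pid;
}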

NobodyXu commented 2 weeks ago

Yeah I think that could work.

Having the fd in the child would also have another benefit:

The job token will be released as soon as the child exits, instead of when the process that acquired the token has reaped the child or has itself exited.

The use of a unix socket does seem like a cleaner design compared to a (named) pipe.

bjorn3 commented 2 weeks ago

When the parent spawns its child, could the child directly inherit the FD for the connection acquired by the parent?

That re-introduces the problem that make's new named-pipe jobserver protocol solves and that the old anonymous-pipe protocol suffered from: you are forced to initialize the jobserver client before any code runs that could open file descriptors. Otherwise you might misinterpret a locally opened file descriptor as one inherited from the jobserver, which violates IO safety. This is also what necessitates jobserver::Client::from_env being unsafe.

NobodyXu commented 2 weeks ago

Hmmm yeah that's true.

andrewrk commented 2 weeks ago

I haven't been able to actually get this scheme to work in practice, on Linux: https://gist.github.com/andrewrk/d8014959638ff67ff87a183b639304e2

  • connect() on an un-accepted connection does not block (?)
  • poll(POLLOUT) on a connected, un-accepted connection does not block (??)

joshtriplett commented 2 weeks ago

@andrewrk wrote:

Alright, never mind, there was a solution right in front of my face. Here's an alternate scheme:

Root process listens on a unix domain socket. The address is passed down the process tree via environment variable. To get a thread token a child process connects to the socket. To return the token, disconnect. The root process only accepts N concurrent connections and stops accepting when it's maxed out.

When a child process dies, the OS causes the connection to end, allowing the root process to accept a new connection.

This seems like a great solution that's reasonably portable; it'd work on any UNIX, and it'd work on modern Windows.

@andrewrk wrote:

I haven't been able to actually get this scheme to work in practice, on Linux: https://gist.github.com/andrewrk/d8014959638ff67ff87a183b639304e2

  • connect() on an un-accepted connection does not block (?)
  • poll(POLLOUT) on a connected, un-accepted connection does not block (??)

It'd be an extra syscall, but one thing that should work portably on any UNIX system would be to connect and then read one byte, and have the jobserver always write a single byte to each socket after accepting it.
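A minimal sketch of the client side under that variant (error handling omitted; the acquire blocks in read until the jobserver has written its byte):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int acquire_token(const char *path) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
    connect(fd, (struct sockaddr *)&addr, sizeof addr);
    char token;
    read(fd, &token, 1);   // blocks until the jobserver grants a slot by writing one byte
    return fd;             // keep the connection open while the job runs
}

void release_token(int fd) {
    close(fd);             // disconnecting returns the token; process death does the same
}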

If that turns out to be the only way to get this to work, then another possible protocol, which might be faster for multi-job processes but marginally slower for single-job processes, would be for clients of this protocol to first write a byte to the connected socket for each job it wants to start, and then read a byte per job. This would allow connecting a socket once and then requesting multiple jobs from it, rather than connecting a socket per job. For a multi-job process, that reduces the overhead to a write and read per job (both of which may be possible to batch), and it doesn't require the overhead of updating the fd table repeatedly. But for a single-job process, this would mean one more syscall (a write) on top of what's now already a socket/connect/read.

It's unfortunate that this protocol already requires 3 syscalls to start each job, and another to close it (if not closing it via process exit). This isn't substantially worse than the 2+1 of the current jobserver protocol (which requires open and read and a subsequent write) or the 1+1 of the old one (inherit fd, do a read and subsequent write), but it's not ideal. However, I don't see any obvious way to do substantially better, even without taking portability into account, while preserving the property of automatically releasing on close.

andrewrk commented 6 days ago

I am observing that when one thread is blocking on connect() and another thread calls shutdown() on the same sockfd, the first thread continues to block in the connect syscall: https://gist.github.com/andrewrk/bf46cff35d34692558563f75bc8d0e62

Any ideas on how to solve this? The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect().

joshtriplett commented 6 days ago

@andrewrk Based on the earlier experiment, I had thought connect never blocked in the first place, which is why we need the read.

Investigating further, it looks like connect will eventually block, depending on the backlog size specified in listen. (Looking at the zig PR thread, it looks like you found something similar.)

For instance, if you have a backlog of 2, and your listener accepts only one connection and loops forever (never accepting other connections), the first connect will succeed (being accepted), the next 3 (?!) will successfully connect and block in read, and after that, additional connect attempts will block.

Likewise, with a backlog of 1, the first connection succeeds (being accepted), the next 2 (?!) will successfully connect and block in read, and after that, additional connect attempts will block.

This behavior seems weird enough that it probably can't be counted on, and I'd be shocked if it didn't vary by platform. I'm not sure it's viable to count on setting the backlog to a large number (in any case you shouldn't set it to more than SOMAXCONN), nor should you count on any particular backlog behavior from the jobserver (since many implementations are unlikely to change the default). I think you're going to have to allow for either connect or read blocking.

From what I can tell, the behavior of non-blocking connect is also both finicky and non-standard here; I don't think you're likely to be able to use poll to wait for connect to succeed. On Linux at least, non-blocking UNIX sockets just fail connect with EAGAIN rather than returning EINPROGRESS, so I don't think you can poll for completion of connect. This seems like a substantial deficiency. I haven't found any reasonable way to do a non-blocking connect to a UNIX socket on Linux; there are various unreasonable ways that require jumping through substantial hoops (e.g. requiring io_uring, or spawning a thread/process).

You could use the sockopt SO_SNDTIMEO, which is standardized by POSIX, but I'm not sure all platforms apply that timeout to connect and not just send/write. In any case, this would cause you to have to repeatedly wake up when the timeout expires and re-invoke connect, which seems wasteful, and it'd incur a delay before you can kill worker threads. That might be acceptable, though, since you expect most blocks to occur on read (for which poll works) rather than connect. But in any case, you'd have to verify that it applies to connect on all targets.

I don't think you can portably send a signal to a specific thread (rather than a process), leaving aside all the other reasons not to want to use signals. pthread_cancel similarly seems like a bad idea.

I expect the behavior of closing the fd (or race-free equivalent like dup2ing /dev/null over the fd) would be finicky and non-standard. From reading the Linux code, I think that if connect has reached the point of waiting to connect, it's just going to keep waiting and not care about its socket having been closed.

You could move the blocking from the worker threads themselves to a separate thread that handles making the connections, which would allow you to terminate the worker threads (since they could wait on something like a channel), but that just moves the problem to the connection thread.

If you want something to be able to block in either connect or read, but you also want to be able to bail out and end the worker pool by some means other than your process ending, I don't have anything else obvious to suggest; non-blocking connect doesn't work on UNIX sockets, and blocking connect can't easily be terminated.

joshtriplett commented 5 days ago

@andrewrk A correction: it looks like signals might be viable after all. https://www.man7.org/linux/man-pages/man3/pthread_kill.3.html exists. You could install a signal handler, invoke pthread_kill, the handler will run on that thread, the handler can set a flag, and after the handler returns, connect should return with EINTR and the code that invoked connect can check the flag.
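A small sketch of that mechanism (the signal choice and the names are illustrative):

#include <pthread.h>
#include <signal.h>

static volatile sig_atomic_t shutting_down;

static void wake_handler(int sig) {
    (void)sig;
    shutting_down = 1;               // the flag the interrupted thread will check
}

void install_wake_handler(void) {
    struct sigaction sa = { .sa_handler = wake_handler };
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                 // deliberately no SA_RESTART, so connect() fails with EINTR
    sigaction(SIGUSR1, &sa, NULL);   // SIGUSR1 is an arbitrary choice here
}

void wake_worker(pthread_t worker) {
    pthread_kill(worker, SIGUSR1);   // runs the handler on that specific thread
}

// In the worker: when connect() or read() returns -1 with errno == EINTR,
// check shutting_down and exit the loop instead of retrying.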

It'd be mildly painful for something general-purpose like zig's worker pool; you'd probably want to let the user pass in a signal number that it's OK for you to use (which can have a reasonable default), but it should work.