mozilla / sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. It can cache results in local storage or in remote storage, including various cloud storage backends.
Apache License 2.0

make jobserver hangs after invoking sccache if server isn't already spawned #2145

Open 64 opened 5 months ago

64 commented 5 months ago

Minimal reproducer: (sccache v0.7.7, GNU make v4.4.1, cargo v1.78.0-nightly 194a60b29)

$ printf 'all:\n\tsccache cc main.c' > Makefile
$ printf 'int main(){}' > main.c
$ sccache --stop-server
$ make -j2
# <----- hangs here!

WORKAROUND: Run sccache --start-server before invoking make. Alternatively, invoke make with --jobserver-style=pipe.

64 commented 5 months ago

Looking into this a bit further. I'm using strace -f -yY -o log.txt make -j2 to see what's going on. Grepping for the jobserver fifo ("GMfifo") shows all operations on it: (note the 'finished' parts of some syscalls are not shown due to strace interleaving output)

log with sccache

```
13767 mknodat(AT_FDCWD, "/tmp/GMfifo13767", S_IFIFO|0600) = 0
13767 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDONLY|O_NONBLOCK) = 3
13767 openat(AT_FDCWD, "/tmp/GMfifo13767", O_WRONLY) = 4
13767 fcntl(3, F_GETFD) = 0
13767 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
13767 fcntl(4, F_GETFD) = 0
13767 fcntl(4, F_SETFD, FD_CLOEXEC) = 0
13767 write(4, "+", 1) = 1
13767 fcntl(3, F_GETFL) = 0x8800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE)
13767 fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
13768 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13770 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13789 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13839 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13842 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC
13842 <... openat resumed>) = 3
13843 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13846 read(3,
13847 write(3, "+", 1
13852 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13770 close(3) = 0
13857 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13874 openat(AT_FDCWD, "/tmp/GMfifo13767", O_RDWR|O_CLOEXEC) = 3
13876 read(3,
13876 read(3,
13877 write(3, "+", 1
13877 write(3, "+", 1
13876 read(3,
13876 read(3,
13877 write(3, "+", 1) = 1
13877 write(3, "+", 1
13876 read(3,
13877 write(3, "+", 1
13876 read(3,
13877 write(3, "+", 1
13876 read(3,
13877 write(3, "+", 1
13857 close(3) = 0
13767 fcntl(3, F_GETFL) = 0x8800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE)
13767 fcntl(3, F_SETFL, O_RDONLY|O_LARGEFILE) = 0
13767 close(4) = 0
13767 read(3, "+", 1) = 1
13767 read(3,
```

log without sccache

```
15030 mknodat(AT_FDCWD, "/tmp/GMfifo15030", S_IFIFO|0600) = 0
15030 openat(AT_FDCWD, "/tmp/GMfifo15030", O_RDONLY|O_NONBLOCK) = 3
15030 openat(AT_FDCWD, "/tmp/GMfifo15030", O_WRONLY) = 4
15030 fcntl(3, F_GETFD) = 0
15030 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
15030 fcntl(4, F_GETFD) = 0
15030 fcntl(4, F_SETFD, FD_CLOEXEC) = 0
15030 write(4, "+", 1) = 1
15030 fcntl(3, F_GETFL) = 0x8800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE)
15030 fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
15031 openat(AT_FDCWD, "/tmp/GMfifo15030", O_RDWR|O_CLOEXEC) = 3
15033 openat(AT_FDCWD, "/tmp/GMfifo15030", O_RDWR|O_CLOEXEC) = 3
15038 openat(AT_FDCWD, "/tmp/GMfifo15030", O_RDWR|O_CLOEXEC) = 3
15040 read(3,
15040 read(3,
15041 write(3, "+", 1
15041 write(3, "+", 1
15040 read(3,
15040 read(3,
15041 write(3, "+", 1) = 1
15041 write(3, "+", 1
15040 read(3,
15041 write(3, "+", 1
15040 read(3,
15041 write(3, "+", 1
15040 read(3,
15041 write(3, "+", 1
15030 fcntl(3, F_GETFL) = 0x8800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE)
15030 fcntl(3, F_SETFL, O_RDONLY|O_LARGEFILE) = 0
15030 close(4) = 0
15030 read(3, "+", 1) = 1
15030 read(3, "", 1) = 0
15030 close(3) = 0
15030 unlink("/tmp/GMfifo15030") = 0
```

It seems like the tokens are being correctly returned to the jobserver in both cases, but make never sees the read() call return 0 because one of the sccache processes kept the file open (count 3 open calls vs 2 close calls). lsof /tmp/GMfifo15030 confirms this.
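The FIFO behaviour behind this can be reproduced in isolation. The following is a minimal Python sketch, not sccache or make code: the FIFO name `GMfifo-demo` and the function `fifo_eof_demo` are made up for illustration, and the "lingering" descriptor stands in for whichever process keeps the FIFO open. It uses a non-blocking read so the demo reports "would block" where make's blocking `read()` would hang. It assumes a platform such as Linux where opening a FIFO with `O_RDWR` is permitted.

```python
import os
import tempfile


def fifo_eof_demo():
    """Show that a FIFO reader only sees EOF once every writer has closed."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "GMfifo-demo")  # hypothetical name
    os.mkfifo(path, 0o600)

    # Like make in the trace: read end first (non-blocking), then a write end.
    reader = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    make_writer = os.open(path, os.O_WRONLY)
    # Stand-in for a long-lived process that opened the FIFO read-write.
    lingering = os.open(path, os.O_RDWR)

    os.write(make_writer, b"+")
    token = os.read(reader, 1)   # the single token comes back
    os.close(make_writer)        # make closes its own write end

    try:
        os.read(reader, 1)
        while_held = "returned"
    except BlockingIOError:
        # A writer still exists, so there is no EOF; a blocking read hangs here.
        while_held = "would block"

    os.close(lingering)               # the last writer goes away
    after_close = os.read(reader, 1)  # b"" signals EOF

    os.close(reader)
    os.unlink(path)
    os.rmdir(tmpdir)
    return token, while_held, after_close
```

Calling `fifo_eof_demo()` returns the token, what the reader observed while the extra descriptor was still open, and what it observed once that descriptor was closed.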

64 commented 5 months ago

It seems to be caused by a relatively new feature of GNU Make (>4.3.90), where jobserver communication goes through a named FIFO (--jobserver-auth=fifo:/tmp/GMfifoXXXX) (commit, docs) rather than a pipe (--jobserver-auth=R,W).

Indeed, passing make --jobserver-style=pipe causes the reproducer in the OP to exit successfully, whereas --jobserver-style=fifo hangs. Cargo seems to manage its own jobserver via pipes, so the issue only appears when make launches sccache, or indirectly launches it via cargo, and forces everything to use a named FIFO.
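To check which style a given process was handed, the `--jobserver-auth` value can be pulled out of `MAKEFLAGS`. This is a rough Python sketch for debugging, not how sccache or cargo parse it; `parse_jobserver_auth` is a hypothetical helper. It takes the last occurrence of the option, which is what the GNU make manual says clients should do.

```python
def parse_jobserver_auth(makeflags):
    """Classify the last --jobserver-auth value found in a MAKEFLAGS string.

    Returns ("fifo", path), ("pipe", (read_fd, write_fd)), or None when the
    option is absent.
    """
    prefix = "--jobserver-auth="
    auth = None
    for word in makeflags.split():
        if word.startswith(prefix):
            auth = word[len(prefix):]  # keep overwriting: the last one wins
    if auth is None:
        return None
    if auth.startswith("fifo:"):
        return ("fifo", auth[len("fifo:"):])
    read_fd, sep, write_fd = auth.partition(",")
    if not sep:
        raise ValueError(f"unrecognised jobserver auth: {auth!r}")
    return ("pipe", (int(read_fd), int(write_fd)))
```

For example, running `parse_jobserver_auth(os.environ.get("MAKEFLAGS", ""))` from inside a recipe shows whether that recipe was given a FIFO path or a pair of inherited descriptors.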

It's not obvious what the right fix is though. Spawning the server daemon with the context of a jobserver sounds inherently broken to me. Instead, shouldn't the server act as if it was spawned from nothing, and clients pass their jobserver info for each compile request they make to the server?
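As a rough illustration of the "spawned from nothing" half of that idea, here is a Python sketch of launching a daemon with the jobserver context removed. It is not sccache's implementation (sccache is written in Rust), and the helper names `scrubbed_env` and `spawn_detached` are hypothetical. The list of variables is an assumption based on where jobserver details are commonly passed (`MAKEFLAGS`, `MFLAGS`, and cargo's `CARGO_MAKEFLAGS`). Dropping them wholesale also discards unrelated make flags, which a real fix would likely want to handle more carefully. Passing per-request jobserver info from clients to the server is not covered here.

```python
import os
import subprocess

# Assumption: these are the variables through which a child could learn
# about the parent's jobserver.
JOBSERVER_VARS = ("MAKEFLAGS", "MFLAGS", "CARGO_MAKEFLAGS")


def scrubbed_env(env):
    """Return a copy of env without the jobserver-related variables."""
    return {k: v for k, v in env.items() if k not in JOBSERVER_VARS}


def spawn_detached(argv, env=None):
    """Start argv in its own session with no jobserver context.

    close_fds=True stops descriptors from a pipe-style jobserver leaking
    into the child; the scrubbed environment hides a FIFO-style path.
    """
    base = os.environ if env is None else env
    return subprocess.Popen(
        argv,
        env=scrubbed_env(base),
        close_fds=True,
        start_new_session=True,
        stdin=subprocess.DEVNULL,
    )
```

With this approach the daemon never opens the FIFO on its own behalf, so it cannot be the process that keeps the write side alive after make has finished.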