icecc / icecream

Distributed compiler with a central scheduler to share build load
GNU General Public License v2.0

iceccd is not reaping dead processes in good time #619

Open jimis opened 1 year ago

jimis commented 1 year ago

An issue recently appeared on my OpenSUSE-Tumbleweed bleeding-edge system: all the jobs icecc sends to other nodes are failing. The bigger problem is that they fail very slowly, essentially timing out before they are re-run locally.

On my local system:

ICECC[5184] 2023-04-27 21:20:36: <Transfer Environment>
ICECC[5184] 2023-04-27 21:20:36: sent 32884804 bytes (99%)
ICECC[5184] 2023-04-27 21:20:36: Verified host 10.9.70.26 for environment 56e7bcc88b541ddf314dacc7a22a9e23 (x86_64)
ICECC[5184] 2023-04-27 21:20:36: </Transfer Environment: 318ms>
ICECC[5184] 2023-04-27 21:20:36: <send compile_file>
ICECC[5184] 2023-04-27 21:20:36: </send compile_file: 0ms>
ICECC[5184] 2023-04-27 21:20:36: <write_fd_to_server from cpp>
ICECC[6043] 2023-04-27 21:20:36: preparing source to send: /usr/bin/c++    [...]
ICECC[5184] 2023-04-27 21:20:36: sent 308295 bytes (18%)
ICECC[5184] 2023-04-27 21:20:36: </write_fd_to_server from cpp: 126ms>
ICECC[5184] 2023-04-27 21:20:36: <wait for cpp>
ICECC[5184] 2023-04-27 21:20:36: </wait for cpp: 0ms>
ICECC[5184] 2023-04-27 21:20:36: <wait for cs>
[...waiting...]
ICECC[5184] 2023-04-27 21:21:37: </wait for cs: 60103ms>
ICECC[5184] 2023-04-27 21:21:37: the server ran out of memory, recompiling locally
ICECC[5184] 2023-04-27 21:21:37: local build forced by remote exception: Error 101 - the server ran out of memory, recompiling locally
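The client log closes each phase with a marker that carries its duration, so the phase that consumed the minute can be picked out mechanically. The following is a minimal sketch, assuming the `</phase: NNNms>` format seen in the log above; `slow_phases` and its threshold are hypothetical names for illustration and are not part of icecream.

```python
import re

# Closing phase markers look like "</wait for cs: 60103ms>".
PHASE_END = re.compile(r"</(?P<phase>[^:>]+): (?P<ms>\d+)ms>")


def slow_phases(log_text, threshold_ms=1000):
    """Return [(phase, milliseconds)] for phases at or above threshold_ms."""
    slow = []
    for match in PHASE_END.finditer(log_text):
        ms = int(match.group("ms"))
        if ms >= threshold_ms:
            slow.append((match.group("phase"), ms))
    return slow


# Lines copied from the client log above.
log = """\
ICECC[5184] 2023-04-27 21:20:36: </Transfer Environment: 318ms>
ICECC[5184] 2023-04-27 21:20:36: </write_fd_to_server from cpp: 126ms>
ICECC[5184] 2023-04-27 21:21:37: </wait for cs: 60103ms>
"""
print(slow_phases(log))  # [('wait for cs', 60103)]
```

Only `wait for cs` exceeds the threshold, which is consistent with the client sitting idle for a full minute before falling back to a local build.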

On the server side, it does not appear to have run out of memory. In debug mode I captured the failure as follows:

Apr 27 21:20:36: remote compile arguments:    [...]
Apr 27 21:20:36: <parent, waiting>
[...waiting...]
Apr 27 21:21:37: timeout while reading preprocessed file
Apr 27 21:21:37: compiler produced stderr output:
Apr 27 21:21:37: cc1plus: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
Apr 27 21:21:37: [23408] 2023-04-27 19:21:37: Remote compilation exited with exit code 1
Apr 27 21:21:37: [23408] 2023-04-27 19:21:37: </parent, waiting: 60228ms>

cc1plus seems to crash almost immediately, but iceccd does not reap the dead process right away. Until the 60-second timeout elapses, I can see a zombie process named [g++] defunct in the process table.
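To illustrate the behaviour I would expect, here is a minimal Python sketch of a parent that waits for input with a timeout but still notices a dead child promptly. It is not taken from the icecream source. The function name, the pipe standing in for the preprocessed-file stream, and the polling interval are all assumptions made for illustration.

```python
import os
import select
import signal
import time


def wait_for_child_or_data(timeout_s=5.0, poll_s=0.05):
    """Fork a child that dies at once; return (exit_code, seconds_waited)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for a compiler that crashes on startup.
        os._exit(1)

    # The parent keeps write_fd open on purpose, so read_fd never sees EOF,
    # like a daemon waiting for input that will never be consumed.
    start = time.monotonic()
    exit_code = None
    while time.monotonic() - start < timeout_s:
        # Non-blocking reap: returns (0, 0) while the child is still running.
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            if os.WIFEXITED(status):
                exit_code = os.WEXITSTATUS(status)
            else:
                exit_code = -os.WTERMSIG(status)
            break
        # Wait for data in short slices instead of one long timeout.
        select.select([read_fd], [], [], poll_s)

    if exit_code is None:
        # Timed out with the child still alive: kill it and reap it anyway.
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)

    os.close(read_fd)
    os.close(write_fd)
    return exit_code, time.monotonic() - start


if __name__ == "__main__":
    code, waited = wait_for_child_or_data()
    print(f"child exit code {code}, noticed after {waited:.2f}s")
```

With a single blocking wait spanning the whole timeout and no `waitpid` call in between, the child would stay a zombie until the timeout expires, which matches the symptom above. Whether iceccd's wait loop actually has that shape is a guess on my part. A SIGCHLD handler would be another way to get the same prompt notification.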

Version of icecream on both systems:

$ rpm -q icecream
icecream-1.4.0-2.3.x86_64

P.S. The secondary issue is the failure finding libz.so.1 which causes the crash. I'm assuming icecream is failing to transfer all the dependencies that the binary needs. Could be related to the recently enabled hwcaps optimisations in OpenSUSE. I see that my cc1plus binaries depend on:

$ ldd /usr/lib64/gcc/x86_64-suse-linux/13/cc1plus | grep libz.so
        libz.so.1 => /lib64/glibc-hwcaps/x86-64-v3/libz.so.1.2.13 (0x00007f932ddd7000)
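To see at a glance which dependencies resolve into a glibc-hwcaps directory, the `ldd` output can be filtered with a small script. This is a minimal sketch, assuming the usual `name => path (0xaddress)` line format shown above; `hwcaps_dependencies` is a hypothetical helper, not an icecream tool.

```python
import re

# Matches ldd lines of the form "name => /resolved/path (0xaddress)".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def hwcaps_dependencies(ldd_output):
    """Return {soname: resolved_path} for libraries found under glibc-hwcaps."""
    found = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match and "/glibc-hwcaps/" in match.group(2):
            found[match.group(1)] = match.group(2)
    return found


# The line reported by ldd above.
sample = "        libz.so.1 => /lib64/glibc-hwcaps/x86-64-v3/libz.so.1.2.13 (0x00007f932ddd7000)"
print(hwcaps_dependencies(sample))
```

Any library listed by this filter lives outside the plain /lib64 path. If the environment packaging does not follow the hwcaps path, that would explain why libz.so.1 is missing on the remote side, but that is only my assumption.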

jimis commented 1 year ago

> P.S. The secondary issue is the failure finding libz.so.1 which causes the crash. I'm assuming icecream is failing to transfer all the dependencies that the binary needs. Could be related to the recently enabled hwcaps optimisations in OpenSUSE.

This is probably fixed by https://github.com/icecc/icecream/pull/602.

The primary issue is still valid: a crashed process is not reaped in time by iceccd.