icecc / icecream

Distributed compiler with a central scheduler to share build load
GNU General Public License v2.0

running in docker container, iceccd crashes with Assertion `msg->job_id == cl->job_id' failed. #452

Open jessesung opened 5 years ago

jessesung commented 5 years ago

With -vvvv: main.cpp:1276: bool Daemon::handle_job_done(Client*, JobDoneMsg*): Assertion `msg->job_id == cl->job_id' failed.

It seems this doesn't happen when running the daemon with "-m 0".

Tested with icecc 1.2-1 package in Ubuntu.

jessesung commented 5 years ago

Daemon runs as: #/usr/sbin/iceccd -m `getconf _NPROCESSORS_ONLN` --no-remote -d -n <netname> -N <hostname>

llunak commented 5 years ago

With the above commit, a production build should no longer abort on assertions. But I don't know how to reproduce your actual error. Can you please provide more specific steps to reproduce it?
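For readers unfamiliar with the difference: the standard `assert` macro calls `abort()` when its condition is false, and it compiles to nothing when `NDEBUG` is defined. A check that should survive in production is usually written as an ordinary branch instead. The following is a minimal sketch of that idea only. The `Client` and `JobDoneMsg` structs and the `job_ids_match` helper are simplified, hypothetical stand-ins, not icecream's real types, and this is not necessarily what the referenced commit does.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical, simplified stand-ins for the daemon's types (assumption).
struct Client {
    std::uint32_t job_id;
};

struct JobDoneMsg {
    std::uint32_t job_id;
};

// Returns true when the message refers to the client's current job.
// On a mismatch it logs to stderr and returns false instead of aborting,
// so the caller can drop the message or disconnect the client.
bool job_ids_match(const Client& cl, const JobDoneMsg& msg) {
    if (msg.job_id != cl.job_id) {
        std::cerr << "job id mismatch: msg=" << msg.job_id
                  << " client=" << cl.job_id << '\n';
        return false;
    }
    return true;
}
```

A caller would test the return value, for example `if (!job_ids_match(cl, msg)) return false;`, which keeps the process running where `assert` would have raised SIGABRT.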

jessesung commented 5 years ago

@llunak Thanks for your reply!

The steps to reproduce are like:

  1. Create a docker image with ubuntu:bionic as the base image, and upgrade everything to the latest. (I'm also able to reproduce the issue with ubuntu:xenial, ubuntu:cosmic, and ubuntu:disco.)
  2. Install icecream in the image. I've tried the packages from bionic, cosmic, disco, and xenial; they didn't show any difference.
  3. Start a container from the image. I started both icecream and sshd so that people can log in remotely and dispatch work. icecream runs like:
     /usr/sbin/iceccd -m 0 --no-remote -d -n test -N hostname -l /tmp/iceccd.log -vv
     And then sshd:
     /usr/sbin/sshd -D
  4. Build something, and iceccd will fail after some minutes.

mischief commented 5 years ago

i can reproduce this, somewhat reliably. i'm compiling the linux kernel, with -j 30, using a few raspberry pis. here is a dump from valgrind (using the ubuntu 18 packaged icecream).

iceccd: main.cpp:1266: bool Daemon::handle_job_done(Client*, JobDoneMsg*): Assertion `msg->job_id == cl->job_id' failed.
[11290] 15:00:28: !wait_for_msg()
[11281] 15:00:28: timeout <= 0                                          
==13614==                                                             
==13614== Process terminating with default action of signal 6 (SIGABRT): dumping core
[11290] 15:00:28: timeout <= 0                          
==13614==    at 0x5DE4E97: raise (raise.c:51)    
==13614==    by 0x5DE6800: abort (abort.c:79)                         
==13614==    by 0x5DD6399: __assert_fail_base (assert.c:92)                          
==13614==    by 0x5DD6411: __assert_fail (assert.c:101)               
==13614==    by 0x113CCC: Daemon::handle_job_done(Client*, JobDoneMsg*) (main.cpp:1266)
==13614==    by 0x119895: Daemon::handle_activity(Client*) (main.cpp:1685)
==13614==    by 0x11A365: Daemon::answer_client_requests() (main.cpp:1935)           
==13614==    by 0x11AC89: Daemon::working_loop() (main.cpp:2027)
==13614==    by 0x111D33: main (main.cpp:2360)

llunak commented 5 years ago

Sorry, but this is just too much work to reproduce locally. Please attach logs (with -vvv) from both the daemon and scheduler.