icecc / icecream

Distributed compiler with a central scheduler to share build load
GNU General Public License v2.0
1.58k stars 250 forks source link

Scheduler removes daemon in same container: busy installing for a long time #497

Open DouglasRoyds opened 5 years ago

DouglasRoyds commented 5 years ago

I am running the scheduler and daemon inside a docker container. Other instances of this same image, running on different machines, are able to connect to and compile via this scheduler, but the scheduler runs no jobs on its own local daemon (in the same container). The daemon is occasionally being removed by the scheduler.

I am cross-compiling, with ICECC_VERSION pointing to the tarball on the source machine. On one of the successful machines, inside the docker container, I can see the installed tarball:

$ tree -L 4 /var/cache/icecc
/var/cache/icecc
`-- target=x86_64
    `-- 90ef9a5148e18c2e1ab09ad8e3fcc6ae
        |-- etc
        ...

The basedir in the container running the scheduler is empty.

In these logs, the scheduler (and the affected daemon) are running on the machine tbc-fourier:

$ grep tbc-fourier-docker /var/log/icecc-scheduler.log
[9] 2019-08-27 23:19:04: login tbc-fourier-docker protocol version: 42 features: env_xz []
[9] 2019-08-27 23:21:25: no job stats - returning randomly selected tbc-fourier-docker load: 259 can install: x86_64
[9] 2019-08-27 23:21:25: put 2 in joblist of tbc-fourier-docker (will install now)
[9] 2019-08-27 23:23:25: busy installing for a long time - removing tbc-fourier-docker
[9] 2019-08-27 23:23:25: remove daemon tbc-fourier-docker
[9] 2019-08-27 23:24:19: login tbc-fourier-docker protocol version: 42 features: env_xz []
[9] 2019-08-27 23:24:22: put 425 in joblist of tbc-fourier-docker (will install now)
[9] 2019-08-27 23:26:22: busy installing for a long time - removing tbc-fourier-docker
[9] 2019-08-27 23:26:22: remove daemon tbc-fourier-docker
...

Here's the daemon's view:

[10] 2019-08-27 23:18:28: ICECREAM daemon 1.2.90 starting up (nice level 5) 
[10] 2019-08-27 23:18:28: 4 CPU(s) online on this server
[10] 2019-08-27 23:18:28: allowing up to 4 active jobs
[10] 2019-08-27 23:18:28: supported features: env_xz
[10] 2019-08-27 23:18:28: not detaching
[10] 2019-08-27 23:18:28: entered process group
[10] 2019-08-27 23:18:28: ignoring localhost lo
[10] 2019-08-27 23:18:28: broadcast eth0 172.17.255.255
[10] 2019-08-27 23:18:28: Netnames:
[10] 2019-08-27 23:18:28: ICECREAM
[10] 2019-08-27 23:18:28: scheduler is on tbc-fourier:8765 (net )
[10] 2019-08-27 23:18:28: scheduler not yet found/selected.
[10] 2019-08-27 23:18:31: scheduler not yet found/selected.
[10] 2019-08-27 23:18:49: no response within timeout
[10] 2019-08-27 23:18:49: setting error state for channel 172.25.200.240: (A eof: 0)
[10] 2019-08-27 23:18:51: scheduler is on tbc-fourier:8765 (net )
[10] 2019-08-27 23:18:51: scheduler not yet found/selected.
[10] 2019-08-27 23:18:54: scheduler not yet found/selected.
[10] 2019-08-27 23:19:04: Connected to scheduler (I am known as 172.17.0.7)
[10] 2019-08-27 23:23:25: scheduler closed connection
[10] 2019-08-27 23:23:25: cleared children
[10] 2019-08-27 23:23:25: Delaying reconnect.
[10] 2019-08-27 23:23:28: Delaying reconnect.
...
[10] 2019-08-27 23:24:10: Delaying reconnect.
[10] 2019-08-27 23:24:13: scheduler is on tbc-fourier:8765 (net )
[10] 2019-08-27 23:24:13: scheduler not yet found/selected.
[10] 2019-08-27 23:24:16: scheduler not yet found/selected.
[10] 2019-08-27 23:24:19: Connected to scheduler (I am known as 172.17.0.7)
...

Not a show-stopper, I just don't get the benefit of the cores that are available on the machine that's running the scheduler.

Current master (1.2-136-g56e971e) compiled from source on Ubuntu 18.04. I observed the same behaviour with 1.2 (also compiled from source).

llunak commented 5 years ago

In these logs, the scheduler (and the affected daemon) are running on the machine tbc-fourier: ... [9] 2019-08-27 23:19:04: login tbc-fourier-docker protocol version: 42 features: env_xz [] [9] 2019-08-27 23:21:25: no job stats - returning randomly selected tbc-fourier-docker load: 259 can install: x86_64 [9] 2019-08-27 23:21:25: put 2 in joblist of tbc-fourier-docker (will install now) [9] 2019-08-27 23:23:25: busy installing for a long time - removing tbc-fourier-docker [9] 2019-08-27 23:23:25: remove daemon tbc-fourier-docker ... [10] 2019-08-27 23:18:28: scheduler is on tbc-fourier:8765 (net ) [10] 2019-08-27 23:18:28: scheduler not yet found/selected. [10] 2019-08-27 23:18:31: scheduler not yet found/selected. [10] 2019-08-27 23:18:49: no response within timeout [10] 2019-08-27 23:18:49: setting error state for channel 172.25.200.240: (A eof: 0) ...

If the machine is tbc-fourier, why does the log claim the daemon is on tbc-fourier-docker? From the logs it looks like some messages are not received properly, so maybe you have an incorrect network setup with those 2 hostnames and that confuses icecream?

DouglasRoyds commented 5 years ago

The host machine (VM) is tbc-fourier. Both scheduler and daemon are running inside a container on that host. The usual ports are published to the host machine when the container is started (TCP 10245, 8765, 8766; UDP 8765). tbc-fourier-docker is only the node_name that was passed to iceccd on the command line, via -N. The actual hostname visible to the scheduler and iceccd inside the docker container is a hashsum, being the container ID, eg. d8a524114c7a.

llunak commented 5 years ago

"scheduler is on tbc-fourier:8765" uses a hostname. Can you reach that from inside of the container (e.g. telnet tbc-fourier 8765)?

DouglasRoyds commented 5 years ago

Yes, I can connect:

$ sudo docker exec -it iceccd telnet tbc-fourier 8765
Trying 172.25.200.240...
Connected to tbc-fourier.taitradio.net.
Escape character is '^]'.
DouglasRoyds commented 5 years ago

If there's any additional debug messaging that might help, send me a patch ...

llunak commented 5 years ago

Is that really a full paste of the telnet output you get? You're supposed to get also an extra line with an asterisk. If you do not get that, then your daemon can reach the scheduler, but the scheduler cannot reach back, so it still looks like a network misconfiguration.

DouglasRoyds commented 5 years ago

I left the asterisk off the pasted output, sorry. It was there.

llunak commented 5 years ago

Ok, fair enough. Can you debug the issue yourself with some hints? Given that I don't have a setup to reproduce the problem, I don't quite know in which all places to add further debug and we'd possibly spend a number of iterations on this.

The relevant source file is services/comm.cpp , the class MsgChannel. The problem is probably the "no response within timeout" in the log, which is in wait_for_protocol(). I would suggest adding trace() commands to all places that refer to NEED_PROTO. Something like a.patch.txt could do for a first try.

DouglasRoyds commented 5 years ago

The daemon:

[10] 2019-09-17 05:08:13: ICECREAM daemon 1.3 starting up (nice level 5) 
[10] 2019-09-17 05:08:13: 4 CPU(s) online on this server
[10] 2019-09-17 05:08:13: allowing up to 4 active jobs
[10] 2019-09-17 05:08:13: supported features: env_xz
[10] 2019-09-17 05:08:13: not detaching
[10] 2019-09-17 05:08:13: entered process group
[10] 2019-09-17 05:08:13: broadcast eth0 172.17.255.255
[10] 2019-09-17 05:08:13: ignoring localhost lo for broadcast
[10] 2019-09-17 05:08:13: Netnames:
[10] 2019-09-17 05:08:13: ICECREAM
[10] 2019-09-17 05:08:13: starting to listen on all interfaces
[10] 2019-09-17 05:08:13: scheduler is on tbc-fourier:8765 (net )
[10] 2019-09-17 05:08:13: scheduler not yet found/selected.
[10] 2019-09-17 05:08:16: scheduler not yet found/selected.
[10] 2019-09-17 05:08:19: msgchannel ctor
[10] 2019-09-17 05:08:19: wait_for_protocol1:6
[10] 2019-09-17 05:08:34: no response within timeout
[10] 2019-09-17 05:08:34: setting error state for channel 172.25.200.240: (A eof: 0)
[10] 2019-09-17 05:08:34: update_state1:0:0
[10] 2019-09-17 05:08:36: scheduler is on tbc-fourier:8765 (net )
[10] 2019-09-17 05:08:36: scheduler not yet found/selected.
[10] 2019-09-17 05:08:39: scheduler not yet found/selected.
[10] 2019-09-17 05:08:42: msgchannel ctor
[10] 2019-09-17 05:08:42: wait_for_protocol1:6
[10] 2019-09-17 05:08:49: update_state1:4:0
[10] 2019-09-17 05:08:49: update_state2:-1:42
[10] 2019-09-17 05:08:49: update_state3:-43
[10] 2019-09-17 05:08:49: wait_for_protocol3:6
[10] 2019-09-17 05:08:49: wait_for_protocol1:6
[10] 2019-09-17 05:08:49: update_state1:4:0
[10] 2019-09-17 05:08:49: update_state2:-43:42
[10] 2019-09-17 05:08:49: update_state4:42
[10] 2019-09-17 05:08:49: wait_for_protocol3:6
[10] 2019-09-17 05:08:49: Connected to scheduler (I am known as 172.17.0.2)
[10] 2019-09-17 05:08:49: msgchannel ctor
[10] 2019-09-17 05:08:49: wait_for_protocol1:7
[10] 2019-09-17 05:08:49: update_state1:0:0
[10] 2019-09-17 05:08:49: wait_for_protocol2:1
[10] 2019-09-17 05:09:49: msgchannel ctor
[10] 2019-09-17 05:09:49: wait_for_protocol1:9
[10] 2019-09-17 05:09:49: update_state1:0:0
[10] 2019-09-17 05:09:49: wait_for_protocol2:1
[10] 2019-09-17 05:10:49: msgchannel ctor
[10] 2019-09-17 05:10:49: wait_for_protocol1:9
[10] 2019-09-17 05:10:49: update_state1:0:0
[10] 2019-09-17 05:10:49: wait_for_protocol2:1
[10] 2019-09-17 05:11:49: msgchannel ctor
[10] 2019-09-17 05:11:49: wait_for_protocol1:9
[10] 2019-09-17 05:11:49: update_state1:0:0
[10] 2019-09-17 05:11:49: wait_for_protocol2:1
[10] 2019-09-17 05:12:49: msgchannel ctor
[10] 2019-09-17 05:12:49: wait_for_protocol1:9
[10] 2019-09-17 05:12:49: update_state1:0:0
[10] 2019-09-17 05:12:49: wait_for_protocol2:1
[10] 2019-09-17 05:13:49: msgchannel ctor
[10] 2019-09-17 05:13:49: wait_for_protocol1:9
[10] 2019-09-17 05:13:49: update_state1:0:0
[10] 2019-09-17 05:13:49: wait_for_protocol2:1
[10] 2019-09-17 05:14:49: msgchannel ctor
[10] 2019-09-17 05:14:49: wait_for_protocol1:9
[10] 2019-09-17 05:14:49: update_state1:0:0
[10] 2019-09-17 05:14:49: wait_for_protocol2:1
[10] 2019-09-17 05:14:51: scheduler closed connection
[10] 2019-09-17 05:14:51: cleared children
[10] 2019-09-17 05:14:51: Delaying reconnect.
[10] 2019-09-17 05:14:54: Delaying reconnect.
[10] 2019-09-17 05:14:57: Delaying reconnect.
[10] 2019-09-17 05:15:00: Delaying reconnect.
[10] 2019-09-17 05:15:03: Delaying reconnect.
[10] 2019-09-17 05:15:06: Delaying reconnect.
[10] 2019-09-17 05:15:09: Delaying reconnect.

The scheduler (running in the same container):

[8] 2019-09-17 05:08:13: ICECREAM scheduler 1.3 starting up, port 8765
[9] 2019-09-17 05:08:13: scheduler ready
[9] 2019-09-17 05:08:13: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:08:13: ignoring localhost lo for broadcast
[9] 2019-09-17 05:08:13: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:08:13: ignoring localhost lo for broadcast
[9] 2019-09-17 05:08:13: Received scheduler announcement from 172.17.0.2:49305 (version 42, netname ICECREAM)
[9] 2019-09-17 05:08:13: msgchannel ctor
[9] 2019-09-17 05:08:13: accepted 172.16.169.141
[9] 2019-09-17 05:08:13: update_state1:4:0
[9] 2019-09-17 05:08:13: update_state2:-1:32
[9] 2019-09-17 05:08:13: update_state3:-33
[9] 2019-09-17 05:08:13: update_state1:4:0
[9] 2019-09-17 05:08:13: update_state2:-33:32
[9] 2019-09-17 05:08:13: update_state4:32
[9] 2019-09-17 05:08:13: Received scheduler announcement from 172.17.0.1:33668 (version 42, netname ICECREAM)
[9] 2019-09-17 05:08:13: broadcast from 172.17.0.2:60359 (version 42)
[9] 2019-09-17 05:08:13: broadcast from 172.17.0.1:42904 (version 42)
[9] 2019-09-17 05:08:49: msgchannel ctor
[9] 2019-09-17 05:08:49: accepted 172.17.0.1
[9] 2019-09-17 05:08:49: update_state1:4:0
[9] 2019-09-17 05:08:49: update_state2:-1:42
[9] 2019-09-17 05:08:49: update_state3:-43
[9] 2019-09-17 05:08:49: msgchannel ctor
[9] 2019-09-17 05:08:49: accepted 172.16.169.141
[9] 2019-09-17 05:08:49: update_state1:4:0
[9] 2019-09-17 05:08:49: update_state2:-1:42
[9] 2019-09-17 05:08:49: update_state3:-43
[9] 2019-09-17 05:08:49: msgchannel ctor
[9] 2019-09-17 05:08:49: accepted 172.25.40.12
[9] 2019-09-17 05:08:49: update_state1:4:0
[9] 2019-09-17 05:08:49: update_state2:-1:42
[9] 2019-09-17 05:08:49: update_state3:-43
[9] 2019-09-17 05:08:49: msgchannel ctor
[9] 2019-09-17 05:08:49: accepted 172.16.169.141
...
[9] 2019-09-17 05:08:49: accepted 172.16.169.141
[9] 2019-09-17 05:08:49: update_state1:4:0
[9] 2019-09-17 05:08:49: update_state2:-1:32
[9] 2019-09-17 05:08:49: update_state3:-33
[9] 2019-09-17 05:08:49: msgchannel ctor
[9] 2019-09-17 05:08:49: accepted 172.25.200.241
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:70:0
[9] 2019-09-17 05:08:49: update_state2:-43:42
[9] 2019-09-17 05:08:49: update_state4:42
[9] 2019-09-17 05:08:49: login tbc-fourier-docker protocol version: 42 features: env_xz []
[9] 2019-09-17 05:08:49: update_state1:0:0
[9] 2019-09-17 05:08:49: remote end had UNKNOWN type?
[9] 2019-09-17 05:08:49: update_state1:114:0
[9] 2019-09-17 05:08:49: update_state2:-43:42
[9] 2019-09-17 05:08:49: update_state4:42
[9] 2019-09-17 05:08:49: login douglas-docker protocol version: 42 features: env_xz [f90274af152ffac7d855c3313beaca9b(x86_64), ]
[9] 2019-09-17 05:08:49: update_state1:116:0
[9] 2019-09-17 05:08:49: update_state2:-43:42
[9] 2019-09-17 05:08:49: update_state4:42
[9] 2019-09-17 05:08:49: login tbc-vesta-docker protocol version: 42 features: env_xz [f90274af152ffac7d855c3313beaca9b(x86_64), ]
[9] 2019-09-17 05:08:49: update_state1:4:0
[9] 2019-09-17 05:08:49: update_state2:-33:32
[9] 2019-09-17 05:08:49: update_state4:32
[9] 2019-09-17 05:08:50: update_state1:8:0
[9] 2019-09-17 05:08:50: update_state2:-1:42
[9] 2019-09-17 05:08:50: update_state3:-43
[9] 2019-09-17 05:08:50: update_state2:-43:42
[9] 2019-09-17 05:08:50: update_state4:42
[9] 2019-09-17 05:08:50: login tbc-humboldt-docker protocol version: 42 features: env_xz [f90274af152ffac7d855c3313beaca9b(x86_64), ]
[9] 2019-09-17 05:08:50: flush_writebuf() failed(Error: Broken pipe)
[9] 2019-09-17 05:08:50: setting error state for channel 172.16.169.141: (B eof: 0)
[9] 2019-09-17 05:08:50: monitor is blocking... removing
[9] 2019-09-17 05:10:27: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:10:27: ignoring localhost lo for broadcast
[9] 2019-09-17 05:10:27: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:10:27: ignoring localhost lo for broadcast
[9] 2019-09-17 05:10:27: Received scheduler announcement from 172.17.0.2:39575 (version 42, netname ICECREAM)
[9] 2019-09-17 05:10:27: Received scheduler announcement from 172.17.0.1:48419 (version 42, netname ICECREAM)
[9] 2019-09-17 05:12:49: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:12:49: ignoring localhost lo for broadcast
[9] 2019-09-17 05:12:49: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:12:49: ignoring localhost lo for broadcast
[9] 2019-09-17 05:12:49: Received scheduler announcement from 172.17.0.2:52834 (version 42, netname ICECREAM)
[9] 2019-09-17 05:12:49: Received scheduler announcement from 172.17.0.1:43105 (version 42, netname ICECREAM)
[9] 2019-09-17 05:12:51: NEW 1 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] tait-davinci-dspbios/hal/gpio.c C
[9] 2019-09-17 05:12:51: no job stats - returning randomly selected tbc-vesta-docker load: 262 can install: x86_64
[9] 2019-09-17 05:12:51: put 1 in joblist of tbc-vesta-docker
[9] 2019-09-17 05:12:51: BEGIN: 1 client=douglas-docker(x86_64) server=tbc-vesta-docker(x86_64)
[9] 2019-09-17 05:12:51: NEW 2 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] dalink/loader/dalink_loadlib.c C
[9] 2019-09-17 05:12:51: NEW 3 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] hdrbuild/os/bool.h.c C
[9] 2019-09-17 05:12:51: no job stats - returning randomly selected tbc-humboldt-docker load: 92 can install: x86_64
[9] 2019-09-17 05:12:51: put 2 in joblist of tbc-humboldt-docker
[9] 2019-09-17 05:12:51: no job stats - returning randomly selected tbc-fourier-docker load: 238 can install: x86_64
[9] 2019-09-17 05:12:51: put 3 in joblist of tbc-fourier-docker (will install now)
[9] 2019-09-17 05:12:51: BEGIN: 2 client=douglas-docker(x86_64) server=tbc-humboldt-docker(x86_64)
[9] 2019-09-17 05:12:51: NEW 4 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] shared/test/test_circularbuffer.cpp C++
[9] 2019-09-17 05:12:51: no job stats - returning randomly selected douglas-docker load: 302 can install: x86_64
[9] 2019-09-17 05:12:51: put 4 in joblist of douglas-docker
[9] 2019-09-17 05:12:51: NEW 5 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] dalink/tools/dastatprint.c C
[9] 2019-09-17 05:12:51: NEW 6 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] framework/tools/loghighlight.c C
[9] 2019-09-17 05:12:51: BEGIN: 4 client=douglas-docker(x86_64) server=douglas-docker(x86_64)
...
[9] 2019-09-17 05:14:50: END 561 status=0 in=1715676(21%) out=249320(100%) real=3004 user=474 sys=46 pfaults=22037 server=tbc-humboldt-docker
[9] 2019-09-17 05:14:50: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:14:50: ignoring localhost lo for broadcast
[9] 2019-09-17 05:14:50: broadcast eth0 172.17.255.255
[9] 2019-09-17 05:14:50: ignoring localhost lo for broadcast
[9] 2019-09-17 05:14:50: BEGIN: 569 client=douglas-docker(x86_64) server=tbc-humboldt-docker(x86_64)
[9] 2019-09-17 05:14:50: Received scheduler announcement from 172.17.0.2:59233 (version 42, netname ICECREAM)
[9] 2019-09-17 05:14:50: Received scheduler announcement from 172.17.0.1:49602 (version 42, netname ICECREAM)
[9] 2019-09-17 05:14:50: NEW 570 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] framework/active/evsubscriber.cpp C++
[9] 2019-09-17 05:14:50: put 570 in joblist of tbc-humboldt-docker
[9] 2019-09-17 05:14:50: END 560 status=0 in=1719192(21%) out=218856(100%) real=3159 user=737 sys=128 pfaults=21936 server=tbc-vesta-docker
[9] 2019-09-17 05:14:50: BEGIN: 565 client=douglas-docker(x86_64) server=tbc-vesta-docker(x86_64)
[9] 2019-09-17 05:14:51: NEW 571 client=douglas-docker versions=[f90274af152ffac7d855c3313beaca9b(x86_64)] framework/active/inputactionbuilder.cpp C++
[9] 2019-09-17 05:14:51: busy installing for a long time - removing tbc-fourier-docker
[9] 2019-09-17 05:14:51: remove daemon tbc-fourier-docker

You can see that I launched the compilation (on "douglas-docker") at 05:12:51

llunak commented 5 years ago

According to the logs the most likely explanation still seems to be that the daemon cannot connect to the scheduler, even though you can telnet to it. Can you try with this patch? Also please disable all the other daemons when running the test. b.patch.txt

DouglasRoyds commented 4 years ago

The daemon:

...
[10] 2019-10-01 23:44:22: scheduler closed connection
[10] 2019-10-01 23:44:22: cleared children
[10] 2019-10-01 23:44:22: Delaying reconnect.
[10] 2019-10-01 23:44:25: Delaying reconnect.
[10] 2019-10-01 23:44:28: Delaying reconnect.
[10] 2019-10-01 23:44:31: Delaying reconnect.
[10] 2019-10-01 23:44:34: Delaying reconnect.
[10] 2019-10-01 23:44:37: Delaying reconnect.
[10] 2019-10-01 23:44:40: Delaying reconnect.
[10] 2019-10-01 23:44:43: Delaying reconnect.
[10] 2019-10-01 23:44:46: Delaying reconnect.
[10] 2019-10-01 23:44:49: Delaying reconnect.
[10] 2019-10-01 23:44:52: Delaying reconnect.
[10] 2019-10-01 23:44:55: scheduler is on tbc-fourier:8765 (net )
[10] 2019-10-01 23:44:55: scheduler not yet found/selected.
[10] 2019-10-01 23:44:58: scheduler not yet found/selected.
[10] 2019-10-01 23:45:01: msgchannel ctor:172.25.200.240: (A eof: 0):
[10] 2019-10-01 23:45:01: wait_for_protocol1:172.25.200.240: (A eof: 0):6
[10] 2019-10-01 23:45:01: read_a_bit1:172.25.200.240: (A eof: 0)
[10] 2019-10-01 23:45:01: read_a_bit2:172.25.200.240: (A eof: 0):128:4
[10] 2019-10-01 23:45:01: read_a_bit3:172.25.200.240: (A eof: 0):4
[10] 2019-10-01 23:45:01: update_state1:172.25.200.240: (A eof: 0):4:0
[10] 2019-10-01 23:45:01: update_state2:172.25.200.240: (A eof: 0):-1:42
[10] 2019-10-01 23:45:01: update_state3:172.25.200.240: (A eof: 0):-43
[10] 2019-10-01 23:45:01: wait_for_protocol3:172.25.200.240: (A eof: 0):6
[10] 2019-10-01 23:45:01: wait_for_protocol1:172.25.200.240: (A eof: 0):6
[10] 2019-10-01 23:45:01: read_a_bit1:172.25.200.240: (A eof: 0)
[10] 2019-10-01 23:45:01: read_a_bit2:172.25.200.240: (A eof: 0):128:0
[10] 2019-10-01 23:45:01: read_a_bit3:172.25.200.240: (A eof: 1):0
[10] 2019-10-01 23:45:01: update_state1:172.25.200.240: (A eof: 1):0:0
[10] 2019-10-01 23:45:01: wait_for_protocol2:172.25.200.240: (A eof: 1):1
[10] 2019-10-01 23:45:01: scheduler is on tbc-fourier:8765 (net )
[10] 2019-10-01 23:45:01: scheduler not yet found/selected.
[10] 2019-10-01 23:45:04: scheduler not yet found/selected.
[10] 2019-10-01 23:45:07: msgchannel ctor:172.25.200.240: (A eof: 0):
[10] 2019-10-01 23:45:07: wait_for_protocol1:172.25.200.240: (A eof: 0):6
[10] 2019-10-01 23:45:22: no response within timeout172.25.200.240: (A eof: 0):
[10] 2019-10-01 23:45:22: setting error state for channel 172.25.200.240: (A eof: 0)
[10] 2019-10-01 23:45:22: read_a_bit1:172.25.200.240: (A eof: 0)
[10] 2019-10-01 23:45:22: read_a_bit2:172.25.200.240: (A eof: 0):128:-1
[10] 2019-10-01 23:45:22: read_a_bit3:172.25.200.240: (A eof: 0):0
[10] 2019-10-01 23:45:22: update_state1:172.25.200.240: (A eof: 0):0:0
[10] 2019-10-01 23:45:22: read_a_bit5:172.25.200.240: (A eof: 0)
[10] 2019-10-01 23:45:22: 172.25.200.240: (E eof: 1):!read_a_bit
[10] 2019-10-01 23:45:22: !wait_for_msg()172.25.200.240: (E eof: 1)
[10] 2019-10-01 23:45:22: scheduler is on tbc-fourier:8765 (net )
[10] 2019-10-01 23:45:22: scheduler not yet found/selected.
[10] 2019-10-01 23:45:25: scheduler not yet found/selected.
[10] 2019-10-01 23:45:28: msgchannel ctor:172.25.200.240: (A eof: 0):
[10] 2019-10-01 23:45:28: wait_for_protocol1:172.25.200.240: (A eof: 0):6
[10] 2019-10-01 23:45:34: read_a_bit1:172.25.200.240: (A eof: 0)
[10] 2019-10-01 23:45:34: read_a_bit2:172.25.200.240: (A eof: 0):128:8
[10] 2019-10-01 23:45:34: read_a_bit3:172.25.200.240: (A eof: 0):8
[10] 2019-10-01 23:45:34: update_state1:172.25.200.240: (A eof: 0):8:0
[10] 2019-10-01 23:45:34: update_state2:172.25.200.240: (A eof: 0):-1:42
[10] 2019-10-01 23:45:34: update_state3:172.25.200.240: (A eof: 0):-43
[10] 2019-10-01 23:45:34: update_state2:172.25.200.240: (A eof: 0):-43:42
[10] 2019-10-01 23:45:34: update_state4:172.25.200.240: (A eof: 0):42
[10] 2019-10-01 23:45:34: wait_for_protocol3:172.25.200.240: (B eof: 0):6
[10] 2019-10-01 23:45:34: Connected to scheduler (I am known as 172.17.0.2)
[10] 2019-10-01 23:45:34: read_a_bit1:172.25.200.240: (B eof: 0)
[10] 2019-10-01 23:45:34: read_a_bit2:172.25.200.240: (B eof: 0):128:21
[10] 2019-10-01 23:45:34: read_a_bit3:172.25.200.240: (B eof: 0):21
[10] 2019-10-01 23:45:34: read_a_bit1:172.25.200.240: (B eof: 0)
[10] 2019-10-01 23:45:34: read_a_bit2:172.25.200.240: (B eof: 0):128:0
[10] 2019-10-01 23:45:34: read_a_bit3:172.25.200.240: (B eof: 1):0
[10] 2019-10-01 23:45:34: scheduler closed connection
[10] 2019-10-01 23:45:34: cleared children
...

The scheduler:

[9] 2019-10-01 23:44:22: remove daemon tbc-fourier-docker
[9] 2019-10-01 23:44:58: accepting 172.17.0.1
[9] 2019-10-01 23:44:58: msgchannel ctor:172.17.0.1: (A eof: 0):
[9] 2019-10-01 23:44:58: accepted 172.17.0.1
[9] 2019-10-01 23:44:58: read_a_bit1:172.17.0.1: (A eof: 0)
[9] 2019-10-01 23:44:58: read_a_bit2:172.17.0.1: (A eof: 0):128:-1
[9] 2019-10-01 23:44:58: read_a_bit3:172.17.0.1: (A eof: 0):0
[9] 2019-10-01 23:44:58: update_state1:172.17.0.1: (A eof: 0):0:0
[9] 2019-10-01 23:44:58: read_a_bit5:172.17.0.1: (A eof: 0)
[9] 2019-10-01 23:44:58: !wait_for_msg()172.17.0.1: (E eof: 1)
[9] 2019-10-01 23:44:58: remote end had UNKNOWN type?
[9] 2019-10-01 23:45:34: ignoring localhost lo for broadcast
[9] 2019-10-01 23:45:34: broadcast eth0 172.17.255.255
[9] 2019-10-01 23:45:34: ignoring localhost lo for broadcast
[9] 2019-10-01 23:45:34: broadcast eth0 172.17.255.255
[9] 2019-10-01 23:45:34: accepting 172.17.0.1
[9] 2019-10-01 23:45:34: msgchannel ctor:172.17.0.1: (A eof: 0):
[9] 2019-10-01 23:45:34: accepted 172.17.0.1
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (A eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (A eof: 0):128:4
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (A eof: 0):4
[9] 2019-10-01 23:45:34: update_state1:172.17.0.1: (A eof: 0):4:0
[9] 2019-10-01 23:45:34: update_state2:172.17.0.1: (A eof: 0):-1:42
[9] 2019-10-01 23:45:34: update_state3:172.17.0.1: (A eof: 0):-43
[9] 2019-10-01 23:45:34: accepting 172.17.0.1
[9] 2019-10-01 23:45:34: msgchannel ctor:172.17.0.1: (A eof: 0):
[9] 2019-10-01 23:45:34: accepted 172.17.0.1
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (A eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (A eof: 0):128:4
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (A eof: 0):4
[9] 2019-10-01 23:45:34: update_state1:172.17.0.1: (A eof: 0):4:0
[9] 2019-10-01 23:45:34: update_state2:172.17.0.1: (A eof: 0):-1:42
[9] 2019-10-01 23:45:34: update_state3:172.17.0.1: (A eof: 0):-43
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (A eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (A eof: 0):128:0
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (A eof: 1):0
[9] 2019-10-01 23:45:34: update_state1:172.17.0.1: (A eof: 1):0:0
[9] 2019-10-01 23:45:34: remote end had UNKNOWN type?
[9] 2019-10-01 23:45:34: Received scheduler announcement from 172.17.0.2:59920 (version 42, netname ICECREAM)
[9] 2019-10-01 23:45:34: Received scheduler announcement from 172.17.0.1:48558 (version 42, netname ICECREAM)
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (A eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (A eof: 0):128:4
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (A eof: 0):4
[9] 2019-10-01 23:45:34: update_state1:172.17.0.1: (A eof: 0):4:0
[9] 2019-10-01 23:45:34: update_state2:172.17.0.1: (A eof: 0):-43:42
[9] 2019-10-01 23:45:34: update_state4:172.17.0.1: (A eof: 0):42
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (B eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (B eof: 0):128:66
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (B eof: 0):66
[9] 2019-10-01 23:45:34: login tbc-fourier-docker protocol version: 42 features: env_xz []
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (B eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (B eof: 0):128:-1
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (B eof: 0):0
[9] 2019-10-01 23:45:34: read_a_bit5:172.17.0.1: (B eof: 0)
[9] 2019-10-01 23:45:34: setting error state for channel 172.17.0.1: (B eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit1:172.17.0.1: (B eof: 0)
[9] 2019-10-01 23:45:34: read_a_bit2:172.17.0.1: (B eof: 0):128:-1
[9] 2019-10-01 23:45:34: read_a_bit3:172.17.0.1: (B eof: 0):0
[9] 2019-10-01 23:45:34: read_a_bit5:172.17.0.1: (B eof: 0)
[9] 2019-10-01 23:45:34: 172.17.0.1: (E eof: 1):!read_a_bit
[9] 2019-10-01 23:45:34: !wait_for_msg()172.17.0.1: (E eof: 1)
[9] 2019-10-01 23:45:34: !wait_for_msg()172.17.0.1: (E eof: 1)
[9] 2019-10-01 23:45:34: remove daemon tbc-fourier-docker

In this case I had no other daemons running, and wasn't attempting to compile anything.