mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

[develop] Disconnect monitor will pick up normal shutdowns #223

Closed mschubert closed 3 years ago

mschubert commented 3 years ago
Q(function(x) Sys.sleep(x), x=c(0, 10), n_jobs=2)

The first job shuts down while the second is still running; the disconnect monitor may (will?) treat this as an error and fail.

mschubert commented 3 years ago

This was merged to master as well; please use the CRAN version until this is resolved.

master has been reverted

mschubert commented 3 years ago

To add some more explanation here, every time you get a

1 peer(s) lost

without a worker crashing first, that is this bug.

This is caused by the fact that the monitor signal (my SO question here)

ZMQ_EVENT_DISCONNECTED: The socket was disconnected unexpectedly. The event value is the FD of the underlying network socket. Warning: this socket will be closed.

also fires when the socket is closed normally (or closed after explicitly disconnecting). So it looks like I need to track whether a disconnect is clean or not at the application level, where peer identities are normally abstracted away by the ZeroMQ sockets I'm using.

So it seems this requires a fair bit of rewriting of the monitoring logic.
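
A minimal sketch of what application-level tracking could look like, written in Python for illustration only; it is not clustermq's actual code, and it is not the fix that was eventually applied. It assumes each worker sends an explicit shutdown message before closing its socket, and that the monitor's disconnect events carry no peer identity, so they can only be matched against announced shutdowns by count. The names `DisconnectTracker`, `worker_ready`, `worker_shutdown` and `on_disconnect_event` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DisconnectTracker:
    """Classifies monitor disconnect events as clean or lost (hypothetical helper)."""
    active: set = field(default_factory=set)  # peer ids that registered and have not said goodbye
    pending_clean: int = 0                    # announced shutdowns not yet matched to an event
    lost: int = 0                             # disconnect events that nobody announced

    def worker_ready(self, peer_id):
        # Called when a worker registers with the master.
        self.active.add(peer_id)

    def worker_shutdown(self, peer_id):
        # Assumed application-level message sent by a worker just before it closes its socket.
        self.active.discard(peer_id)
        self.pending_clean += 1

    def on_disconnect_event(self):
        # The monitor event has no peer identity, so match by count only.
        if self.pending_clean > 0:
            self.pending_clean -= 1
            return "clean"
        self.lost += 1
        return "lost"


# Two workers: w1 announces its shutdown, w2 disappears without notice.
tracker = DisconnectTracker()
tracker.worker_ready("w1")
tracker.worker_ready("w2")
tracker.worker_shutdown("w1")
first = tracker.on_disconnect_event()   # matched to w1's announced shutdown
second = tracker.on_disconnect_event()  # nothing announced, counted as a lost peer
```

Matching by count is only an approximation: if a crash and a clean shutdown happen close together, the events cannot be told apart, which is part of why this needs more than a small patch to the monitoring logic.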

mschubert commented 3 years ago

Fixed with the bundled libzmq.