mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Monitor sockets for disconnects #150

Closed mschubert closed 3 years ago

mschubert commented 5 years ago

libzmq provides zmq_socket_monitor, which can be used to detect when peers disconnect, e.g. when workers shut down unexpectedly or are killed (I checked this: the TCP FIN packet is still sent with kill -9).

This would make us much more robust against stalls caused by worker crashes.

The approach might replace #33
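For illustration, here is a minimal sketch of how a socket-monitor event could be decoded once it has been received. It assumes the version 1 libzmq monitor format: a two-frame message whose first frame holds a 16-bit event ID followed by a 32-bit event value in native byte order, and whose second frame holds the affected endpoint as a string. The function names `parse_monitor_event` and `is_disconnect` are hypothetical and not part of clustermq, rzmq, or libzmq; receiving the frames from an actual monitor socket is left out.

```python
import struct

# Event IDs as defined in libzmq's zmq.h (only the ones relevant here).
ZMQ_EVENT_ACCEPTED = 0x0020
ZMQ_EVENT_CLOSED = 0x0080
ZMQ_EVENT_DISCONNECTED = 0x0200


def parse_monitor_event(frames):
    """Decode a two-frame monitor message (hypothetical helper).

    frames[0]: 6 bytes, uint16 event ID + uint32 event value, native order.
    frames[1]: the affected endpoint as a byte string.
    """
    header, endpoint = frames
    # "=" selects native byte order without padding, so the header is 6 bytes.
    event_id, value = struct.unpack("=HI", header)
    return {"event": event_id, "value": value, "endpoint": endpoint.decode()}


def is_disconnect(event):
    """True if the event reports that a peer connection went away."""
    return event["event"] == ZMQ_EVENT_DISCONNECTED


# Usage with a hand-built message instead of a real monitor socket:
example = [struct.pack("=HI", ZMQ_EVENT_DISCONNECTED, 17), b"tcp://127.0.0.1:5555"]
event = parse_monitor_event(example)
if is_disconnect(event):
    print("peer disconnected on", event["endpoint"])
```

Note that the event by itself only reports that a connection closed on the given endpoint; mapping it back to a specific worker is a separate problem that this sketch does not address.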

strazto commented 4 years ago

Hi @mschubert, I'm just wondering whether you're leaning more toward this or #33, or whether the two approaches could work together?

I know that both of these features are part of the v0.9 roadmap, and since I use clustermq so much for work, I'm happy to put in some work towards it, particularly on timeout robustness and fault tolerance.

multimeric commented 1 year ago

Hi, just wondering if this ever got implemented? I've had some jobs killed due to timeouts etc., and the master process doesn't seem to notice.

mschubert commented 1 year ago

It's implemented in develop but not usable yet:

https://github.com/mschubert/clustermq/blob/8fdb325bafbaa13a35c5349ed0cf476ee93f6837/src/CMQMaster.cpp#L73-L74

The original implementation with rzmq couldn't distinguish between clean disconnects and broken connections, so I'm rewriting it using libzmq/cppzmq directly.
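To make the distinction concrete, the following is a rough sketch of the bookkeeping a master might do. It is not the clustermq implementation: it assumes that workers send an explicit shutdown notice before closing their socket, and that a disconnect can somehow be attributed to a worker ID. The class `WorkerTracker` and all of its method names are hypothetical.

```python
class WorkerTracker:
    """Hypothetical bookkeeping to tell clean shutdowns from lost workers."""

    def __init__(self):
        self.active = {}           # worker ID -> job ID currently assigned
        self.said_goodbye = set()  # workers that announced their shutdown

    def assign(self, worker, job):
        """Record that a job was sent to a worker."""
        self.active[worker] = job

    def on_shutdown_notice(self, worker):
        """The worker announced that it is about to disconnect."""
        self.said_goodbye.add(worker)

    def on_disconnect(self, worker):
        """Classify a disconnect; return (status, job to requeue or None)."""
        job = self.active.pop(worker, None)
        if worker in self.said_goodbye:
            self.said_goodbye.discard(worker)
            return ("clean", None)
        # No prior notice: treat the connection as broken and hand back
        # the job the worker was processing so it can be rescheduled.
        return ("lost", job)


tracker = WorkerTracker()
tracker.assign("worker-1", "job-42")
status, requeue = tracker.on_disconnect("worker-1")
print(status, requeue)
```

With bookkeeping of this kind the master would no longer wait indefinitely for a result from a worker that was killed, and could reschedule the affected job or report an error.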

multimeric commented 1 year ago

That makes sense, thanks for the update!