mschubert closed this 3 years ago
Hi @mschubert, I'm just wondering whether you're leaning more toward this or #33, or whether the two approaches could work together?
I know that both features are part of the v0.9 roadmap, and since I use clustermq so much for work, I'm happy to put in some work towards it, particularly on timeout robustness and fault tolerance.
Hi, just wondering if this ever got implemented? I've had some jobs killed due to timeouts etc., and the master process doesn't seem to notice.
It's implemented in develop but not usable yet: the original implementation with rzmq couldn't distinguish between clean disconnects and broken connections, so I'm rewriting it using libzmq/cppzmq directly.
That makes sense, thanks for the update!
There is a zmq-socket-monitor that can be used to detect disconnects of peers, e.g. when the workers shut down unexpectedly or are killed (checked this; the FIN TCP/IP packet is still sent with kill -9).
This would make us much more robust to stalling because of worker crashes.
The approach might replace #33
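
To illustrate the kill -9 observation, here is a minimal sketch in plain Python sockets. It is not clustermq code and does not use ZeroMQ; it only demonstrates the TCP behaviour a socket monitor would rely on. The names `WORKER_CODE` and `detect_killed_worker` are hypothetical, and the sketch assumes a POSIX system where `Popen.kill()` sends SIGKILL. The "master" accepts a connection from a child process, kills the child, and then checks that the operating system still closes the connection on the child's behalf.

```python
import socket
import subprocess
import sys

# Hypothetical stand-in for a worker: connect to the master, then hang.
WORKER_CODE = """
import socket, sys, time
s = socket.create_connection(("127.0.0.1", int(sys.argv[1])))
time.sleep(60)
"""


def detect_killed_worker(timeout=10.0):
    """Return True if the master sees the connection end after the worker is killed."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # let the OS pick a free port
        server.listen(1)
        server.settimeout(timeout)
        port = server.getsockname()[1]

        worker = subprocess.Popen([sys.executable, "-c", WORKER_CODE, str(port)])
        try:
            conn, _ = server.accept()  # returns once the worker has connected
            with conn:
                conn.settimeout(timeout)
                worker.kill()  # SIGKILL on POSIX: the worker cannot clean up
                worker.wait()
                try:
                    data = conn.recv(1)
                except ConnectionResetError:
                    return True  # connection reset: also a detectable disconnect
                except socket.timeout:
                    return False  # nothing arrived: the master would stall
                # An empty read means the peer's side was closed (FIN received).
                return data == b""
        finally:
            if worker.poll() is None:
                worker.kill()
            worker.wait()
```

Calling `detect_killed_worker()` should return `True` when the kernel closes the killed worker's socket. With libzmq, the analogous signal would presumably surface through the socket monitor as a `ZMQ_EVENT_DISCONNECTED` event rather than an empty read.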