sacs-epfl / decentralizepy

A decentralized learning research framework
MIT License

Fix Message Loss with ZMQ #9

Open rishi-s8 opened 1 year ago

rishi-s8 commented 1 year ago

With a large number of nodes on a single machine, and therefore a large number of sockets, ZMQ drops some messages. Investigate why, and in exactly which scenarios, this happens. TCP_ACK adds application-level retransmissions to work around the problem, but with TCP, messages should already arrive in first-in-first-out order. Stress-test TCP_ACK to check whether it still has the same problem as the basic version. If it does not, describe why.
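
For reference, the idea of application-level retransmission can be sketched without ZMQ at all. This is a minimal simulation using only the standard library, not the actual TCP_ACK code: `LossyChannel`, `send_reliably`, the drop rate, and the retry limit are all hypothetical names and values chosen for illustration. The sender repeats each message until it sees an acknowledgement, and the receiver discards duplicates by sequence number.

```python
import random


class LossyChannel:
    """Hypothetical stand-in for a transport that silently drops traffic."""

    def __init__(self, drop_rate, seed=0):
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)
        self.delivered = []  # (seq, payload) pairs accepted by the receiver
        self.seen = set()    # sequence numbers the receiver already accepted

    def send(self, seq, payload):
        """Return True if an ACK reached the sender, False otherwise."""
        if self.rng.random() < self.drop_rate:
            return False  # the message itself was lost
        if seq not in self.seen:  # receiver ignores retransmitted duplicates
            self.seen.add(seq)
            self.delivered.append((seq, payload))
        # The ACK travels over the same lossy channel and may also be lost.
        return self.rng.random() >= self.drop_rate


def send_reliably(channel, payloads, max_retries=50):
    """Send payloads in order, retransmitting each one until it is ACKed."""
    attempts = 0
    for seq, payload in enumerate(payloads):
        for _ in range(max_retries):
            attempts += 1
            if channel.send(seq, payload):
                break
        else:
            raise RuntimeError(f"message {seq} was never acknowledged")
    return attempts


channel = LossyChannel(drop_rate=0.3, seed=42)
messages = [f"model-update-{i}" for i in range(100)]
attempts = send_reliably(channel, messages)
print(f"{len(channel.delivered)} messages delivered in {attempts} attempts")
```

Because the sender waits for an ACK before moving on, the receiver sees every message exactly once and in order, at the cost of extra transmissions. A stress test of TCP_ACK could compare its delivered count against the basic version in the same way.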

rishi-s8 commented 6 months ago

Investigated this; it looks like a memory issue. ZMQ can be configured to never drop messages, but this only works when enough DRAM and/or swap is available.
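
The comment above does not name the options involved. A likely candidate, stated here as an assumption, is the pair of high-water marks `ZMQ_SNDHWM` and `ZMQ_RCVHWM`, which cap the number of messages queued per connection and can be set to 0 for "no limit". A minimal pyzmq sketch follows; the helper name `make_unbounded_socket` and the choice of a DEALER socket are hypothetical.

```python
import zmq


def make_unbounded_socket(context, socket_type):
    """Create a socket whose send and receive queues have no message limit.

    Assumption: the high-water marks are the relevant settings. A value of 0
    removes the limit, so queued messages are bounded only by available memory.
    """
    sock = context.socket(socket_type)
    # Socket options should be set before bind() or connect() to take effect.
    sock.setsockopt(zmq.SNDHWM, 0)
    sock.setsockopt(zmq.RCVHWM, 0)
    # Do not block on close waiting for unsent messages (illustration only).
    sock.setsockopt(zmq.LINGER, 0)
    return sock


context = zmq.Context()
sock = make_unbounded_socket(context, zmq.DEALER)

# Read the options back to confirm they were applied.
hwm = (sock.getsockopt(zmq.SNDHWM), sock.getsockopt(zmq.RCVHWM))
print("send/receive high-water marks:", hwm)

sock.close()
context.term()
```

With no limit, a slow or stalled peer makes the queue grow in memory instead of causing drops, which is consistent with the observation that this only helps while DRAM or swap lasts.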