With a large number of nodes on a single machine (that is, many sockets), ZMQ drops some messages. Investigate why, and in exactly which scenarios, this happens.
TCP_ACK introduces application-level retransmissions to work around this problem, but with TCP, messages should already be delivered in first-in-first-out order. Stress-test TCP_ACK to determine whether it still has the same problem as the basic version. If not, describe why.
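A stress test needs a way to tell dropped messages apart from reordered or duplicated ones. The following is a minimal sketch of a receiver-side checker, assuming every test message carries a sender id and a per-sender sequence number starting at 0. The `DeliveryChecker` class and that message layout are hypothetical and not part of the existing code.

```python
from collections import defaultdict


class DeliveryChecker:
    """Tracks per-sender sequence numbers seen by a receiver (hypothetical helper)."""

    def __init__(self):
        self.seen = defaultdict(set)  # sender_id -> sequence numbers received
        self.highest = {}             # sender_id -> highest sequence number so far
        self.reordered = 0            # messages that arrived after a later one
        self.duplicates = 0           # messages received more than once

    def record(self, sender_id, seq):
        """Register one received message."""
        seen = self.seen[sender_id]
        if seq in seen:
            self.duplicates += 1
            return
        previous_highest = self.highest.get(sender_id, -1)
        if seq < previous_highest:
            # A FIFO transport should never deliver an older message late.
            self.reordered += 1
        seen.add(seq)
        self.highest[sender_id] = max(seq, previous_highest)

    def missing(self, sender_id, sent_count):
        """Return the sequence numbers in [0, sent_count) that never arrived."""
        return sorted(set(range(sent_count)) - self.seen[sender_id])
```

Running the same checker against the basic version and against TCP_ACK, with the same node count, would show whether TCP_ACK still loses messages after its retransmissions.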
Investigated: this looks like a memory issue. ZMQ can be configured to never drop messages, but that only works when enough DRAM and/or swap is available.
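The note does not say which settings were used. One plausible candidate is the socket high-water marks: in libzmq, `ZMQ_SNDHWM` and `ZMQ_RCVHWM` default to 1000 messages, and a value of 0 means no limit, so queues grow until memory runs out. The sketch below uses pyzmq and assumes that this is the configuration meant; the helper name `make_unbounded_socket` is hypothetical.

```python
import zmq


def make_unbounded_socket(ctx, socket_type):
    """Create a socket whose send/receive queues have no message limit.

    Assumption: the "never drop" configuration refers to the high-water marks.
    Set them before bind()/connect() so they apply to the new connections.
    """
    sock = ctx.socket(socket_type)
    sock.setsockopt(zmq.SNDHWM, 0)  # 0 = no limit on queued outgoing messages
    sock.setsockopt(zmq.RCVHWM, 0)  # 0 = no limit on queued incoming messages
    return sock


if __name__ == "__main__":
    ctx = zmq.Context()
    pub = make_unbounded_socket(ctx, zmq.PUB)
    print(pub.getsockopt(zmq.SNDHWM), pub.getsockopt(zmq.RCVHWM))
    pub.close(0)
    ctx.term()
```

With unbounded queues, every backlogged message stays in memory, which matches the observation that this only works while enough DRAM and/or swap is available.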