spcl / fmi

Function Message Interface (FMI): library for message-passing and collective communication for serverless functions.
https://mcopik.github.io/projects/fmi/

Nondeterministic connection breaking on AWS #9

Open mcopik opened 1 year ago

mcopik commented 1 year ago

We have observed that TCP connections created with TCPunch can fail at random on AWS. So far, we have not found the root cause. The observed behavior is that a TCP message is suddenly lost after the peers have exchanged 16-64 kB of data. The data is sent, as verified by a Wireshark analysis, but the receiver keeps retrying while waiting for a TCP packet that never arrives. We have been able to reproduce the issue between two VMs as well.

For now, we have implemented a workaround: after pairing, the two peers attempt to exchange 64 kB of data, and if the exchange fails, the pairing process is restarted.
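
A minimal sketch of this kind of probe-and-retry workaround, written in Python for illustration only; it is not the code used in FMI or TCPunch. The names `probe_connection`, `pair_with_retry`, and `pair_fn` are hypothetical, and `pair_fn` stands in for whatever routine establishes the connection. The sketch assumes that both peers run the probe at the same time right after pairing, and that a lost message shows up as a receive timeout.

```python
import socket

PROBE_SIZE = 64 * 1024  # the issue reports losses after 16-64 kB, so probe the upper bound
CHUNK = 4096            # small chunks keep both peers sending and receiving in lockstep


def probe_connection(sock, size=PROBE_SIZE, timeout=5.0):
    """Exchange `size` bytes in each direction; return False on timeout or error."""
    sock.settimeout(timeout)
    payload = b"\x00" * CHUNK
    sent = received = 0
    try:
        while sent < size or received < size:
            if sent < size:
                n = min(CHUNK, size - sent)
                sock.sendall(payload[:n])
                sent += n
            if received < size:
                data = sock.recv(min(CHUNK, size - received))
                if not data:
                    return False  # peer closed the connection
                received += len(data)
    except OSError:  # includes socket.timeout
        return False
    sock.settimeout(None)  # restore blocking mode for normal use
    return True


def pair_with_retry(pair_fn, max_attempts=3, **probe_kwargs):
    """Call the hypothetical `pair_fn` until a connection passes the probe."""
    for _ in range(max_attempts):
        sock = pair_fn()  # assumed to return a connected socket
        if probe_connection(sock, **probe_kwargs):
            return sock
        sock.close()  # discard the broken connection and pair again
    raise ConnectionError(f"probe failed after {max_attempts} pairing attempts")
```

In a real deployment both peers would also have to agree on when to restart pairing, since one side's probe may succeed while the other side's fails; the sketch leaves that coordination out.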