TCP on production cloud at 10000 ms suffers

Original report (archived issue) by Jeremy White (Bitbucket: knitfoo).

We are running on round 2 and are suffering badly with our TCP comms.

We successfully navigated task 1, and made it part way through task 2.

Due to a bug in our comms, we had to reconnect our OCU to the FC. (That in itself is normal: our code is very tolerant of connects and disconnects, and reconnecting is a planned part of our operation.)

But with that new TCP connection, we were unable to regain working comms.

With a great deal of experimentation, we have determined that if we 'train' the TCP window, we can regain reliable comms.

That is, our steady state is to send about 150 bytes every other second. Periodically, we will also send 70k image bursts. A 70k burst will come in as 2k, 2k, and so on, and then stall at about 30k.
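For anyone trying to reproduce the 2k, 2k, ... stall pattern on the receiving side, here is a minimal sketch of a chunk logger. It is not our actual comms code: the function name `log_chunks` and its parameters are made up for illustration, and it assumes you already have a connected socket and know how many bytes to expect.

```python
import socket
import time


def log_chunks(sock, expected_bytes, timeout_s=30.0, bufsize=65536):
    """Read up to expected_bytes from sock and record each recv().

    Returns (received, chunks), where chunks is a list of
    (elapsed_seconds, chunk_length) tuples, one per recv() call.
    Hypothetical helper for diagnosis only.
    """
    sock.settimeout(timeout_s)
    start = time.monotonic()
    chunks = []
    received = 0
    while received < expected_bytes:
        try:
            data = sock.recv(min(bufsize, expected_bytes - received))
        except socket.timeout:
            break  # nothing arrived within timeout_s: treat as a stall
        if not data:
            break  # peer closed the connection
        received += len(data)
        chunks.append((time.monotonic() - start, len(data)))
    return received, chunks
```

If `received` comes back short of `expected_bytes`, the transfer stalled or the peer closed, and the `chunks` list shows how the data trickled in before that. On a high-latency link, `timeout_s` would need to be comfortably larger than the round trip.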

If instead we send a lidar image (10k), it comes through in the same slow fashion, but it does eventually arrive in full. If you then continue requesting lidar images, they come through without that delay.

After doing that, the TCP window appears to be trained, and you can get all images and everything just fine.
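The 'training' procedure described above could be automated roughly as follows. This is only a sketch under assumptions: `request_small` stands in for whatever call fetches one lidar image (hypothetical, not our real API), and `fast_enough_s` is a threshold you would have to pick yourself, above the link's round-trip time.

```python
import time


def warm_up(request_small, fast_enough_s, max_attempts=10, clock=time.monotonic):
    """Repeat a small request until one completes quickly.

    request_small: hypothetical callable that fetches one small payload
                   (e.g. a 10k lidar image) and returns when it has arrived.
    fast_enough_s: elapsed time at or below which the link counts as trained.
    Returns the number of attempts made, or None if no attempt was fast.
    """
    for attempt in range(1, max_attempts + 1):
        start = clock()
        request_small()
        elapsed = clock() - start
        if elapsed <= fast_enough_s:
            return attempt
    return None
```

Only once `warm_up` returns a number would the operator go on to request the 70k images. Whether this is what actually fixes the window, rather than something that merely correlates with it, is still a guess on our part.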

Wireshark corroborates that, although I have forgotten enough of my Stevens that I can't read what the various parameters mean.

osrf/srcsim issue #257