osrf / srcsim

Space Robotics Challenge

TCP on production cloud at 10000 ms suffers #257

Open osrf-migration opened 7 years ago

osrf-migration commented 7 years ago

Original report (archived issue) by Jeremy White (Bitbucket: knitfoo).


We are running on round 2 and are suffering badly with our TCP comms.

We successfully navigated task 1, and made it part way through task 2.

Due to a bug in our comms, we had to reconnect our OCU to the FC. (But that is normal; our code is very tolerant of connects and disconnects, and reconnecting is a planned part of our operation.)

But with that new TCP connection, we were unable to regain working comms.

With a great deal of experimentation, we have determined that if we 'train' the TCP window, we can regain reliable comms.

That is, our steady state is to send about 150 bytes every other second. Periodically, we send 70k image bursts. Without training, a 70k burst comes in as 2k, 2k, and so on, and then stalls at about 30k.

If we instead send a lidar image (10k), it arrives in the same slow fashion, but it does eventually come through. If you then keep requesting lidar images, they come through without that delay.

After doing that, the TCP window appears to be trained, and you can get all images and everything just fine.
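
For anyone who wants to try the same workaround, here is a rough sketch of the 'training' step in Python. It is not our actual code. It assumes a hypothetical peer that echoes back whatever it receives, standing in for the lidar request/response exchange, and the names `warm_up`, `recv_exact`, `payload_size`, `fast_enough_s` and `max_rounds` are made up for illustration. The idea is just to repeat a medium-sized (about 10k) transfer on the fresh connection until one round completes quickly, and only then ask for the large images.

```python
import socket
import time


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed during warm-up")
        buf.extend(chunk)
    return bytes(buf)


def warm_up(sock, payload_size=10_000, fast_enough_s=1.0, max_rounds=10):
    """Repeat a medium-sized echo exchange until one round is fast enough.

    Assumption: the peer echoes every byte back. Returns the list of
    per-round times in seconds, so the caller can see whether the
    connection ever sped up.
    """
    timings = []
    payload = b"\x00" * payload_size
    for _ in range(max_rounds):
        start = time.monotonic()
        sock.sendall(payload)
        recv_exact(sock, payload_size)
        elapsed = time.monotonic() - start
        timings.append(elapsed)
        if elapsed <= fast_enough_s:
            break  # the connection looks 'trained'; stop warming up
    return timings
```

You would call `warm_up(sock)` right after reconnecting and before requesting the first 70k image. On a 10000 ms link, `fast_enough_s` would have to be set well above the link delay; the right threshold is something to find by experiment.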

Wireshark corroborates that, although I have forgotten enough of my Stevens that I can't read what the various parameters mean.

osrf-migration commented 7 years ago

Original comment by Erica Tiberia (Bitbucket: T_AL).


I had the same problem with TCP: my first connection was fine, but when I had to establish a new connection mid-run, I wasn't able to regain comms, even though our setup normally handles reconnects without any problem.

Thanks for the notes on a potential fix.