huitema / quic-in-space

Discuss QUIC in Space, produce draft

Long duration connections with intermittent connectivity #12

Open huitema opened 9 months ago

huitema commented 9 months ago

We need a discussion here, because it is not obvious that long-duration connections work well, especially with intermittent connectivity such as would occur between Earth and a station on a rotating planet.

Suppose that the normal RTT is 20 minutes (1200 s), but that after every 12 hours of connectivity the link is suspended for 12 hours. The endpoints will detect the loss of connectivity after 1 PTO, which is about 1 RTT. The QUIC code will attempt to repeat the last packet, per RFC 9002, waiting first for one PTO and then doubling the timeout after each iteration. This gives the following:

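| RTT (s) | Iteration | Timeout (s) | Time elapsed (hours) |
|---------|-----------|-------------|----------------------|
| 1200 | 0 | 1200 | 0.33 |
| 1200 | 1 | 1200 | 0.67 |
| 1200 | 2 | 2400 | 1.33 |
| 1200 | 3 | 4800 | 2.67 |
| 1200 | 4 | 9600 | 5.33 |
| 1200 | 5 | 19200 | 10.67 |
| 1200 | 6 | 38400 | 21.33 |
| 1200 | 7 | 76800 | 21.67 |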

We can certainly let the endpoints do that, but look at what happens at the end. By repetition number 6, the exponential backoff has increased the timeout to over 10 hours. Repetition number 7 starts 21:20 hours after the initial "cut", when timeout number 6 fires. It succeeds, and the endpoint learns at 21:40 that the link is back up. But that is 9:40 hours after the path was restored!
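To make the arithmetic easy to check, here is a minimal Python sketch of the timeline above. It is not taken from any QUIC stack: the function name `backoff_timeline` is hypothetical, the PTO is approximated as exactly one RTT, the first repeat waits one PTO before doubling starts (as in the table), and a probe is assumed to succeed as soon as it is sent on a restored link.

```python
RTT_S = 20 * 60        # 1200 s round-trip time, from the example
OUTAGE_S = 12 * 3600   # the link stays down for 12 hours after the "cut"


def backoff_timeline(rtt_s, outage_s):
    """Yield (iteration, timeout_s, elapsed_s) each time the timer fires.

    Stops at the first timer expiry that falls after the outage has ended,
    i.e. the first probe that is sent on a working link.
    """
    elapsed = 0
    timeout = rtt_s          # assumption: PTO is roughly one RTT
    iteration = 0
    while True:
        elapsed += timeout   # the timer fires here and a probe is sent
        yield iteration, timeout, elapsed
        if elapsed >= outage_s:
            return           # this probe finds the link restored
        if iteration > 0:    # first repeat waits one PTO, then doubling
            timeout *= 2
        iteration += 1


rows = list(backoff_timeline(RTT_S, OUTAGE_S))
for iteration, timeout, elapsed in rows:
    print(f"{iteration:>9} {timeout:>11} {elapsed / 3600:>8.2f}")

# The successful probe is acknowledged one RTT after it is sent.
learned_at_s = rows[-1][2] + RTT_S
lag_s = learned_at_s - OUTAGE_S
print(f"link known to be up at {learned_at_s / 3600:.2f} h, "
      f"{lag_s / 3600:.2f} h after it was restored")
```

Under these assumptions the last timer fires at 76800 s (21:20) and the acknowledgment arrives one RTT later, which matches the 21:40 and 9:40 figures above.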

In short, relying on the standard timer behavior is not good enough. We will need a new parameter, something like a maximum interval between repetitions, set as a fraction of the average duration of connectivity. Or, we will have to add a "connectivity restored" signal of some kind, to kick the QUIC stack out of its sleeping state.
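As a rough illustration of the first option, the sketch below compares the detection lag with and without a cap on the interval between repetitions. Everything here is an assumption made for illustration: the function `detection_lag` and the parameter `max_interval_s` are hypothetical names, not an existing QUIC transport parameter, the PTO is approximated as one RTT, and the one-hour cap (one twelfth of the 12-hour connectivity window) is an arbitrary choice.

```python
def detection_lag(rtt_s, outage_s, max_interval_s=None):
    """Seconds between link restoration and the endpoint learning about it.

    Models a timer that waits one PTO (approximated as one RTT), repeats
    once at the same interval, then doubles, optionally capped at
    max_interval_s (a hypothetical parameter).
    """
    elapsed = 0
    timeout = rtt_s
    first = True
    while True:
        elapsed += timeout               # timer fires, a probe is sent
        if elapsed >= outage_s:
            # Probe sent on a live link; the ack arrives one RTT later.
            return elapsed + rtt_s - outage_s
        if not first:
            timeout *= 2                 # exponential backoff
            if max_interval_s is not None:
                timeout = min(timeout, max_interval_s)
        first = False


RTT_S = 20 * 60
OUTAGE_S = 12 * 3600

uncapped = detection_lag(RTT_S, OUTAGE_S)
capped = detection_lag(RTT_S, OUTAGE_S, max_interval_s=3600)
print(f"uncapped lag: {uncapped / 60:.0f} min")
print(f"capped lag:   {capped / 60:.0f} min")
```

With these numbers the cap brings the lag down from 9:40 hours to 40 minutes, at the cost of roughly one probe per hour during the outage. Whether that trade-off is acceptable depends on how expensive each probe is for the sending endpoint.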

huitema commented 9 months ago

There are other issues. We want connections to last "forever", and to keep repeating some kind of probe until they get a reply. But what if a connection is actually broken? Some of the end-to-end connections go to the controller's workstation. What if that workstation fails, or is rebooted for a system update?

Most QUIC deployments avoid connections that last too long, and instead rely on QUIC's "session resumption" facility (with or without 0-RTT).

It might be good to require that all these "periodic pings" and "session resumption attempts" be started by the system with the most resources, e.g., a workstation rather than a spacecraft.