huitema / quic-in-space

Discuss QUIC in Space, produce draft

idle_timeout = zero #3

Open marcblanchet opened 11 months ago

marcblanchet commented 11 months ago

We are currently discussing the idle_timeout as a parameter that would have a value, but I think we should also say something about not setting the idle_timeout. In fact, that should be a strong recommendation. In the deep space use case, the ground stations and systems will be "talking" to the spacecraft for the whole lifetime of the mission: months, years, sometimes decades. Re-establishing a connection is "costly" in space because of delays and disruptions, and it is an additional risk if it fails or takes too much time (think of an urgent command that must be sent because of a major issue). So the preference should be for the idle_timeout to be unset or set to zero. I can write something and make a PR if you agree.

huitema commented 11 months ago

There are two parts: adapting to long-delay links so the timeout does not cause unwanted disconnection, and not closing connections unless the application explicitly asks for it. The first point is solved by the original QUIC spec (RFC 9000), which mandates that the effective timeout is the max of the stated value and 3 times the Probe Timeout (PTO), and the PTO is itself at least one RTT. The second point is entirely up to the application.
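To make the rule concrete, here is a minimal sketch of how the effective idle timeout could be derived, assuming the RFC 9000 behavior: each side advertises max_idle_timeout, the smaller non-zero value wins, the result is raised to at least 3 times the PTO, and the timeout is disabled only when both sides advertise zero or omit the parameter. The function name and the use of seconds are my own choices for illustration, not any stack's API.

```python
from typing import Optional


def effective_idle_timeout(local_s: float, peer_s: float, pto_s: float) -> Optional[float]:
    """Return the effective idle timeout in seconds, or None if disabled.

    A value of 0 stands for "omitted or set to zero" by that endpoint.
    Hypothetical helper: names and units are assumptions for illustration.
    """
    advertised = [t for t in (local_s, peer_s) if t > 0]
    if not advertised:
        # Both endpoints sent 0 or omitted the parameter: no idle timeout.
        return None
    negotiated = min(advertised)
    # Floor of three PTOs, so several probes can be lost before giving up.
    return max(negotiated, 3 * pto_s)


# PTO of about 20 minutes (1200 s), both sides advertising only 30 s.
print(effective_idle_timeout(30.0, 30.0, 1200.0))
# One side asks for no timeout, the other advertises two hours.
print(effective_idle_timeout(0.0, 7200.0, 1200.0))
```

Note that under this rule one endpoint setting zero is not enough: the other side's non-zero value still applies.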

Suppose a 20-minute RTT. The default would be to close the connection after about an hour of silence (3*RTT). The alternative, a zero timeout, would allow transmission on the connection to continue after an hour of silence. The expectation is that just continuing would be less costly than starting a new connection. I am not sure this expectation holds:

1) After a long period of silence, we have an ambiguous situation. Is the other end still there, or not? If it is there, sending more data is mostly fine. If it is not, sending more data will cause packet losses, which will cause the closure of the connection after several retransmission attempts. It is not clear that this is better than a new connection attempt.

2) After a long period of silence, the transmission conditions may have changed. Stations or relays may have moved, the transmission balance of some paths may have been altered, etc. The connection will have to find out the new values, and that process is very similar to starting a new connection.

3) QUIC supports session resumption and 0-RTT. If we also remember the RTT and BDP from previous connections, starting a new connection is going to be very efficient.
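As a rough illustration of point 3, the remembered path parameters could live in a small per-destination cache that a new connection consults before falling back to its defaults. Everything here (class names, the string destination key, the age limit) is hypothetical; the age limit is a nod to point 2, since stale values may no longer describe the path.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PathSample:
    rtt_s: float       # last measured round-trip time, in seconds
    bdp_bytes: int     # last estimated bandwidth-delay product
    saved_at_s: float  # time at which the sample was stored


class PathCache:
    """Hypothetical store of RTT/BDP from earlier connections."""

    def __init__(self, max_age_s: float) -> None:
        self._max_age_s = max_age_s
        self._samples: Dict[str, PathSample] = {}

    def remember(self, dest: str, rtt_s: float, bdp_bytes: int, now_s: float) -> None:
        self._samples[dest] = PathSample(rtt_s, bdp_bytes, now_s)

    def lookup(self, dest: str, now_s: float) -> Optional[PathSample]:
        sample = self._samples.get(dest)
        if sample is None or now_s - sample.saved_at_s > self._max_age_s:
            # Unknown or too old: the caller should use its defaults.
            return None
        return sample


cache = PathCache(max_age_s=86400.0)
cache.remember("orbiter", rtt_s=1200.0, bdp_bytes=5_000_000, now_s=0.0)
print(cache.lookup("orbiter", now_s=3600.0))
```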

Even if we want to maintain long-duration connections, I think the solution is to mandate a "keep alive" process, such as a ping every 20 minutes, to make sure that the timeout does not expire. This is much more likely to work than just ignoring long silences and hoping that the link is still there.
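A minimal sketch of such a keep-alive, under simplifying assumptions: the path delay is constant, so the gap between pings arriving at the peer equals the sending interval, and each lost ping stretches that gap by one more interval. The names are hypothetical and not taken from any QUIC implementation.

```python
class KeepAliveTimer:
    """Hypothetical timer that says when the next PING should be sent."""

    def __init__(self, interval_s: float, now_s: float = 0.0) -> None:
        self.interval_s = interval_s
        self.last_sent_s = now_s

    def on_packet_sent(self, now_s: float) -> None:
        # Any packet we send counts as activity, not only PINGs.
        self.last_sent_s = now_s

    def ping_due(self, now_s: float) -> bool:
        return now_s - self.last_sent_s >= self.interval_s


def survives_losses(interval_s: float, idle_timeout_s: float, losses: int) -> bool:
    """True if the peer still sees a packet before its idle timer expires
    when `losses` consecutive pings are lost (constant delay assumed)."""
    return (losses + 1) * interval_s < idle_timeout_s


timer = KeepAliveTimer(interval_s=1200.0)
print(timer.ping_due(now_s=600.0), timer.ping_due(now_s=1200.0))
# 20-minute pings against a one-hour idle timeout.
print(survives_losses(1200.0, 3600.0, losses=1))
print(survives_losses(1200.0, 3600.0, losses=2))
```

With these numbers, a ping every 20 minutes tolerates one lost ping but not two in a row, which gives a way to pick the interval from the loss pattern one expects.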

marcblanchet commented 11 months ago

All good points. I wonder if these should be put in the draft in some form. To me, they are really relevant, maybe more on the deployment side than for pure implementation, but they could also show the choices a base implementation has to make (i.e., externalize idle_timeout, making it per connection, per destination, ...).
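One way to picture "externalizing" the idle_timeout is a small lookup with three levels: an explicit per-connection value, then a per-destination value, then a global default. This is only a sketch; the class, the destination names, and the convention that 0 means "no idle timeout requested" are assumptions for illustration.

```python
from typing import Dict, Optional


class IdleTimeoutPolicy:
    """Hypothetical policy: per-connection beats per-destination beats default."""

    def __init__(self, default_s: float,
                 per_destination: Optional[Dict[str, float]] = None) -> None:
        self._default_s = default_s
        self._per_destination = dict(per_destination or {})

    def resolve(self, destination: str,
                connection_override_s: Optional[float] = None) -> float:
        if connection_override_s is not None:
            return connection_override_s
        return self._per_destination.get(destination, self._default_s)


# 0.0 stands for "no idle timeout requested" for the deep space peer.
policy = IdleTimeoutPolicy(default_s=30.0, per_destination={"mars-orbiter": 0.0})
print(policy.resolve("mars-orbiter"))
print(policy.resolve("ground-relay"))
print(policy.resolve("mars-orbiter", connection_override_s=7200.0))
```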