Switch to standby path more tricky than expected

huitema commented 5 months ago

I did a big batch of debugging on the "unique path-id" code in picoquic, porting all the tests that were designed for the previous multipath version, and I found an interesting issue in the "standup" test. The test starts by setting up a client-to-server connection with two paths, one available and one in standby. It runs for a while, then simulates cutting the "available" path off. The expectation is that the connection will continue with the "standby" path. The test was initially failing.

The simulated traffic is from server to client. The server quickly detects that the available path is down, and starts sending data packets on the "standby" path. But the client only sends ACKs, and does not react quickly if its ACK packets are not acknowledged. So the client keeps sending ACKs on the "available" path. Of course, since the path is cut, they are dropped. The server does not see ACKs for the packets sent on the "standby" path, so it quickly concludes that this path is down too. Pretty soon, the connection breaks.

There are two potential fixes. One would be to somehow force ACKs on the standby path, if packets are received on that path. The other, which I feel is more robust, is for the server to mark the standby path as "available" if the available path is "broken". I did that, i.e. "promote" the standby path to available, and the "standup" test now succeeds.
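
For illustration, here is a minimal sketch of why promotion helps. It assumes a hypothetical receiver policy that sends ACKs on whichever "available" path most recently delivered a packet. The names `Path`, `Status` and `pick_ack_path` are made up for this example and are not picoquic APIs.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    AVAILABLE = "available"
    STANDBY = "standby"


@dataclass
class Path:
    path_id: int
    status: Status         # status the peer asked for on this path
    last_rx: float = 0.0   # time of the last packet received on this path


def pick_ack_path(paths):
    """Hypothetical receiver policy: send ACKs on the "available" path that
    most recently delivered a packet; use standby only if none is available."""
    available = [p for p in paths if p.status is Status.AVAILABLE]
    candidates = available or paths
    return max(candidates, key=lambda p: p.last_rx)


# Client view: path 0 is cut (the client does not know yet),
# and data now arrives on path 1.
client_paths = [Path(0, Status.AVAILABLE, last_rx=1.0),
                Path(1, Status.STANDBY, last_rx=2.0)]
before = pick_ack_path(client_paths).path_id   # ACKs still go to the cut path

# The server promotes path 1; the client applies the new status.
client_paths[1].status = Status.AVAILABLE
after = pick_ack_path(client_paths).path_id    # ACKs now follow the data
```

Under this assumed policy the client keeps acknowledging on path 0 until the status of path 1 changes, which matches the failure seen in the test.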

This issue might deserve some discussion in the multipath draft.

huitema commented 5 months ago

A lot of this is a tradeoff between simplicity and performance. For performance reasons, we want the traffic to start using the standby path if the available path is "dubious". In my implementation, that's on first PTO. Of course, the PTO may or may not be due to a link failure, but hopefully it is infrequent enough that using the standby path briefly in case of PTO does not break the "spirit" of putting a path in standby.

Sending an "abandon path" immediately would also force traffic onto the standby path, but it is more drastic. If the packet loss situation was temporary, it causes the system to stop using the "available" path forever. In contrast, promoting the standby path to "available" is easily reversible. If, after a PTO or a couple of RTOs, the "available" path is restored, the client can decide to put the standby path back in standby mode.
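
A rough sketch of that policy, to make the "reversible" point concrete. The threshold `ABANDON_AFTER` and all names are assumptions for illustration, not values from picoquic or the draft.

```python
from dataclasses import dataclass

ABANDON_AFTER = 4  # assumed number of consecutive PTOs before giving up


@dataclass
class PathState:
    status: str = "available"   # "available", "standby" or "abandoned"
    pto_count: int = 0
    promoted: bool = False      # True if we promoted this path ourselves


def on_pto(primary, standby):
    """Called when a PTO fires on the primary path."""
    primary.pto_count += 1
    if primary.pto_count >= ABANDON_AFTER:
        primary.status = "abandoned"
    if standby.status == "standby":
        # First sign of trouble: promote, so the peer may use this path too.
        standby.status = "available"
        standby.promoted = True


def on_ack(primary, standby):
    """Called when the primary path acknowledges data again."""
    if primary.status == "abandoned":
        return  # abandoning is not reversible; a new path would be needed
    primary.pto_count = 0
    if standby.promoted:
        # The loss was temporary: put the standby path back in standby.
        standby.status = "standby"
        standby.promoted = False
```

A single PTO followed by an acknowledgement leaves both paths exactly as they started, while an abandoned path stays abandoned.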

mirjak commented 5 months ago

I actually think it is clearer to explicitly close a path if you detect it's broken. If the path comes back (whatever that means), you can simply try to open it again, or actually open a new path in this case. However, we can enforce this behaviour in either way, therefore I think we should discuss the issue and maybe explain the different solutions, but not make any strict recommendation.

huitema commented 5 months ago

You don't know when the path comes back. If you urgently need it, then you need to send probes regularly. If the path context is still up, you can do that by sending a ping, or repeating a path challenge at short intervals. If the path is gone, you need to create a new path, send a challenge, etc. If the challenge fails, you should also send an abandon -- because the peer may have received the challenge even though the response did not make it. So you consume the "number of paths" resource, and also the "number of CID".

mirjak commented 5 months ago

If you send pings you are supposed to close the path after a timeout.

Also if the path comes "back", you really don't know if that is still the same path. I think it would be much safer to send a path challenge.

This is what we currently say about recent addresses:

   Section 9.3 of [QUIC-TRANSPORT] allows an endpoint to skip validation
   of a peer address if that address has been seen recently.  However,
   when the multipath extension is used and an endpoint has multiple
   addresses that could lead to switching between different paths, it
   should rather maintain multiple open paths instead.

qdeconinck commented 5 months ago

This issue reminds me of the experience I had when trying to optimise the latency of applications on smartphone devices with MPTCP and older non-standard MPQUIC implementations, where the WiFi is considered cheap and the cellular expensive. When smartphone users are initially connected to WiFi and move away from their access point, they may eventually be out of WiFi reachability, leading to packet losses. There is often some delay between applications experiencing packet losses and the system declaring the WiFi as lost. During that interval, the WiFi path acts as a blackhole.

The path priority (available/standby) is a scheduling concern, and each endpoint runs its own algorithm. It is up to each endpoint to determine when to start using "standby" paths. The "path health status" is also local information, and depending on your traffic (fully upload or fully download), only one endpoint may be aware of a lossy path. Scheduling decisions at both sides impact the performance of the multipath transfer.

I see two ways of handling this:

Either we define some informative frame about the "path health status". I had such a concept in my research experiments at that time, where a "path health status" frame advertises to the other endpoint that the related path suffers from connectivity issues. It is unlikely that we are going to define such a frame in the core multipath draft.
Or we could suggest to implementers (implementation considerations) that a "pure receiver" (i.e., only receiving data, not sending any) generate ack-eliciting frames (such as PING) from time to time, to proactively detect that there is a connectivity issue on that path. Probably the way to go for this draft.
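
A small sketch of the second option, assuming a receiver that otherwise only sends ACK frames. The class name and `ping_interval` are hypothetical, and the draft does not define any interval.

```python
class PureReceiverProbe:
    """Adds a PING to an outgoing ACK packet once per ping_interval, so the
    packet becomes ack-eliciting and loss detection can run on this path."""

    def __init__(self, ping_interval):
        self.ping_interval = ping_interval
        self.last_ack_eliciting = 0.0  # time we last sent an ack-eliciting frame

    def frames_for_next_packet(self, now):
        frames = ["ACK"]
        if now - self.last_ack_eliciting >= self.ping_interval:
            frames.append("PING")
            self.last_ack_eliciting = now
        return frames


probe = PureReceiverProbe(ping_interval=1.0)
sent = [probe.frames_for_next_packet(t) for t in (0.5, 1.0, 1.5, 2.0)]
```

With these inputs only the packets at 1.0 and 2.0 carry a PING, so in this sketch a break can go unnoticed by the receiver for up to one `ping_interval` plus its own PTO.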

huitema commented 5 months ago

@qdeconinck "from time to time" is the issue. There is bound to be a delay between the time the sender notices packets are not getting acked and the time a pure receiver notices that the occasional PING is not acked. The pure receiver will also tend to use longer estimates for the RTO. So we get the sequence:

1. "Available" path breaks
2. Sender notices a PTO, may start sending traffic to "standby" path
3. Receiver notices a PTO, may start sending ACKs through "standby" path
4. Sender notices too many RTO, decides to abandon path
5. Receiver also abandons path.

The issue that I find is that if the gap between (2) and (3) is too long, the sender will also notice a PTO on the "standby" path, because the ACKs are sent by the receiver through the "available" path, and lost. My "solution" is:

1. "Available" path breaks
2. Sender notices a PTO, starts sending traffic to "standby" path, sends "PATH_AVAILABLE" to promote the standby path.
3. Receiver gets the "PATH_AVAILABLE", starts sending ACKs on the standby path -- and typically only on that path after noticing a PTO on the old path.
4. Sender notices too many RTO, decides to abandon path
5. Receiver also abandons path.

@mirjak proposes to just rely on "ABANDON_PATH". That's doable, but it results in:

1. "Available" path breaks
2. Sender notices a PTO, but does not change the scheduling of packets
3. Receiver notices a PTO, but does not change the scheduling of packets
4. Sender notices too many RTO, decides to abandon path, sends traffic through standby path
5. Receiver also abandons path, sends traffic through standby path.

That will work, but that means the traffic resumes after "too many RTOs", which is a pain for several applications.
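
To compare the three sequences above, here is a back-of-the-envelope model. It assumes the PTO doubles on each expiry (as in RFC 9002), that a control frame takes one one-way delay to reach the peer, and that "too many RTO" means a fixed count. The function and the example numbers are illustrative only, not measurements.

```python
def resume_times(sender_pto, receiver_pto, one_way_delay, ptos_before_abandon):
    """Time after the break (in ms) at which both data and ACKs flow on the
    standby path, for each of the three strategies."""
    # Sum of doubling PTOs: pto * (1 + 2 + 4 + ...) = pto * (2**n - 1)
    abandon = sender_pto * (2 ** ptos_before_abandon - 1)
    return {
        # Each side moves traffic when it sees its own PTO.
        "independent PTOs": max(sender_pto, receiver_pto),
        # Sender promotes on first PTO; the receiver follows once told.
        "promote standby": sender_pto + one_way_delay,
        # Nothing moves until the sender abandons the path.
        "abandon only": abandon + one_way_delay,
    }


# Example: 100 ms sender PTO, 500 ms receiver PTO, 20 ms one-way delay,
# abandon after 4 consecutive PTOs.
times = resume_times(100, 500, 20, 4)
```

In this example the first strategy leaves a 400 ms window in which data sent on the standby path is not acknowledged, which is longer than the assumed 100 ms PTO, so the sender would suspect the standby path as well.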

quicwg / multipath

Switch to standby path more tricky than expected #293