What to do on path time-out?

mirjak commented 6 days ago

With multipath you can have multiple paths open without using all of them concurrently (e.g. keeping one as a standby). A path time-out therefore makes less sense. We probably need to consider whether the path is actively used or not. Only if you send on a path and all packets are eventually marked as lost is that a good indication that the path is broken and you should close it. If a path is not used, you still might want to keep it open to use it later. I'm not sure that a requirement to send pings on all unused paths is that useful.
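To make the distinction concrete, here is a minimal sketch of that decision rule. It is not taken from the draft; `PathStats`, `Action` and `on_path_check` are hypothetical names, and the rule assumes loss is counted over some recent observation window.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    CLOSE = "close"


@dataclass
class PathStats:
    # Hypothetical counters, for illustration only.
    packets_sent: int = 0  # ack-eliciting packets sent in the observation window
    packets_lost: int = 0  # how many of those were eventually declared lost


def on_path_check(stats: PathStats) -> Action:
    """Decide what to do with a path based only on traffic actually sent on it."""
    if stats.packets_sent == 0:
        # Standby path: nothing was sent, so silence says nothing about its health.
        return Action.KEEP
    if stats.packets_lost >= stats.packets_sent:
        # Everything sent was eventually marked lost: the path is likely broken.
        return Action.CLOSE
    return Action.KEEP
```

With this rule an unused standby path is never closed just because time has passed; only evidence from real traffic triggers a close.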

Another related issue is that if you only send non-ack-eliciting packets on a path (like ACKs) and the path is not used by the peer for sending, you might not be able to detect a path breakage. Maybe it would make sense to require sending ack-eliciting frames on all used paths from time to time?
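One way such a rule could look, as a rough sketch only: track when the last ack-eliciting packet was sent on each path that is in use, and send a PING once a probe interval has elapsed. The function name, its parameters and the interval are assumptions for illustration, not anything specified in the draft.

```python
def should_send_ping(
    now: float,
    last_ack_eliciting_sent: float,
    path_in_use: bool,
    probe_interval: float,
) -> bool:
    """Return True if an ack-eliciting frame (e.g. PING) is due on this path.

    All times are in seconds; probe_interval is a hypothetical local setting.
    """
    if not path_in_use:
        # Unused (standby) paths are left alone under this rule.
        return False
    # Only ack-eliciting packets produce ACKs that can reveal a broken path.
    return now - last_ack_eliciting_sent >= probe_interval
```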

huitema commented 5 days ago

The timeout issue was already addressed in PR #377, see section "Idle Timeout". The main drawback of the current text is that it forces some keep-alive traffic for paths kept in standby. I think that's a reasonable compromise: if a path is not used for a very long time, there is no guarantee that it will still be available when the endpoints decide to use it.

mirjak commented 4 days ago

That's exactly the point that I would like to discuss further. Currently the text in the "idle timeout" section says:

Hosts SHOULD stop sending traffic on a path if for at least the period of the idle timeout.

Note that PR #377 did not change this text. However, I think that the assumption that if a path is not used for a while it is automatically broken is not useful, especially if you keep a path open as a standby on purpose.

The question of whether you want to send keep-alive traffic on such a path is a separate one for me, because it is really about when to detect a path failure. I.e. I think there is nothing wrong with keeping a path open without using it; then, if your actively used path fails, you switch over to that path. If that one is not working either, you can try to establish another path or close the connection. If you think it's likely that your standby path will break, of course it can make sense to actively probe liveness to avoid delays when you later actually try to use that path. However, I don't think we should require a path to be closed after an idle time, or require sending potentially unnecessary probing traffic.
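The fallback order described here can be sketched as follows. This is an illustration only; `pick_path` and the `is_working` callback are hypothetical, and how an endpoint learns that a path works (for example by probing it on first use) is left open.

```python
from typing import Callable, Optional, Sequence


def pick_path(
    active: str,
    standby: Sequence[str],
    is_working: Callable[[str], bool],
) -> Optional[str]:
    """Return the path to send on, or None if no known path works."""
    if is_working(active):
        return active
    # The active path failed: fall back to the standby paths in order.
    for path in standby:
        if is_working(path):
            return path
    # Nothing works: the caller may try to open a new path or close the connection.
    return None
```

For example, `pick_path("wifi", ["cellular"], lambda p: p == "cellular")` returns `"cellular"`, even if the cellular path carried no traffic at all before the switch.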

huitema commented 4 days ago

I think that we agree. The text in the "idle timeout" section says two things:

specify an idle timeout behavior similar to what was in previous drafts
require that if a path is abandoned because of idle timeout, the endpoints must explicitly abandon it.

The second point is sound. We could say the first one differently, and treat the decision to abandon paths after a timeout as a local behavior. But if an endpoint is going to use an idle timer, it is better to tell the peer about it, because the peer may want to send keep-alive traffic to avoid the closure, and it needs to know the path timeout to parameterize that keep-alive traffic.
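As a rough sketch of how a peer might parameterize keep-alives from an advertised timeout: the code below borrows the rules RFC 9000 uses for the connection idle timeout (take the minimum of the two advertised values, and never go below three times the current PTO). Applying the same rules to a per-path timer, the function name, and the `fraction` parameter are all assumptions for illustration.

```python
from typing import Optional


def keep_alive_interval(
    local_timeout: float,
    peer_timeout: float,
    pto: float,
    fraction: float = 0.5,  # hypothetical safety margin, not from any spec
) -> Optional[float]:
    """Return how often to send keep-alives, in seconds, or None if no timer applies.

    A timeout of 0 means that side did not advertise an idle timeout.
    """
    advertised = [t for t in (local_timeout, peer_timeout) if t > 0]
    if not advertised:
        return None  # nobody runs an idle timer, so no keep-alive is needed for it
    # Effective timeout: the smaller advertised value, floored at 3 * PTO.
    effective = max(min(advertised), 3 * pto)
    return effective * fraction
```

Without the advertised value the peer would have to guess the interval, which is the argument for signaling the timer rather than treating it as purely local.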

mirjak commented 3 days ago

So, yes, I propose to change only the first point, because there is no need to enforce anything here in the multipath case. If you only have a single path and it breaks, you have to close it at some pre-agreed time because you can't communicate anything else to the peer anymore. However, in the multipath case, as long as you still have at least one working path, there is actually no really good reason to force the close by a timeout, because either peer can at any time gracefully close the path by sending an abandon frame on another path.

This can be useful if e.g. an interface goes down but comes back after a short time. Of course you have to remember that a path is currently not working and somehow recognise when it's up again (by probing it?). But not sure we need to specify much here.

Alternatively, I think it would probably make sense to have different timeouts for the connection and for (sub-)paths, and we could revisit that question. But not enforcing anything and just leaving it as a local decision is the easiest.

On your text above: if the peer keeps sending keep-alive traffic and the endpoint keeps receiving and acknowledging it, then there should never be a time-out, no?

Maybe there are cases where an endpoint could decide to close a path that is only used for keep-alive traffic but that's a different case that we maybe should discuss separately (in the implementation considerations section?).

quicwg / multipath

What to do on path time-out? #397