Key Update and time to next possible update

gloinul commented 9 months ago

So Section 5.3 in -06 says:

When this specification is used, endpoints SHOULD wait for at least three times the largest PTO among all the paths before initiating a new key update after receiving an acknowledgement that confirms receipt of the previous key update. This interval is different from that of QUIC version 1 which used three times the PTO of the only one active path.
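
For reference, a minimal sketch of the interval described in the quoted text, assuming each path's PTO is computed as in RFC 9002 (smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay). The `PathRtt` type and the function names are hypothetical and are not taken from the draft or from any implementation:

```python
from dataclasses import dataclass

K_GRANULARITY = 0.001  # seconds; timer granularity recommended by RFC 9002


@dataclass
class PathRtt:
    """Hypothetical per-path RTT state, in seconds."""
    smoothed_rtt: float
    rttvar: float


def pto(path: PathRtt, max_ack_delay: float) -> float:
    # PTO as defined in RFC 9002.
    return path.smoothed_rtt + max(4 * path.rttvar, K_GRANULARITY) + max_ack_delay


def min_key_update_interval(paths: list[PathRtt], max_ack_delay: float) -> float:
    # Three times the largest PTO among all paths, per the quoted text.
    return 3 * max(pto(p, max_ack_delay) for p in paths)


# Illustrative values only: a short path and a long path.
paths = [PathRtt(0.020, 0.005), PathRtt(0.300, 0.050)]
interval = min_key_update_interval(paths, max_ack_delay=0.025)
```

Note that the result is only as good as the RTT state it is computed from: a path with stale samples contributes a stale PTO.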

After testing key updates in our implementations over paths of different lengths, we have drawn some conclusions.

First, it is difficult for an endpoint to know when an old key can safely be dropped. With multipath there are multiple paths on which packets and ACKs can be sent, and not all of them are necessarily in use. The fact that one gets the new key phase back from the responder indicates that, beyond on-path reordering, things should be settled. For multipath, however, one can only conclude that the key-phase change has happened on the paths that are in use. For a path that is currently not receiving any packets, one doesn't know whether packets are just significantly delayed or were never sent. Also, the RTT samples one has for a path might not be relevant for its current delay. Thus, beyond observing that some key-phase change comes back and then starting a really long timer, it is not obvious that one can construct an algorithm that lets an endpoint know that no packets protected with the old keys are still outstanding.

Because this time needs to be longer than the worst-case RTT change on each path, 3*PTO is likely too short an interval before re-enabling key update, at least unless one wants to require trial decryption of packets. That is, one keeps the old key while using the current key, so that when a packet with a different key phase arrives, one can first try the old key; if that fails and the next key works, one drops the old key and from that point on assumes that this key phase is the new key, thus initiating a key update.
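
A rough sketch of that trial-decryption fallback, under loud assumptions: `seal`/`open_packet` are a toy HMAC stand-in for real AEAD packet protection, `derive_next` is a toy stand-in for the real key-update derivation, and the `Receiver` class is hypothetical. It only illustrates the order in which keys are tried. Because the Key Phase is a single bit, the old key and the next key share the same bit value, which is why the receiver cannot tell them apart without trying:

```python
import hashlib
import hmac
from typing import Optional


def derive_next(key: bytes) -> bytes:
    # Toy key derivation; NOT the HKDF-based update used by QUIC.
    return hashlib.sha256(b"toy ku" + key).digest()


def seal(key: bytes, payload: bytes) -> bytes:
    # Toy stand-in for AEAD protection: 32-byte HMAC tag, then payload.
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def open_packet(key: bytes, packet: bytes) -> Optional[bytes]:
    # Returns the payload if the tag verifies under `key`, else None.
    tag, payload = packet[:32], packet[32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None


class Receiver:
    """Hypothetical receive-side state holding old, current and next keys."""

    def __init__(self, old: Optional[bytes], current: bytes, phase: int):
        self.old, self.current = old, current
        self.next = derive_next(current)
        self.phase = phase  # Key Phase bit that corresponds to `current`

    def receive(self, phase_bit: int, packet: bytes) -> Optional[bytes]:
        if phase_bit == self.phase:
            return open_packet(self.current, packet)
        # Different key phase: first try the old key (a delayed packet).
        if self.old is not None:
            payload = open_packet(self.old, packet)
            if payload is not None:
                return payload
        # Old key failed: if the next key works, the peer has updated.
        payload = open_packet(self.next, packet)
        if payload is not None:
            # Drop the old key; the previous current key becomes the old one.
            self.old, self.current = self.current, self.next
            self.next = derive_next(self.current)
            self.phase = phase_bit
        return payload


k0 = b"k0" * 16
k1 = derive_next(k0)
k2 = derive_next(k1)
rx = Receiver(old=k0, current=k1, phase=1)
delayed = rx.receive(0, seal(k0, b"late"))  # opened with the old key
updated = rx.receive(0, seal(k2, b"new"))   # old key fails, next key works
```

The cost of this approach is up to two failed or extra decryption attempts per packet with a flipped key phase, which is the trade-off against relying on a timer alone.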

kazuho commented 9 months ago

I think I agree that the problem exists, though I am not sure how much it would matter in practice.

Separately, I might point out that this is an existing issue in RFC 9001.

In both QUIC v1 and multipath, endpoints can have idle paths, whose actual RTT might be very different from the previous estimate. Also, when the endpoint initiates a Key Update while the peer is trying to open a new path, the RTT of the new path cannot be taken into consideration.

When recommending the Key Update interval, QUIC v1 only took the RTT estimate of its one active path into consideration. We adapted that model to Multipath and said that all paths that exist have to be taken into consideration.

I think that is an improvement over QUIC v1 in the sense that idle paths are at least taken into consideration (even though their RTT estimates are old).

The question is if we want to do more.

Honestly, I do not care much considering how infrequent key updates are. My anticipation is that this problem would be a minor source of loss events unrelated to congestion, if any.

Considering that, it might be sufficient to just note that such a problem exists. We can address the issue in a future version of QUIC.

The other idea would be to state that the minimum recommended interval is 3*max(maxPTO, initialPTO); with that change, paths that are being created would be taken into consideration. But that is not a perfect solution.
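
As a rough illustration of that formula, assuming initialPTO means the PTO obtained before any RTT sample exists, i.e. from RFC 9002's default initial RTT of 333 ms (the comment does not define it, so this is an assumption), and assuming max_ack_delay is 0 for that case. The function name is hypothetical:

```python
INITIAL_RTT = 0.333  # seconds; kInitialRtt from RFC 9002

# Before any RTT sample: smoothed_rtt = kInitialRtt and rttvar = kInitialRtt / 2,
# so the PTO is roughly 0.333 + 4 * 0.1665 = 0.999 s.
INITIAL_PTO = INITIAL_RTT + 4 * (INITIAL_RTT / 2)


def recommended_min_interval(path_ptos: list[float]) -> float:
    # 3 * max(maxPTO, initialPTO); path_ptos are per-path PTOs in seconds.
    max_pto = max(path_ptos, default=0.0)
    return 3 * max(max_pto, INITIAL_PTO)


fast_paths = recommended_min_interval([0.065, 0.120])  # floor applies
slow_path = recommended_min_interval([0.065, 1.500])   # largest PTO applies
```

With only fast paths the initial PTO acts as a floor of about 3 seconds; once any path's PTO exceeds it, the result is the same as 3 times the largest PTO.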

mirjak commented 1 month ago

I don't think we need to change anything here. As noted, using 3x the largest PTO is really meant as a safety margin to capture most cases, but it might not be able to capture all. However, given how rare and unlikely this problem is, I don't think it is the end of the world if you lose some packets and have to retransmit them. Also, I don't think there really is much difference from RFC 9000. I guess the only other thing we could do is to recommend closing all idle paths, or trying to get a fresh RTT sample by sending a PING before you do the key update. But is that really needed?

gloinul commented 1 month ago

So I think this is a problem that is not expected to be experienced in real-world usage. Our interop testing makes key updates far more aggressive than would otherwise ever be used. However, the main point is that 3*PTO might not be a sufficient security margin, even if it would generally work. So I suggest documenting that there is a risk of decryption failures if the RTT estimates behind the PTO are wrong.

quicwg / multipath

Key Update and time to next possible update #290