quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Initial support for PLPMTUD #1510

Closed · aochagavia closed this 1 year ago

aochagavia commented 1 year ago

This PR resurrects @djc's old https://github.com/quinn-rs/quinn/pull/804. For now it is just a draft. IMO it makes sense to review the commits separately (we might want to squash them in the end, though).

Below are some bits of feedback from the original PR that I'm planning to include before considering this PR as ready for review (feel free to suggest more):

Looking at the original PR, one question comes up: I am tracking ACKs for MTU probes manually (i.e. without using PacketBuilder::finish_and_track), because otherwise things like automatic retransmission kick in. That seems different from what @djc did in the original PR (he seems to get ACK and packet-loss detection automatically, without additional code). Am I working in the right direction, or did I miss a function I could use to get ACKs + lost packets for free like in the original PR? Edit: I messed up and finally figured out how to get things working :)

Closes #69

Ralith commented 1 year ago

I am tracking ACKs for MTU probes manually (i.e. without using PacketBuilder::finish_and_track), because otherwise things like automatic retransmission kick in.

Per RFC 9000 §13.3, Retransmission of Information, QUIC retransmission operates at frame granularity, not packet granularity. In particular note:

PING and PADDING frames contain no information, so lost PING or PADDING frames do not require repair.

So, there should be no need to bypass the normal loss detection machinery. However, note that there is a subtle interaction with congestion control:

Loss of a QUIC packet that is carried in a PMTU probe is therefore not a reliable indication of congestion and SHOULD NOT trigger a congestion control reaction; see Item 7 in Section 3 of [DPLPMTUD]. However, PMTU probes consume congestion window, which could delay subsequent transmission by an application.
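That interaction can be honored without special-casing the tracking itself. A minimal sketch (hypothetical types, not quinn's actual internals) of keeping probe loss out of the congestion response:

```rust
struct SentPacket {
    size: u16,
    is_mtu_probe: bool,
}

struct PathState {
    current_mtu: u16,
}

impl PathState {
    /// The normal loss-detection machinery declared this packet lost.
    fn on_packet_lost(&mut self, packet: &SentPacket) {
        if packet.is_mtu_probe {
            // PING and PADDING carry no information, so there is nothing to
            // retransmit; probe loss only feeds the MTU search, not the
            // congestion controller (RFC 9000 §14.4).
            self.on_probe_lost(packet.size);
        } else {
            self.on_congestion_event(packet.size);
        }
    }

    fn on_probe_lost(&mut self, _probe_size: u16) { /* advance the MTU search */ }
    fn on_congestion_event(&mut self, _bytes: u16) { /* normal loss response */ }
}
```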

aochagavia commented 1 year ago

@Ralith thanks for the feedback! I think I finally figured out how to use finish_and_track :). The current status of the PR is that it replicates the original changes by @djc, plus black hole detection and tests (check out, for instance, blackhole_after_mtu_change_repairs_itself). Any suggestions so far?

Next week I plan to hunt down the test failure (unfortunately, the test runner doesn't give info about which test actually failed, other than "thread panicked while panicking. aborting.").

Ralith commented 1 year ago

unfortunately, the test runner doesn't give info about which test actually failed

Hmm, we should probably set RUST_BACKTRACE=1 in the test environment.

aochagavia commented 1 year ago

Thanks for the suggestions! Next week I'll have time to go through them ;)

By the way, we'll probably want to tweak a few things (please let me know if you think differently):

Also interesting: how do you think this change should affect versioning? Once this is merged, endpoints will be sending a few more packets than usual, which might be unexpected for the crate's users. Related to this, do you think it is worthwhile to provide a way to disable PLPMTUD?

djc commented 1 year ago

Also interesting: how do you think this change should affect versioning? Once this is merged, endpoints will be sending a few more packets than usual, which might be unexpected for the crate's users. Related to this, do you think it is worthwhile to provide a way to disable PLPMTUD?

I don't think this is reason for a semver-incompatible version bump, if that's what you are referring to, as long as we don't remove any API that people might currently be relying on. That said, I think we'll want to release 0.10 in the not-too-distant future (see #1515), so it might make sense to release this at the same time.

Ralith commented 1 year ago

do you think it is worthwhile to provide a way to disable PLPMTUD

I think it's fine to silently enable it by default, but I can imagine use cases where it might be unnecessary (known networks) or harmful (endpoints which want to sleep for very long periods). A good approach might be to make the interval configurable through an Option<Duration> or similar.
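For illustration, one hypothetical shape for that knob (the names are mine, not quinn's API):

```rust
use std::time::Duration;

// Hypothetical config sketch: `None` disables re-probing (known networks,
// long-sleeping endpoints), `Some(d)` re-runs discovery every `d`.
struct MtuDiscoveryConfig {
    probe_interval: Option<Duration>,
}

impl Default for MtuDiscoveryConfig {
    fn default() -> Self {
        // Assumed default for the sketch: probe every 10 minutes.
        Self { probe_interval: Some(Duration::from_secs(600)) }
    }
}
```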

aochagavia commented 1 year ago

Today I addressed @Ralith's feedback, found the test that was crashing cargo test and implemented binary search.

Here are some interesting bits and a few questions:

  1. Binary search uses TransportConfig::initial_max_udp_payload_size as the initial lower bound. The initial upper bound is the minimum of the local EndpointConfig::max_udp_payload_size and the peer's max_udp_payload_size (a sketch of the search follows this list). One question regarding this: technically, the local EndpointConfig::max_udp_payload_size describes which sizes we are willing to receive, not which sizes we are willing to send, so I'm not totally sure it makes sense to reuse it here (i.e. maybe we should have a separate way of configuring the upper bound for our binary search). Do you have an opinion on this?
  2. As mentioned on Matrix, the docs of initial_max_udp_payload_size mention that "effective max UDP payload size (...) will never fall below [the provided] value". I think we actually should fall below the provided value in the case of black holes (that is what the current implementation does). Or am I missing something?
  3. Test tests::cid_retirement is failing after my changes. I don't see how that is related to PLPMTUD, so if you have a hunch I'm all ears. Tomorrow I plan to get a diff between the traces with the feature enabled / disabled, to investigate further. Hopefully that will get me closer to finding the cause of the failure.
  4. Finally, I'm planning to add a configurable interval for how often MTUD should run. I like @Ralith's suggestion above, allowing None for cases where MTUD should run only once and Some(_) for cases in which a regular interval is desired.
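A simplified sketch of the binary search from item 1 (hypothetical and condensed; the real implementation also retries each size and tracks probe packets):

```rust
// Simplified sketch of binary-search MTU discovery (not the PR's exact code).
struct MtuSearch {
    lower: u16, // largest size known to work (initial_max_udp_payload_size)
    upper: u16, // smallest size known to fail, or the configured ceiling
}

impl MtuSearch {
    /// Next probe size, or `None` once the bounds have converged.
    fn next_probe(&self) -> Option<u16> {
        if self.upper - self.lower <= 1 {
            return None; // converged: `lower` is the discovered MTU
        }
        Some(self.lower + (self.upper - self.lower) / 2)
    }

    fn on_probe_acked(&mut self, size: u16) {
        self.lower = size; // the path carries `size`-byte datagrams
    }

    fn on_probe_lost(&mut self, size: u16) {
        self.upper = size; // (after repeated losses) treat `size` as too big
    }
}
```
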
Ralith commented 1 year ago

None for cases where MTUD should run only once

Why once rather than never?

aochagavia commented 1 year ago

None for cases where MTUD should run only once

Why once rather than never?

No strong preference here, I just thought you meant once. I'll go with never when the interval is None, then ;)

aochagavia commented 1 year ago

Just pushed an updated (and rebased) version of the PR, with updated docs, passing tests and all the features we discussed above.

There are still a few things I want to look into (tomorrow), before marking this PR as "ready for review" (instead of it being a draft):

  1. I'm not totally satisfied with the current form that MtuDiscovery has taken. I'll see if I can refactor it a bit to make it more readable.
  2. I want to double check that our approach to black hole detection is sound, particularly its consequences in the presence of congestion. For that I'll have another peek at s2n-quic and maybe also ask around in the QUIC WG Slack.
  3. I want to double check that all the code that is using mtud.current_mtu() doesn't mind that the returned value of the function is allowed (and expected!) to change.

Also, there is this comment in the old PR, mentioning that we should only enable PLPMTUD on platforms where we set the IP DF bit. Is that still relevant @Ralith?

Finally, I added a commit to make Pair::drive more robust (otherwise it would consider a connection idle even though there were packets waiting to be sent, as I discovered after thoroughly debugging the cid_retirement test).

Ralith commented 1 year ago

technically, the local EndpointConfig::max_udp_payload_size describes which sizes we are willing to receive, not which sizes we are willing to send, so I'm not totally sure it makes sense to reuse it here (i.e. maybe we should have a separate way of configuring the upper bound for our binary search). Do you have an opinion on this?

Do we strictly need an upper bound? We could do open-ended "binary search" by growing the probe size exponentially until it starts failing. That'd be cool for Just Working on localhost, LANs with arbitrarily large jumbo frames, and so forth. Though, we probably should take care not to overly sacrifice rate of convergence in a typical environment for these niche cases. I'm not strongly opposed to a dedicated knob, but it's nice to have universal logic when possible. We might want to tune it carefully to test typical internet limits first regardless.
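For illustration, the open-ended variant might bracket the MTU by doubling before bisecting (a hypothetical sketch, not a concrete proposal):

```rust
// Hypothetical open-ended search: grow exponentially until a probe fails,
// then bisect between the last ACKed size and the first lost size.
fn next_probe(last_acked: u16, first_lost: Option<u16>) -> Option<u16> {
    match first_lost {
        // Growth phase: double, capped at the largest possible UDP payload.
        None => Some(last_acked.saturating_mul(2).min(65527)),
        // Bisection phase: narrow the bracket until it converges.
        Some(lost) if lost - last_acked > 1 => Some(last_acked + (lost - last_acked) / 2),
        Some(_) => None, // converged: `last_acked` is the discovered MTU
    }
}
```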

As mentioned on Matrix, the docs of initial_max_udp_payload_size mention that "effective max UDP payload size (...) will never fall below [the provided] value". I think we actually should fall below the provided value in the case of black holes (that is what the current implementation does). Or am I missing something?

Capturing our discussion in chat: I'm happy to turn this into an initial guess rather than a hard floor.

Also, there is https://github.com/quinn-rs/quinn/pull/804#issuecomment-1172950107 in the old PR, mentioning that we should only enable PLPMTUD on platforms where we set the IP DF bit. Is that still relevant @Ralith?

Yes; I believe we currently only set that bit on Linux and Windows.

aochagavia commented 1 year ago

Do we strictly need an upper bound? We could do open-ended "binary search" by growing the probe size exponentially until it starts failing. (...). We might want to tune it carefully to test typical internet limits first regardless.

The upper bound is not strictly needed, but since the peer and the local endpoint have a configured max_udp_payload_size transport parameter, I think we can take that as a hint. To give you an idea of the performance difference on a typical internet path (MTU = 1500), here is the number of probes sent before converging, depending on the upper bound used:

Note that each failed attempt at a particular MTU results in 3 probes being sent. After the third lost probe, we conclude that the probed MTU is not supported by the network path.
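As a rough back-of-envelope of why the upper bound matters (my numbers, assuming one probe per step when probes succeed): binary search halves the candidate interval each step, so bounds of [1200, 1500] converge in about ceil(log2(300)) = 9 steps, while an upper bound at the protocol maximum of 65527 needs about ceil(log2(64327)) = 16 steps. Worse, on a real 1500-MTU path every step that probes above the true limit loses all 3 probes before the search can move on.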

I took the time to experiment using exponential search to find the upper bound before doing binary search. Here are the results:

This leads me to think we should keep the current approach, and require users to configure max_udp_payload_size in conjunction with initial_max_udp_payload_size if they are aiming for MTUs above 1500.

Ralith commented 1 year ago

Nice analysis! I concur that convergence in realistic cases should take precedence.

the peer and the local endpoint have a configured max_udp_payload_size transport parameter, I think we can take that as a hint

Hmm, the RFC says:

It is expected that this is the space an endpoint dedicates to holding incoming packets.

and specifies the default value as 65527, so this may not be a good hint in practice, and our current default of 1480 might be inappropriate. For the sake of fast convergence I'd be happy to have a dedicated transport config field for MTUD upper bound, and default it to the appropriate value for a typical internet path. Future work could perhaps look at occasional exponential probes above this if the upper bound is reached.

aochagavia commented 1 year ago

Thanks for the feedback. I ended up introducing a PlpmtudConfig struct that contains the necessary config for PLPMTUD.

Here's another update and some additional questions:

  1. Regarding black hole detection, I was unable to ask about it on the QUIC WG Slack because it is invite-only (I sent an email to the mailing list to request an invitation, but haven't heard anything yet). In any case, we are following the same approach as s2n-quic, which seems sound to me (it was introduced in this PR, but there is no discussion about why they took that approach). The gist of it (sketched after this list): we keep a "black hole counter" that is incremented whenever a lost non-MTU-probe packet is larger than 1200 bytes, unless we have received an ACK for an MTU-sized packet with a larger packet number (which would indicate the path is fine). If the counter reaches 4, we assume there is a black hole: the current MTU is lowered to 1200 and a new PLPMTUD run is scheduled 60 seconds later (unless PLPMTUD is disabled). The counter is reset whenever an MTU-sized packet is acknowledged. I'm inclined to follow s2n-quic's approach here, unless someone can come up with a better idea (or can arrange an invite to the QUIC WG Slack).
  2. Regarding the interaction between PLPMTUD and congestion control... The RFCs I am following seem to contradict each other. RFC 8899 specifies that "An update to the PLPMTU (or MPS) MUST NOT increase the congestion window measured in bytes", while RFC 9002 says that "If the maximum datagram size changes during the connection, the initial congestion window SHOULD be recalculated with the new size.". Any ideas on which one we should follow? Or am I somehow misunderstanding the RFCs? There is an issue in s2n-quic's repository related to this, and this morning I posted my findings, so they might have ideas on what to do about it. Otherwise it would be good to bring it up on Slack.
  3. Regarding the IP DF bit, I had a look at quinn-udp and I can't fully conclude from the code which platforms have the DF bit set. I'll assume that @Ralith's comment above is correct and make sure PLPMTUD is only allowed on Linux and Windows (it surprises me, though, that we are unable to do it on other mature operating systems).
  4. Finally, are there any other things I should look at in order to bring this PR into shape, so it can land?
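Here is a condensed sketch of the black hole heuristic from item 1 (hypothetical names; simplified from the s2n-quic-style approach described above):

```rust
const BASE_MTU: u16 = 1200;
const BLACK_HOLE_THRESHOLD: u8 = 4;

struct BlackHoleDetector {
    counter: u8,
    /// Largest packet number among ACKed MTU-sized packets.
    largest_acked_mtu_sized: Option<u64>,
}

impl BlackHoleDetector {
    /// A non-probe packet was declared lost. Returns true if a black hole is
    /// detected; the caller then lowers the MTU to BASE_MTU and reschedules
    /// discovery 60 seconds later.
    fn on_packet_lost(&mut self, packet_number: u64, size: u16) -> bool {
        let excused = self
            .largest_acked_mtu_sized
            .map_or(false, |acked| acked > packet_number);
        if size > BASE_MTU && !excused {
            self.counter += 1;
        }
        self.counter >= BLACK_HOLE_THRESHOLD
    }

    /// An MTU-sized packet was acknowledged: the path is fine.
    fn on_mtu_sized_acked(&mut self, packet_number: u64) {
        self.counter = 0;
        self.largest_acked_mtu_sized = Some(packet_number);
    }
}
```
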
djc commented 1 year ago
  1. Regarding black hole detection, I was unable to ask about it on the QUIC WG Slack because it is invite-only (I sent an email to the mailing list to request an invitation, but haven't heard anything yet).

When did you request an invite?

  3. Regarding the IP DF bit, I had a look at quinn-udp and I can't fully conclude from the code which platforms have the DF bit set. I'll assume that @Ralith's comment above is correct and make sure PLPMTUD is only allowed on Linux and Windows (it surprises me, though, that we are unable to do it on other mature operating systems).

Looks like it's IP_DONTFRAG/IPV6_DONTFRAG for Windows and IP_PMTUDISC_PROBE for Linux. https://stackoverflow.com/questions/4415725/ip-dont-fragment-bit-on-mac-os has some suggestions for macOS (that I suppose we don't implement yet).
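For reference, setting the Linux option mentioned above looks roughly like this with the libc crate (a minimal sketch; quinn-udp's actual socket setup covers more cases and platforms):

```rust
use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

// Linux-only sketch: IP_PMTUDISC_PROBE sets the DF bit and ignores the
// kernel's cached path MTU, which is what MTU probing wants.
fn set_dont_fragment(socket: &UdpSocket) -> std::io::Result<()> {
    let value: libc::c_int = libc::IP_PMTUDISC_PROBE;
    let rc = unsafe {
        libc::setsockopt(
            socket.as_raw_fd(),
            libc::IPPROTO_IP,
            libc::IP_MTU_DISCOVER,
            &value as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
}
```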

  4. Finally, are there any other things I should look at in order to bring this PR into shape, so it can land?

If you think it's reasonably close, mark it as ready and I'll do a full review pass.

aochagavia commented 1 year ago

When did you request an invite?

Yesterday, 10:00 CET (so quite recently). I am not sure my email went through, though, as I didn't receive a confirmation or anything of the sort (I wrote to quic-chairs@ietf.org, as recommended on https://quicwg.org)

If you think it's reasonably close, mark it as ready and I'll do a full review pass.

Thanks! I'll probably do that next week (I want to look into the DF bit stuff first, and maybe refactor the MtuDiscovery struct a bit)

Ralith commented 1 year ago

I'll try to give the code here a closer look this weekend as well.

I'm inclined to follow s2n-quic's approach here

Seems reasonable. We can run some tests to see if this tends to collapse under congestion in practice.

Regarding the interaction between PLPMTUD and congestion control... The RFCs I am following seem to contradict each other

This is odd. I'm not sure it's actually a contradiction, because AFAICT we only ever care about the initial window when we first initialize a path, before MTUD has had any opportunity to run on it. However, that raises the question: what is the purpose/meaning of recalculating the initial window during the connection? Maybe another good question for the Slack.

Yesterday, 10:00 CET (so quite recently).

Lars responded to me in <24h (there was no automated email); if nobody's responded by tomorrow we can bug people in the chat directly.

aochagavia commented 1 year ago

@Ralith thanks for the review. I just finished addressing all your comments, and left a few questions / remarks here and there.

While looking at the pacer and its interaction with the congestion controller, I discovered that each congestion controller has a max_datagram_size field, so I assume that will need to be updated based on the detected MTU. I'll look into that tomorrow.

aochagavia commented 1 year ago

So today I've been looking into updating the congestion controller's max_datagram_size when we have discovered a new MTU. Unfortunately, the current API design works against it: we currently offer a direct way to set max_datagram_size, independently of the path's MTU, and it would be pretty surprising if we now linked the two together. IMO, if we want the congestion controllers to use the path's MTU as their max_datagram_size, we would need to rethink the API. Basically, we would need to remove the max_datagram_size function from the controller's config struct (which is a breaking change), and start using the current MTU instead. This might also mean that we need to automatically derive the values of other config fields, like initial_window and minimum_window, because they depend on the value of max_datagram_size. Overall it feels like a pretty invasive change, and since we are already allowing max_datagram_size and the current MTU to diverge, I'd rather postpone it.
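For concreteness, the postponed redesign might look something like this (hypothetical sketch, not the PR's code): the controller learns the datagram size from the path instead of carrying its own config field.

```rust
// Hypothetical: the congestion controller is told the path's current MTU
// and rederives its windows from it.
trait Controller {
    /// Called when PLPMTUD discovers a new MTU (or a black hole shrinks it).
    fn on_mtu_update(&mut self, new_mtu: u16);
}

struct NewReno {
    max_datagram_size: u64,
    window: u64,
    minimum_window: u64,
}

impl Controller for NewReno {
    fn on_mtu_update(&mut self, new_mtu: u16) {
        self.max_datagram_size = u64::from(new_mtu);
        // Derived values must follow: RFC 9002 defines kMinimumWindow as
        // 2 * max_datagram_size.
        self.minimum_window = 2 * self.max_datagram_size;
        self.window = self.window.max(self.minimum_window);
    }
}
```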

I also looked into enabling PLPMTUD only on platforms that set the IP DF bit. The technical side is straightforward, but the API design side less so. Assuming we want to keep quinn-proto free of platform-specific code, I see the following options:

  1. Have TransportConfig::default enable PLPMTUD by default. Direct users of quinn-proto would need to make sure they are properly setting the DF bit on their IO implementation, or otherwise disable PLPMTUD through TransportConfig::plpmtud_config(None). On our side, we would make quinn::Endpoint ignore the PLPMTUD configuration on unsupported platforms. Unfortunately, this requires a backwards-incompatible change (as far as I can see), because quinn::Endpoint has read-only access to the TransportConfig (through Arc<TransportConfig>) and we would need to mutate it.
  2. Have TransportConfig::default disable PLPMTUD by default. The function to enable it should document the fact that PLPMTUD is platform-specific and that you should enable it at your own risk. In a later PR we could investigate the possibility of making quinn::Endpoint raise an error if PLPMTUD is enabled on unsupported platforms (it would require additions to quinn-proto's public API, so I'd rather postpone this part).
  3. Introduce a PlatformConfig type in quinn-proto that indicates platform support for particular features, such as setting the DF bit on outgoing UDP packets. We could have a PlatformConfig::default_transport_config(&self) -> TransportConfig method that generates a suitable default TransportConfig, with PLPMTUD enabled if setting the DF bit is supported (rough sketch after this list). The PlatformConfig instance would be created by users of the quinn-proto library.
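A rough sketch of what option 3 could look like (entirely hypothetical; none of these types or fields exist in quinn-proto today):

```rust
#[derive(Default)]
struct TransportConfig {
    plpmtud_enabled: bool,
    // ...the rest of the existing transport settings
}

struct PlatformConfig {
    /// Whether the user's I/O layer sets the DF bit on outgoing packets.
    sets_df_bit: bool,
}

impl PlatformConfig {
    fn default_transport_config(&self) -> TransportConfig {
        TransportConfig {
            // Only enable PLPMTUD when probes cannot be silently fragmented.
            plpmtud_enabled: self.sets_df_bit,
            ..TransportConfig::default()
        }
    }
}
```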

All in all I think option 2 is the most viable, so the current implementation disables PLPMTUD by default.

With all this said, @djc I think the PR is ready for a thorough review, so I have marked it as ready.

Ralith commented 1 year ago

Basically, we would need to remove the max_datagram_size function from the controller's config struct (which is a breaking change), and start using the current MTU instead.

That sounds correct to me. This breakage is fine; we already have breakage pending/planned, and most users probably aren't using this interface. That said, happy to leave this for future work.

All in all I think option 2 is the most viable, so the current implementation disables PLPMTUD by default.

Since quinn-proto is independent from a specific I/O backend, this makes sense to me. A similar issue exists for segmentation offload configuration; future iterations on this should probably address both. I like the idea of a PlatformConfig struct to make this easier for users to get right, but it's fine to leave that for future work too. In the meantime, we should at least update the main examples and the performance test/benchmarking tools to enable MTU discovery on Linux/Windows.

aochagavia commented 1 year ago

Just rebased on top of latest main and pushed a commit modifying the benchmarks, perf, and examples. I think there isn't anything left to be done on my side. @djc could you have a look when you have the time?

The next things I plan to pick up in follow-up PRs:

aochagavia commented 1 year ago

@djc I just pushed an amended (and rebased) version of the commits that address all your suggestions.

aochagavia commented 1 year ago

@djc just pushed changes addressing your comments