quicwg / base-drafts

Internet-Drafts that make up the base QUIC specification
https://quicwg.org

Explicitly communicate and include ACK delay in RTO computations #981

Closed: huitema closed this issue 6 years ago

huitema commented 6 years ago

The recovery spec says:

RTT is calculated when an ACK frame arrives by computing the difference between the current time and the time the largest newly acked packet was sent. If no packets are newly acknowledged, RTT cannot be calculated. When RTT is calculated, the ack delay field from the ACK frame SHOULD be subtracted from the RTT as long as the result is larger than the Min RTT. If the result is smaller than the min_rtt, the RTT should be used, but the ack delay field should be ignored.
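
In pseudocode, that rule amounts to something like the following (a sketch; latest_rtt and min_rtt follow the draft's naming, the rest is illustrative):

latest_rtt = now - send_time_of_largest_newly_acked
min_rtt = min(min_rtt, latest_rtt)
if latest_rtt - ack.ack_delay >= min_rtt:
    latest_rtt -= ack.ack_delay  # subtract the peer-reported delay when safe
# otherwise keep the raw sample and ignore ack.ack_delay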

This is fine advice when computing the RTT, but there is a small issue arising later:

When the final TLP packet is sent, the RTO period is set to max(SRTT + 4*RTTVAR, minRTO)

I believe that this RTO formula should also include some estimate of the ACK delay. When doing simulations with low-latency links, I observe that SRTT converges to the low delay and RTTVAR to about zero; the main element in the overall delay is in fact the ACK delay. This leads to two possible solutions:

What do you think?

ianswett commented 6 years ago

I think the latter approach makes the most sense.

I guess that would be SRTT + 4*RTTVAR + MaxAckDelay?

martinthomson commented 6 years ago

Does this mean that we are advertising MaxAckDelay?

huitema commented 6 years ago

I would not advertise MaxAckDelay. But I would compute statistics based on the Ack Delay values observed in incoming ACKs.

janaiyengar commented 6 years ago

I was thinking about exactly this last week, and I agree with Ian that the ack delay should be explicitly included in the RTO computation. This lets us combine the components only where it's useful. We could allow for a windowed max, or have some way of decaying the max, but I think using a max ack delay makes the most sense.

I agree with Christian that we can simply compute this from observed values.

ianswett commented 6 years ago

I don't think doing this requires explicitly communicating MaxAckDelay, but I think it would help.

My understanding is that TCP would greatly benefit from having an explicitly communicated max ack delay and there's a proposal in TCPM to explicitly communicate it, so it seems useful to communicate in QUIC.

Trying to infer the max based on observed data is substantially more complex and is going to be wrong if a connection hasn't been active for long.

huitema commented 6 years ago

To Ian's point: I would not mind seeing a "max ack delay" in the transport parameters. But my point on statistics was different. The idea would be to compute the likely ack delay range based on observations. This will be capped by the max delay, in the same way that the RTO is capped by the max RTO. But it will often be much smaller.

janaiyengar commented 6 years ago

As I think about using max delay, I don't think that's a useful parameter for us. I'll try to explain my reasoning -- I could be wrong, of course.

All this makes me think that we want to use the ack delay with a max filter, and not use max ack delay as a transport parameter. @ianswett @huitema : do you disagree?

martinthomson commented 6 years ago

Can someone write out an example algorithm for determining the maximum? I'm not sure how to manage the value over time or whether it's safe to assume that it won't change.

ack_delay_corrected = min(last_acknowledged.time_received - last_acknowledged.time_sent, ack.ack_delay)
max_ack_delay = max(max_ack_delay, ack_delay_corrected)

Is it this simple?
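
A runnable version of that sketch, for reference (Python; the field names are assumptions for illustration, not from the draft):

def update_max_ack_delay(max_ack_delay, last_acknowledged, ack):
    # Bound the peer-reported delay by the locally measured round trip,
    # so a misbehaving peer cannot inflate max_ack_delay arbitrarily.
    measured = last_acknowledged.time_received - last_acknowledged.time_sent
    ack_delay_corrected = min(measured, ack.ack_delay)
    return max(max_ack_delay, ack_delay_corrected)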

mikkelfj commented 6 years ago

You might consider https://en.wikipedia.org/wiki/Exponential_smoothing
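
A minimal sketch of that idea (alpha is a tuning choice; 1/8 would mirror the SRTT smoothing factor):

# EWMA over observed ack delays instead of a hard max (illustrative only)
smoothed_ack_delay = (1 - alpha) * smoothed_ack_delay + alpha * ack.ack_delay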

ianswett commented 6 years ago

@martinthomson That's what I'd start with, but I'm concerned it could be substantially incorrect in some cases.

In particular, early in a connection it's going to be wrong for a while, and if the peer wants to decrease the ack delay, it's going to stay high, unless an approach like @mikkelfj's suggestion of exponential smoothing is applied.

To prevent spurious retransmits, we want this estimate to err slightly high, anticipating that previous measurements may be slightly lower than the real max.

All this is doable, but seems a lot more complex than just sending the value to the peer. We could run some experiments with gQUIC to compare knowing a fixed value (25ms) against estimating it, and see how far off the estimate is in reality?

TLDR: Seems plausible, but arriving at a solution that works well in practice seems like real work involving both more thinking and experimentation.

janaiyengar commented 6 years ago

I believe so. It's straightforward if you're trying to track a connection-wide max. Slight correction to your code, since the max we're tracking is for a different thing:

max_ack_delay = max(max_ack_delay, ack.ack_delay)

This is good enough, and is safe enough. TCP conflates this ack delay with the SRTT measurement, and puts everything through an EWMA filter instead of the max filter that is being proposed here. Of course one could argue that we should also use an EWMA for this, but I think max is a better choice. It becomes a bit more complicated if we want to decay the max over time, but I think a connection-wide max is adequate.

Subsequently, the RTO code in Section 3.4.7.4 of the recovery draft becomes:

alarm_duration = smoothed_rtt + 4 * rttvar + max_ack_delay

It's safe to assume that the value won't go down -- a connection-wide max is conservative and is safer than any option that decays the max.

martinthomson commented 6 years ago

@janaiyengar, I agree about favoring larger values (which suggests that reduction isn't necessary), but why not bound the value by the observed delay? RTO determines the time that an endpoint holds state, so an unnecessarily inflated ack delay could be used as part of a DoS.

janaiyengar commented 6 years ago

@ianswett: Under what conditions do you see a client wanting to reduce (and being able to control precisely) the ack delay over time?

Bear in mind that all of this is used conservatively under a minRTO anyway.

janaiyengar commented 6 years ago

@martinthomson, sorry I misread. Your code works better.

huitema commented 6 years ago

I agree that Martin's code just works. There is just one caveat. I would only consider samples where the ACK acknowledges an "important" packet, instead of a pure acknowledgement. Otherwise, you have the scenario in which a peer receives an ACK packet, does not acknowledge it, and some time later sends a keep alive packet -- PING + ACK. In that case, the ACK delay can be way over the "maximum".
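
In code, that caveat is a one-line guard (a sketch; is_ack_only is a hypothetical predicate on the acked packet):

# Only sample ack delay when the largest newly acked packet was retransmittable.
if not largest_newly_acked.is_ack_only:
    max_ack_delay = max(max_ack_delay, ack.ack_delay)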

TCP LEDBAT does something a bit more complex to estimate the min RTT: it computes the minimum RTT over a sliding window of one-minute slices. The reason is that the minimum can vary over time when routing changes, and using the wrong minimum in LEDBAT leads to poor control. We can argue that there is no such time dependency for the ack delay: it is purely determined by conditions at the peer. So the simpler algorithm, a max over the duration of the connection, is better.
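
For comparison, a sketch of that LEDBAT-style windowed minimum (RFC 6817 keeps roughly ten one-minute slices; names here are illustrative):

from collections import deque

slices = deque([float('inf')], maxlen=10)  # one entry per minute

def on_rtt_sample(rtt, new_minute_started):
    if new_minute_started:
        slices.append(rtt)                 # open a new one-minute slice
    else:
        slices[-1] = min(slices[-1], rtt)
    return min(slices)                     # current windowed min RTT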

By the way, there is something similar going on with the time_reordering_fraction and reordering_threshold variables. They too can probably be estimated by observing the ACK. But that's probably for another thread.

mikkelfj commented 6 years ago

Connections can be very long-lived and experience intermittent congestion. Would it make sense to use a hard maximum plus 10% over the last 5 minutes, or say, half an hour?

janaiyengar commented 6 years ago

@huitema : The reason LEDBAT does a windowed min is safety -- if the min goes up but the sender is not tracking the increased min, the sender becomes over-aggressive. Considering LEDBAT is a scavenging sender, this is bad. The situation here is (i) quite the opposite, since keeping the max is the conservative option for max ack delay, and (ii) a slightly inflated RTO is nowhere near as important as the min RTT estimator is to a delay-based congestion controller.

Your point on using ackable packets (what you call "important" packets) is spot on. I don't think we say anything about that in the spec, and we should.

@mikkelfj : Ack delay is not impacted by network congestion. Ack delay measures the amount of delay experienced at the host, from receiving a packet to when an ack is sent.

martinthomson commented 6 years ago

I think that @huitema's suggestion to remove ack-only packets from consideration is a necessary amendment.

We recently added min_rtt calculations. It would be good if we could learn from other examples here. That's currently a simplistic minimum, and unlike max_ack_delay it might benefit from some more sophistication.

mikkelfj commented 6 years ago

@janaiyengar thanks for clarifying.

janaiyengar commented 6 years ago

@martinthomson : We should be careful though. We added min RTT calculations in QUIC for bounding the ack delay, and I would not use a more sophisticated min estimator for just ack delay. The LEDBAT controller is fundamentally wrong if the sender's min RTT estimate is incorrect, justifying the additional complexity in min RTT.

Delay estimation for congestion control tends to be more demanding. Delay estimation for loss detection/recovery doesn't need that kind of accuracy.

martinthomson commented 6 years ago

#984 implements the suggestion to remove ACK-only packets from RTT calculations.

ianswett commented 6 years ago

@huitema Agreed that we should only be considering packets that are expected to generate acks (aka retransmittable packets) when looking at ack delay.

@janaiyengar A client may want to reduce the ack delay once it realizes the peer is very close. For example, I can imagine a datacenter-type environment where the ack delay is reduced to some very small value if the RTT is <1ms. LAN/WiFi also comes to mind. It's possible this can be accomplished by always immediately acking handshake packets and only clueing the peer into the actual ack delay post-handshake. Or just lying about the ack delay until you've decided what the ack delay should be?

In terms of RTO, one thing I forgot is that TCP assumes that if multiple packets are outstanding, delayed acks won't be an issue: an ack is sent immediately if there's a gap, or as soon as two packets are received.

I believe this and the fact ack delay would be implicitly included in SRTT and RTTVAR are why TCP doesn't add ack delay into the RTO time.

I don't think QUIC implementations should assume this acking behavior, since there is some evidence that sending acks much less often than every 2 packets is highly beneficial.

I'm somewhat unhappy about this, however, as it leaves us with somewhat researchy behavior in three categories:
1) Recommending an ideal acknowledgement behavior.
2) What assumptions a peer can (or can't) make about the other peer's acknowledgement behavior (ie: are past delays indicative of future ones? Are multiple packets treated differently from single packets?)
3) Algorithms for determining ack delay.

@huitema reordering_threshold doesn't have this problem because it's in packet number space, so it's pretty straightforward. time_reordering_fraction could, but it shouldn't as long as acks are sent as soon as possible whenever ack-creating packets are received that are prior to the largest acked.

ianswett commented 6 years ago

@martinthomson Re #984, We could exclude ack-only packets from other RTT calculations, but I don't think that's necessary. I think it's only necessary for creating a model of the max ack delay, since there's no delayed ack timer for acks.

janaiyengar commented 6 years ago

(Apologies for the long note.)

@ianswett : Yes, you are right that TCP doesn't communicate ack delay explicitly and that ack delay is part of the SRTT / RTTVAR and therefore the RTO. On TCP receiver behavior, this is specified: RFC 5681, Section 4.2 requires that an ack be sent for at least every second full-sized segment. (We don't require this yet in the spec, but we do need to specify "correct" behavior here. We should take that to a different issue, though.)

Min RTO and TLP in TCP use numbers based on the receiver ack delay that is commonly expected, not what the peer actually uses. At the moment, TCP and QUIC both have a fixed delayed ack timer. The problem MAD fixes in TCP is that a receiver is unable to indicate to a sender that the delayed ack timer specified in the RFC is much larger than the one it is actually using. Explicitly stating this helps the sender make its TLP and RTO timers faster, since they are then based on knowledge of peer behavior rather than on values specified in the RFC. (See slides 3 and 4 in https://datatracker.ietf.org/meeting/97/materials/slides-97-tcpm-tcp-options-for-low-latency/)

By explicitly noting the ack delay in acks, QUIC already solves this problem for the most part.

The difference between using a statistical max of observed ack delays vs a pre-configured explicitly indicated max ack delay is a separate problem. The peer might tell us that it's using a max, but as I pointed out earlier, a peer may not be in a position to limit the actual observed max, simply because OS scheduling delays are not controllable and not necessarily known.

For a peer that is unsure about what it wants to use as its own max ack delay, as @ianswett says, it can simply control the observed max ack delay by not delaying anything until it knows better. Commonly, however, and as we do in GQUIC, peers will simply use a fixed max. In both cases, using a statistical measure over observed ack_delays works well.

As I think about this, I realize that I missed something earlier. The max ack delay should inform our TLP and RTO min values. So, based on MAD for TCP (see Section 3.5 in https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00), we'd actually want to:

alarm_duration = smoothed_rtt + 4 * rttvar + max_ack_delay

instead of the current:

alarm_duration = smoothed_rtt + 4 * rttvar
alarm_duration = max(alarm_duration, kMinRTOTimeout)

I'd like to also keep the ack delay out of the SRTT estimate (as the QUIC recovery draft currently does), since we add it back in during the RTO computation, even though this would be a difference from TCP.
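
Put together, this keeps the peer's delay out of the smoothed estimators and restores it, once, when arming the alarm (a sketch using the draft's variable names and RFC 6298-style filters):

# RTT sample with the reported ack delay removed, when safe:
adjusted_rtt = latest_rtt
if latest_rtt - ack.ack_delay >= min_rtt:
    adjusted_rtt = latest_rtt - ack.ack_delay

# SRTT / RTTVAR updated from the adjusted sample:
rttvar = 3/4 * rttvar + 1/4 * abs(smoothed_rtt - adjusted_rtt)
smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt

# ...and the delay added back when arming the RTO:
alarm_duration = smoothed_rtt + 4 * rttvar + max_ack_delay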

martinthomson commented 6 years ago

This all seems sensible, but remember to document these little divergences from TCP.

ianswett commented 6 years ago

Jana and I discussed this today and he convinced me there's no need for explicitly communicating a max ack delay, at least until it's shown to be a problem.

The plan is as follows (see the sketch after this list):
1) alarm_duration = smoothed_rtt + 4 * rttvar + max_ack_delay
2) Justify why max_ack_delay is being added to the RTO time: because we're subtracting the peer's ack delay from SRTT and RTTVAR, both of those are expected to be smaller than their TCP equivalents.
3) max_ack_delay = max(max_ack_delay, ack.ack_delay), where ack.ack_delay is an ack delay that was used in an RTT sample. Peer-communicated ack delays that would cause the RTT to drop below min_rtt are ignored, to avoid inflating max_ack_delay. Any time the largest acked is an ack-only packet, the ack delay is also ignored, as @huitema suggested.
4) No change will be made to the min RTO at the moment, but we'll open another issue to discuss whether it should be kept.
5) We may want to set max_ack_delay to a default of 25ms until at least one delayed ack is observed. I can imagine some algorithms here, but I'd rather not specify anything (ie: just initialize max_ack_delay to 0) until we are happy with the algorithm.
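
A sketch pulling those pieces together (Python; names follow the recovery draft's pseudocode where they exist, and are otherwise illustrative):

def on_ack_received(ack, largest_newly_acked, latest_rtt, min_rtt, state):
    # Point 3: only track ack delays that were usable in an RTT sample.
    if largest_newly_acked.is_ack_only:
        return                            # ignore ack-only packets
    if latest_rtt - ack.ack_delay < min_rtt:
        return                            # ack_delay was ignored for this sample
    state.max_ack_delay = max(state.max_ack_delay, ack.ack_delay)

def rto_alarm_duration(state):
    # Point 1: the alarm includes the tracked max ack delay.
    return state.smoothed_rtt + 4 * state.rttvar + state.max_ack_delay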

@janaiyengar Feel free to correct me, and others should feel free to comment if they think anything here doesn't make sense.

janaiyengar commented 6 years ago

Thank you for summarizing, @ianswett, this plan SGTM.

pravb commented 6 years ago

I missed this engrossing discussion. Bullet 5 is a problem. The issue today in datacenters is that minrto is set by the sender and the ack delay is set by the receiver, but there is no way to negotiate this value. Windows tries to solve this by using the handshake RTT to pick aggressive values for both minrto and the delayed ACK timeout, but it is less than ideal since both ends could measure the handshake RTT differently. By starting out with a fixed 25 msec, we prevent a much lower minrto in low-latency intra-DC scenarios. I think we should still negotiate the default starting value in the transport parameters, as proposed in the TCP MAD option. Thoughts?

ianswett commented 6 years ago

I'm happy to reopen this, since no one has any experience with the current suggestion of measuring max ack delay and the TCP MAD option is seen as valuable by the TCP folks for the reasons you suggest.

pravb commented 6 years ago

To clarify, my only additional request is that we negotiate the initial delayed ACK timeout value in the transport parameters. Is there any harm in negotiating this for the initialization so we cover the low latency scenarios?

huitema commented 6 years ago

@pravb , it seems that your main requirement is to not impose a floor of 25ms on the max ACK delay, correct? In that case, the simplest solution is to treat it much like the RTT: use the 25ms value as long as no ack delay has been measured, and replace it with the first observed value when it arrives. In practice, the value will be set when the first handshake packet is received.
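
That initialization rule is simple to state in code (a sketch; the 25ms default and the names are illustrative):

max_ack_delay = 25e-3        # provisional default (seconds) until measured
have_sample = False

def on_ack_delay_sample(sample):
    global max_ack_delay, have_sample
    if not have_sample:
        max_ack_delay, have_sample = sample, True  # first observation replaces the default
    else:
        max_ack_delay = max(max_ack_delay, sample)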

That only leaves open the case of the very first packet, the client initial. But there is no way to use negotiation there, since by definition the client has not yet heard anything from the server. It boils down to local configuration.

janaiyengar commented 6 years ago

@huitema: I would characterize the requirement slightly differently -- what @pravb seems to be saying is that the floor on the minRTO is too high, which was the motivation for MAD. I agree with @huitema on not needing this in the transport params, since the first RTT, even for MAD, is unknown -- this is basically the SYN RTO. We can do exactly what TCP achieves with MAD without the transport param, since we dynamically receive the ack_delay in received acks.

pravb commented 6 years ago

"replace it by the first observed value when it comes"

We don't know if the first observed value includes time for a delayed ACK timeout. In fact, the sender won't know the timeout until the receiver decides to delay an ACK, which depends on the ACK generation algorithm. So minrto may remain aggressive, based on just the RTT, until that happens?

janaiyengar commented 6 years ago

@pravb : That seems appropriate, doesn't it?

You can use max(ack_delays) to find out the most that the receiver has delayed by in the past, and use that to predict. This has all the same problems of predicting based on past values, but if a max is specified, then a receiver can prime the sender's minRTO by deliberately delaying a few acks to the max timeout it believes it will use.

MAD in TCP specifies the max that a receiver will delay by, but it doesn't leave room for adaptation, and it requires a receiver to know the max it will end up delaying by. This is more plausible in the kernel, but in userspace, where scheduler delays may be included, it's not clear that the receiver can predict this max.

ianswett commented 6 years ago

Question: Since the ability to change how minRTO is calculated is a big motivation here, possibly we should do that now?

QUIC currently specifies that the handshake timer should be armed and the RTO timer should not be, so the RTO will never fire until the handshake completes; that may provide enough time to get a reasonable estimate of RTT, RTTVAR, and max ack delay. At that point, could the min RTO be removed entirely, now that the RTO includes a max ack delay term?

janaiyengar commented 6 years ago

I'm personally strongly in favor of eliminating minRTO for QUIC in particular, but let's make that a separate discussion (because that might turn into a flamewar). I don't think we should do that on this bug.

I'll file a new bug.

pravb commented 6 years ago

@janaiyengar: priming deliberate ACK delays seems suboptimal to me. I agree MAD doesn't allow for adaptation and I like that QUIC is incorporating that but I still am not convinced that we don't need a seed value. Maybe we should first close on #1017 and come back here.

Lol, I hope it will not be a flame war if it is well justified. Maybe orthogonal, but I would like us to aim for a design goal of fair sharing between QUIC and TCP flows on the same bottleneck link, since we are going to live with both protocols coexisting for a long time to come.

janaiyengar commented 6 years ago

@ianswett suggested earlier that we use a transport param, similar to MAD, which we can still do to prime the peer. The peer can still run a max estimator but explicitly primed via this param. I think I'd be fine with that.

I was kidding about the flamewar :-) but I do think that discussion is separate. I don't think we need to resolve #1017 before wrapping up this one... we can retain a fixed minRTO, and use the max of ack delays for the RTO computation as discussed here.

ianswett commented 6 years ago

I think eliminating MinRTO, as discussed in #1017, may be a bit safer in the presence of both measured and explicitly supplied max ack delay.

Without the explicitly supplied MAD, a spurious RTO occurs if no acknowledgement is received within 3x the initial RTT measurement (SRTT + 4*RTTVAR). Given peers know this, and the client typically gets an RTT signal first, possibly the server should send acks immediately until it has an RTT signal? Even then, the process of starting to send delayed acks seems prone to causing spurious RTOs, particularly if the RTTVAR is small. There's a weird incentive to ensure a delayed ack is sent early enough that the RTTVAR hasn't dropped.
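
(The 3x figure falls out of the RFC 6298-style initialization the recovery draft follows: after the first RTT sample R, smoothed_rtt = R and rttvar = R/2, so smoothed_rtt + 4 * rttvar = R + 4 * (R/2) = 3 * R.)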

ianswett commented 6 years ago

I'm parking this until I or someone else can provide some data that shows this is valuable and when.

ianswett commented 6 years ago

Jana, Martin and I are increasingly thinking this is worth adding, so unparking it.

pravb commented 6 years ago

Just a heads up that per a recent discussion with Yuchung, MAD is no longer being pursued for TCP.

ianswett commented 6 years ago

Thanks Praveen. When I talked to Neal, it sounded like the top two concerns about MAD for TCP were deployability-related.

The third was an inability to update the value later in the connection if the initial value was too large or too small. I think that's still an issue with the proposal to put it in the transport params. I'm a fan of adding an UPDATE_TRANSPORT_PARAMS frame to fix this and a few other cases where we'd like to update values, but I can imagine other solutions.