quicwg / base-drafts

Internet-Drafts that make up the base QUIC specification
https://quicwg.org

Desirable behavior when it takes time to derive the traffic keys for the next PN space #3821

Closed kazuho closed 4 years ago

kazuho commented 4 years ago

During the handshake, there are moments when an endpoint receives a packet from its peer and calculates the traffic keys of the next PN space. To give some examples, a server receives an Initial packet containing ClientHello, and derives the Handshake keys. Or, a client receives the Handshake packets, verifies the handshake transcript (including the certificate chain) and derives the 1-RTT keys.

While deriving the traffic keys of the next PN space, an endpoint might receive packets belonging to the next PN space. Note that the process of deriving the traffic keys might take some time.

I see Chrome running on a PC spending ~100ms verifying the certificate chain, when it is launched from the command line along with the URL to be fetched. I presume that such launch-to-fetch-URL behavior would not be uncommon on PCs and handheld devices. Another (more) extreme case would be when OCSP is involved.

In these cases, the time required for deriving the new keys could take more than a handful of RTTs.

Hence the following questions, which might be worth clarifying in the specifications:

Q1.1. While deriving the traffic keys of the next PN space, should an endpoint continue running the QUIC transport for the existing PN space?

Q1.2. Should an endpoint drop packets of the next PN space that arrive before the traffic keys for that space are ready?

Q2. Should the peer keep arming the PTO timer for packets of the next PN space while the endpoint derives the keys?

Re Q1.1, I think that the answer is yes, it should. Such an endpoint should calculate the next set of traffic keys asynchronously. It should continue running the QUIC transport for the existing PN space. To give an example, while the client verifies the certificate chain, it should ACK and continue ACKing the Handshake packets that it receives.

Re Q1.2, I think the answer is no. An endpoint SHOULD buffer packets of the next PN space while it calculates the traffic keys for that PN space. This is because the peer cannot distinguish the endpoint dropping packets from packets being dropped by the network. Dropping the packets leads to the peer's congestion controller exiting the startup phase, which is going to have a negative effect on performance.
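The buffering described here can be sketched as follows (all names are illustrative, not from any draft); recording the receive timestamps matters later, when the delayed acknowledgements are generated:

```python
import time
from collections import defaultdict

class UndecryptableBuffer:
    """Holds packets whose keys are not yet available, remembering when
    each packet arrived so the eventual acknowledgement can account for
    the buffering delay. All names here are illustrative."""

    def __init__(self, max_packets=16):
        self.max_packets = max_packets      # bound the memory spent buffering
        self.buffers = defaultdict(list)    # level -> [(recv_time, datagram)]

    def stash(self, level, datagram, recv_time=None):
        """Buffer an undecryptable packet; returns False if full (dropped)."""
        if recv_time is None:
            recv_time = time.monotonic()
        buf = self.buffers[level]
        if len(buf) >= self.max_packets:
            return False                    # a bounded implementation must drop
        buf.append((recv_time, datagram))
        return True

    def drain(self, level):
        """Once keys for `level` arrive: packets in arrival order, with
        their original receive times."""
        return self.buffers.pop(level, [])
```

The per-level cap reflects the point above: buffering should be preferred over dropping, but it still has to be bounded.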

Re Q2, I would argue that, ideally, the peer SHOULD refrain from arming the PTO timer for the packets of the next PN space, when it knows that the endpoint has all the information to derive the traffic keys of the next PN space, and knows that the endpoint would send something once those traffic keys are derived. To give an example, a server SHOULD NOT arm the PTO timer for 0.5 RTT data until it receives ClientFinished. At the very least, we should caution the server to not send Handshake(PING) packets alongside the 1-RTT probe, as ACKs for Handshake(PING) packets would reset the PTO timer.

martinthomson commented 4 years ago

We encountered a problem similar to the one described here in our stack. The latter part of a coalesced packet was being dropped because we had to spend time validating the certificate. We ultimately added a small amount of buffering to address this. And I have debugged a case where handshakes were failing due to the interaction between PTO timers in different packet number spaces. For that, I don't have a good solution yet; I'm not convinced that deferring PTO is the right answer.

But I don't think that this requires anything in specifications, aside from perhaps a caution, noting that we already have a caution in Section 5.7 of the TLS draft. That already suggests and permits (though doesn't recommend) buffering.

We could add a note that failing to buffer legitimate packets could lead to delivered packets being marked as lost unnecessarily, and to performance degradation.

kazuho commented 4 years ago

> And I have debugged a case where handshakes were failing due to the interaction between PTO timers in different packet number spaces. For that, I don't have a good solution yet; I'm not convinced that deferring PTO is the right answer.

I would argue that a probe packet is expected to be sent when an endpoint does not receive a response that it expects. That's necessary because it is the only way to guarantee forward progress (and therefore we use a MUST). But in this particular case that we are discussing, the ball is on the side of the peer. The peer is expected to respond with a new TLS handshake message contained by a QUIC packet using the next PN space. To paraphrase, sending packets before knowing that the peer is ready to decrypt them is a performance optimization. Doing that once is fine, but having a MUST for repeating that is questionable. While I am happy to discuss the nuance, I do think that it needs to be something other than "MUST send a probe."

Note also that an endpoint sending PTOs in this case would in turn require the peer to have a larger buffer for storing undecryptable packets in order to avoid buffer overflow (that leads to performance degradation).

> But I don't think that this requires anything in specifications, aside from perhaps a caution, noting that we already have a caution in Section 5.7 of the TLS draft. That already suggests and permits (though doesn't recommend) buffering.

Aside from relaxing the MUST pointed out above, I think I'd be fine with just adding notes. Though I might argue that this is more important than the existing note that discusses an "exceptional" case; some endpoints would almost always need time to calculate the next keys.

kazuho commented 4 years ago

FWIW, section 4.1.4 of the TLS draft states: "As keys at a given encryption level become available to TLS, TLS indicates to QUIC that reading or writing keys at that encryption level are available. These events are not asynchronous; they always occur immediately after TLS is provided with new handshake bytes, or after TLS produces handshake bytes."

Part of the problem is that this is untrue. TLS stacks sometimes take time to provide the next set of traffic keys. And QUIC stacks do not even pretend that TLS stacks act synchronously. Instead, they intentionally send ACKs for Handshake packets that are received later than the 1-RTT packets they buffer.

This behavior is observable by the peer, and can lead to an exit from slow start depending on how the loss recovery is designed. I think we might want to use normative words stating what the expectations are.

ianswett commented 4 years ago

Thanks for filing this. I think I mostly understand the issue, and I think this is mostly a correctness issue with Chrome. That being said, we may want to add some caution that this is a potential issue, and that handshake operations (i.e., cert verification on the client and remote handshaking on the server) should be kept from excessively delaying acknowledgements.

I've seen lots of cases where the cert is spread over 3 packets and Chrome ACKs the first two packets before the cert is verified, so this doesn't seem to be a case of key availability. Google servers don't have cert compression enabled yet, so many certs don't fit into 1 or 2 packets.

The recovery draft already says that peers don't arm for ApplicationData (including 0.5-RTT) before the handshake is confirmed. Knowing the peer has the information to derive the keys isn't really enough, since you'd also have to know the peer has received a packet of the next encryption level. As a result, I can't think of any changes to this part of the PTO logic I'd want to make.

kazuho commented 4 years ago

@ianswett I agree that there are no design changes necessary to the recovery draft.

The crux of the problem is the TLS draft. It currently states:

> As keys at a given encryption level become available to TLS, TLS indicates to QUIC that reading or writing keys at that encryption level are available. These events are not asynchronous; they always occur immediately after TLS is provided with new handshake bytes, or after TLS produces handshake bytes. (Section 4.1.4)

However, contrary to what the TLS draft says, Chrome (and IIUC what @martinthomson says, Firefox also) processes the TLS handshake messages containing the certificate chain in an asynchronous manner.

Consider the following example, where a web browser receives four packets:

The TLS draft suggests that the client would be processing PN=3 before PN=4, because the browser would process the TLS handshake transcript synchronously and would therefore have the 1-RTT keys when it recognizes PN=3 for the first time.

But the reality is that, if the browser takes 100ms to verify the certificate chain, the client buffers PN=3, acking PN=4 before PN=3, even though the interval between PN=3 and PN=4 is greater than max_ack_delay.

As can be seen, this behavior is observable from the server side.

Therefore, we need to fix the TLS draft at least (assuming that we agree that browsers are doing the correct thing).

ianswett commented 4 years ago

One more note is that the RTT samples can be greatly inflated if a packet (i.e., the Initial) is lost and the PTO fires to retransmit the data, and that causes the previously buffered Handshake packets to be acknowledged. Coalescing an ack-eliciting Handshake packet with the Initial typically mitigates this, as long as it's processed before the buffered packets. Since coalescing is already an optimization in the draft, it may be worth a note that it can prevent inflated RTT samples.

I considered an algorithm to discard these RTT samples, but filtering RTT samples is a dark art I'd rather not put into the recovery spec unless it's absolutely necessary.

We've observed some cases where this causes the handshake to time out because the server's PTO is so long and our handshake timeout is fairly short (9 or 10 seconds).

kazuho commented 4 years ago

The discussion about the RTT estimate is interesting. I'm not sure about the Initial vs. Handshake case, but when the client spends X milliseconds (e.g., X=100) verifying the certificate chain while buffering the 0.5-RTT data it receives, the server would obtain an RTT sample inflated by X - max_ack_delay milliseconds (e.g., 75 ms). That's not great.

One way of resolving the issue would be to not cap ack_delay at max_ack_delay for ApplicationData packets that were sent during the handshake. The rationale for not arming PTO for 1-RTT packets during the handshake is that the peer might not immediately have the keys necessary to process those packets; we expect the peer to buffer such packets. When the sender makes that assumption, it does not really make sense to cap the ack_delay.

I think doing something like that would be fine, even though I would not push on recommending such a strategy.
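The arithmetic can be sketched as follows (numbers are illustrative; the function mirrors the adjusted-RTT step of the recovery draft, with the uncapped-during-handshake variant discussed above):

```python
def rtt_sample(latest_rtt, ack_delay, max_ack_delay, handshake_confirmed, min_rtt):
    # Cap the peer-reported ack_delay only once the handshake is
    # confirmed; before that, the peer may legitimately have buffered
    # the packet (awaiting keys) for longer than max_ack_delay.
    if handshake_confirmed:
        ack_delay = min(ack_delay, max_ack_delay)
    adjusted = latest_rtt - ack_delay
    return max(adjusted, min_rtt)   # never adjust below the observed floor

# Illustrative scenario: 25 ms path RTT, 100 ms certificate
# verification reported as ack_delay, max_ack_delay of 25 ms.
path_rtt, verify_time, max_ad = 0.025, 0.100, 0.025
latest = path_rtt + verify_time                        # what the server measures
capped = rtt_sample(latest, verify_time, max_ad, True, path_rtt)     # inflated
uncapped = rtt_sample(latest, verify_time, max_ad, False, path_rtt)  # ~path RTT
```

With the cap applied, the sample comes out at 100 ms instead of the true 25 ms; without the cap, the full reported delay is subtracted and the sample lands back on the path RTT.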

janaiyengar commented 4 years ago

@ianswett raises a good point about inflated RTTs. @kazuho's point about max_ack_delay actually exposes a grey area in the current semantics of that parameter. It is expected to be the maximum delay that the endpoint introduces; other delay is supposed to be considered part of the "path". This is a situation where that assumption doesn't hold well. Separately, reporting this delay as part of ack_delay assumes that the implementation timestamps the packet when it is received at the endpoint. This means that if the endpoint actually does the cert verification synchronously, and does not read from the socket until that is complete, it will need to read kernel receive times. (This also assumes that an endpoint is expected to timestamp the packet on receipt, not on decryption. That is very much how most stacks will do it, and in the common case there's not much difference at the moment between these two times.)

I'm tempted to say that we don't try to solve this in the protocol yet. Let's experiment with strategies that we think might work in implementations, share notes, and come back to the spec when we have a good answer. For the time being, we expect that these delays at clients will not be super common, and even when they occur, a couple of RTT samples should not significantly sway the RTT estimate. I'm ok with that.

kazuho commented 4 years ago

@janaiyengar

> Separately, reporting this delay as part of ack_delay assumes that the implementation timestamps the packet when received at the endpoint. Meaning that if the endpoint actually does the cert verification synchronously, and does not read from the socket until that is complete, then it will need to read kernel receive times.

I think we do not need to require endpoints to do that. We do not even require endpoints to verify the certificate chain asynchronously, meaning that we allow endpoints to simplify their implementation at the risk of a performance penalty.

I'd argue that not capping ack_delay during the handshake is strictly better than always capping the value, assuming that the client's clock is not drifting. That is because the server gets a better RTT estimate when the client reports the correct numbers. If the client fails to do so, it'd be the same as the status quo, where the RTT estimate becomes larger than the actual value.

All that said, I am totally fine with leaving this as an open area for endpoints to experiment in.

janaiyengar commented 4 years ago

Summarizing my thoughts after discussing this with @ianswett and thinking about this with @kazuho. Apologies for being verbose, but I'm trying to summarize from the top.

Problem:

At a high level, the problem we are discussing here is that an endpoint buffers packets, because it isn't able to read them or write acks for them, and it might take a while to dequeue the buffer. As a result, the acks can be sent much later than when the packets were received at the endpoint, causing the ack_delay to potentially be much higher than max_ack_delay. This inflates the smoothed_rtt and rtt_variance, inflating the PTO, which can cause handshakes to fail (if a "handshake timeout" period is employed).

The problem is that the receiving endpoint can take arbitrarily long to respond. The ack_delay field is meant to signal precisely this -- we tout it as an important distinction from TCP -- and it is useful here, except that it is limited to max_ack_delay in the various RTT estimations. This was done deliberately, allowing any additional delays to become part of the path delay, on the expectation that such additional delays are outside the control of the QUIC endpoint and are likely to be recurring. While that remains true, the issue we are discussing here is an extreme case that occurs during the handshake and is not expected to recur.
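The inflation compounds through the RTT filter. A sketch using the recovery draft's smoothing constants (the starting values are my assumption, chosen for round numbers):

```python
def update_rtt(smoothed_rtt, rttvar, sample):
    # The EWMA filter from the recovery draft (same shape as RFC 6298).
    rttvar = 0.75 * rttvar + 0.25 * abs(smoothed_rtt - sample)
    smoothed_rtt = 0.875 * smoothed_rtt + 0.125 * sample
    return smoothed_rtt, rttvar

def pto(smoothed_rtt, rttvar, max_ack_delay=0.025, granularity=0.001):
    return smoothed_rtt + max(4 * rttvar, granularity) + max_ack_delay

# Start from a clean 25 ms estimate (initial rttvar is an assumption
# for the example), then feed one 100 ms sample inflated by buffering.
srtt, rttvar = 0.025, 0.0125
pto_before = pto(srtt, rttvar)                   # 100 ms
srtt, rttvar = update_rtt(srtt, rttvar, 0.100)
pto_after = pto(srtt, rttvar)                    # ~172 ms
```

A single inflated sample grows the PTO by roughly 70% here, mostly through rttvar; a "handshake timeout" budget can then be exhausted after only a few probes.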

Proposed Solution:

We use the ack delay signal during the handshake, and acknowledge that we might need to treat ack delays reported during the handshake period as slightly special. Specifically, we ignore max_ack_delay, but we still limit the RTT sample to >= min_rtt. An endpoint does this until the handshake is confirmed, that is, until the endpoint is certain that the peer has 1-RTT read and write keys and therefore no longer has to buffer packets for lack of keys.

Also:

Separately, the recovery spec currently says not to arm the PTO on ApplicationData until the handshake is complete. This should be handshake confirmed. The handshake completes at a client when it sends its CFIN, after which the client can send 1-RTT data. The CFIN, however, might get lost, and the server might buffer the 1-RTT data until it receives a retransmission of the CFIN. Changing this to handshake confirmed ensures that no PTO is armed on ApplicationData while there is still Initial or Handshake data in flight.
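The proposed condition can be sketched as follows (a simplification: the real recovery-draft timer logic also covers anti-deadlock probes before address validation, which this omits):

```python
def pto_spaces(handshake_confirmed, in_flight):
    """Return the PN spaces eligible for PTO arming. `in_flight` maps
    'initial'/'handshake'/'application' to ack-eliciting bytes
    outstanding in that space. A sketch of the rule under discussion,
    not the full recovery-draft timer logic."""
    spaces = [s for s in ('initial', 'handshake') if in_flight.get(s, 0) > 0]
    # The change in question: ApplicationData only participates once the
    # handshake is confirmed, i.e., once the peer is known to hold 1-RTT
    # keys and can acknowledge whatever we (re)send.
    if handshake_confirmed and in_flight.get('application', 0) > 0:
        spaces.append('application')
    return spaces
```

With "confirmed" rather than "complete", a client with unacknowledged 1-RTT data but an outstanding CFIN keeps its PTO on the Handshake space, where the retransmission is actually needed.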

kazuho commented 4 years ago

Regarding @janaiyengar's last point, I've changed the moment the client starts arming the PTO timer for the ApplicationData PN space from handshake complete to handshake confirmed. As Jana states, that is the correct thing to do based on the rationale we have been using to explain why it should not be armed.

martinthomson commented 4 years ago

FWIW, I was really concerned about that last change. Aside from it making this a non-editorial change, the way that we avoid handshake deadlock depends very much on HANDSHAKE_DONE being followed by probes. But as the server confirms and completes the handshake at the same time, I've convinced myself that this is OK.

(That this is what our stack did, largely by accident, is vindication of sorts...)

I would encourage people to sit down and run the analysis on that change to convince themselves that this is not reintroducing the handshake deadlock.

The other part of the proposed solution needs to be slightly different. @janaiyengar proposes that we ignore max_ack_delay prior to confirming the handshake. I think that's good, but it is not completely correct, nor would it be sufficient.

We need to ignore max_ack_delay for acknowledgements of packets that were sent prior to handshake confirmation. Those are the ones that might have been buffered until keys were ready.

In order for that to work, we have two new requirements on packet processing for buffered packets:

  1. endpoints MUST process and acknowledge (or discard) any packets that they might have buffered during the handshake within max_ack_delay of completing the handshake

  2. endpoints MUST report an ACK Delay that is based on when a packet is received, not when it was processed
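Requirement 2 amounts to timestamping packets on receipt rather than on processing. A minimal sketch (class and method names are mine):

```python
import time

class AckTimestamps:
    """Records receive times so that the reported ACK Delay covers the
    whole intentional delay: buffering while keys were unavailable plus
    any deliberate ack delaying. Names are illustrative."""

    def __init__(self):
        self.received_at = {}            # packet number -> receive time

    def on_packet_received(self, pn, now=None):
        # Stamp on receipt, even if the packet is then buffered
        # undecrypted; later processing time must not shrink the delay.
        self.received_at[pn] = time.monotonic() if now is None else now

    def ack_delay(self, largest_acked, now=None):
        now = time.monotonic() if now is None else now
        return now - self.received_at[largest_acked]
```

A stack that reads from the socket only after synchronous certificate verification would need kernel receive timestamps to populate `on_packet_received` accurately, as noted earlier in the thread.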

kazuho commented 4 years ago

Regarding max_ack_delay, I think @martinthomson's comment is accurate with possibly one slight modification.

> endpoints MUST report an ACK Delay that is based on when a packet is received, not when it was processed

We might want to say that ACK Delay is the amount of time that the information (whether the packet itself or the acknowledgement) was intentionally buffered.

The observation here is that any unintentional delay (e.g., decryption time, payload processing time) should be considered part of the RTT, because it is unavoidable.

And the added bonus of defining it this way is that no change would then be required for QUIC stacks that process the TLS handshake messages synchronously.

kazuho commented 4 years ago

Just FWIW, I might point out that we are actually making the design more streamlined, by improving consistency in how we handle ack delay. The new rule is:

It's simply that up until now, we have misused max_ack_delay in the last bullet-point case.

martinthomson commented 4 years ago

This is a good outcome, and I agree with Kazuho.

However, I am going to request that someone (not me preferably) take this to the mailing list to confirm with a wider group.

janaiyengar commented 4 years ago

@martinthomson: I agree with what you've spelled out above (here). It is close to what I had in mind, but I didn't do a good job of spelling out fully what I was thinking.

janaiyengar commented 4 years ago

Posting the email I sent out to the list here, FTR. It got support and no pushback, FWIW. Note that this resolution makes this a design issue, so I'm removing the editorial tag.

Problems:

During the handshake, an endpoint can receive packets (carrying data or acknowledgements) that it cannot
decrypt because the requisite keys are not yet available. This can happen because the crypto machinery at
the endpoint is waiting for earlier packets to be received to yield the keys. Similarly, a client might not
be able to read 1-RTT packets received from the server because it is waiting for certificate validation to
complete.

1. The assumption in the TLS specification is that when an endpoint is busy doing something (certificate
validation or key generation for instance), it won't be able to respond to received packets with
acknowledgements (Section 4.1.4).

Contrary to the assumption in the spec, Chrome and Firefox both process the certificate chain asynchronously,
and therefore ack Handshake packets immediately, even the ones that arrive after they have started verifying
the certificate chain. But they buffer 1-RTT packets, and therefore the issue exists for 1-RTT packets.

As a result, there's potential delay, sometimes significant, in responding to some packets during the
handshake. There are two side effects of this.

S1. First, this means that an endpoint that runs a PTO timer and retransmits data for 0-RTT and 1-RTT packets
while the handshake is still in progress might be doing this quite unnecessarily. The peer may well have
received this data and have queued it, waiting to be able to decrypt it. 

S2. Second, acks for these packets that are buffered and then later decrypted and read can be sent much later
than when the packets were received at the peer, causing the ack_delay to potentially be much higher than the
advertised max_ack_delay. The endpoint ignores any ack_delay larger than max_ack_delay, and this inflates the
smoothed_rtt and rtt_variance, potentially inflating the PTO significantly (which can cause handshakes to
fail if the endpoint employs a "handshake timeout" period).

Proposed Solutions:

S1. To solve the first problem, we already have a condition in the recovery draft to _not_ arm the PTO on
1-RTT data until the handshake is complete (Section 6.2.1). This is not quite adequate, since it still allows
a client to arm a PTO timer on 1-RTT data after sending a CFIN and 1-RTT data, before ensuring that the
server has received the CFIN. The proposed solution here is to change this condition from "handshake
complete" to "handshake confirmed", meaning that an endpoint only arms a PTO timer on 0-RTT or 1-RTT data
when it knows that the peer has the keys necessary to read and write 1-RTT packets, and can therefore
respond with an ACK.

We think this holds up, but we'd like to get this under everyone's eyes, to ensure that the discussion isn't
missing subtle deadlock issues.

S2. To solve the second problem, the proposal is to appropriately use the ack_delay reported by the endpoint
that buffers the data. ack_delay is the amount of time that the information (whether the packet itself or
the acknowledgement) was intentionally buffered. For this, we need the following:

- endpoints MUST report an ack_delay that is the sum of the intentional delays (i.e., the time the data was
buffered awaiting keys plus the time the acknowledgement itself was delayed)

- endpoints ignore max_ack_delay for acknowledgements of packets that were sent prior to handshake
confirmation. Those are the ones that might have been buffered at the peer until keys were ready.

In addition, we explicitly recommend that endpoints that asynchronously process handshake messages buffer
1-RTT packets instead of dropping them, as dropping packets has a significant impact on congestion control.
Note that this matches what those implementations already do.

janaiyengar commented 4 years ago

@martinthomson noted:

> - endpoints ignore max_ack_delay for acknowledgements of packets that
> were sent prior to handshake confirmation. Those are the ones that
> might have been buffered at the peer until keys were ready.

For this, endpoints need to retain one additional value: the highest packet number
that was sent prior to confirmation (or the lowest packet number sent after confirmation).
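That bookkeeping could look like the following sketch (hypothetical names; the spec would not prescribe a data structure):

```python
class AckDelayCapPolicy:
    """Sender-side bookkeeping (hypothetical names): remember the first
    packet number sent after handshake confirmation. Acks of anything
    earlier may legitimately report an ack_delay above max_ack_delay,
    because the peer may have buffered those packets awaiting keys."""

    def __init__(self):
        self.first_pn_after_confirm = None   # None: not yet confirmed

    def on_handshake_confirmed(self, next_pn_to_send):
        if self.first_pn_after_confirm is None:
            self.first_pn_after_confirm = next_pn_to_send

    def effective_ack_delay(self, acked_pn, ack_delay, max_ack_delay):
        sent_before_confirmation = (self.first_pn_after_confirm is None
                                    or acked_pn < self.first_pn_after_confirm)
        if sent_before_confirmation:
            return ack_delay                 # skip the max_ack_delay cap
        return min(ack_delay, max_ack_delay)
```

Storing the boundary packet number (rather than a per-packet flag) is the single extra value the comment above describes.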

janaiyengar commented 4 years ago

This should be marked design.

ianswett commented 4 years ago

I'm trying to fully digest everything above, so I apologize if I missed something important.

The idea that you wouldn't acknowledge packets that can be processed while waiting for keys for queued packets is clearly unrealistic for a variety of reasons, so we should definitely remove any text that assumes otherwise.

I'm unclear about why moving from handshake complete to confirmed is an improvement, given the example you cite seems to imply the server is retransmitting unnecessary data due to client delays, and for the server, complete and confirmed happen at almost identical times. Prior to the CFIN or 1-RTT data being acknowledged, the client knows the server has only one of the two keys. PTO will retransmit the CFIN on the first PTO and the 1-RTT data on a subsequent PTO (assuming the first PTO doesn't send both). I agree we expect HANDSHAKE_DONE to make this irrelevant in practice, but I can't reason about why this change is an improvement based on the problem description. Can someone lay that out for me?

I believe we should make normative changes to transport along the lines of S2. Specifically, I think we should say receivers SHOULD include the delay from the time an undecryptable packet is buffered to the time it's acknowledged in the ack_delay, and should process all buffered packets for a given encryption level before sending an ACK. Ideally, a receiver would also process buffered packets after the full datagram is processed, since that allows senders to fully mitigate this issue, but we could leave that as a description rather than normative text if people prefer. This provides more information to the sender, which it can ignore if it wants, and is much better than sending an Ack Delay of 0us in that case. It's a bit odd that we say a receiver SHOULD set the ack delay field to 0 for Initial and Handshake packets today.

kazuho commented 4 years ago

> I'm unclear about why moving from handshake complete to confirmed is an improvement, given the example you cite seems to imply the server is retransmitting unnecessary data due to client delays, and for the server, complete and confirmed happen at almost identical times.

I think it is because the client cannot make an assumption about how long it will take the server to verify the client's TLS handshake messages, which are sent using Handshake packets. An important thing to note here is that this might not be just Finished; it might include client certificates. Theoretically, the client's handshake transcript might be so large that it requires multiple round trips to be sent.

Regarding ack_delay, I'm not sure I fully follow. But I think I agree that it would be reasonable for an endpoint to send a Handshake ACK with a non-zero ack delay, if it buffered a Handshake packet that was undecryptable at the moment it was received, and later processed it. That is a natural outcome of changing the definition of ack delay to the sum of the undecryptable-buffering delay and the acknowledgement delay.

As Initial packets would always be decryptable, I think the "SHOULD set ack delay to 0" rule can remain for Initial packets.

ianswett commented 4 years ago

> I'm unclear about why moving from handshake complete to confirmed is an improvement, given the example you cite seems to imply the server is retransmitting unnecessary data due to client delays, and for the server, complete and confirmed happen at almost identical times.

> I think it is because the client cannot make an assumption about how long it will take the server to verify the client's TLS handshake messages, which are sent using Handshake packets. An important thing to note here is that this might not be just Finished; it might include client certificates. Theoretically, the client's handshake transcript might be so large that it requires multiple round trips to be sent.

Specifically, is the concern that the client will have its Handshake data acknowledged and will spuriously PTO 1-RTT packets a few times over while waiting for the server to make 1-RTT receive keys available? Is this a problem that's been observed? It seems like it would in practice require client certs, since verifying only a CFIN should be very fast.

Making the change to wait for handshake confirmed before arming PTO for 1-RTT data doesn't seem harmful given the existence of HANDSHAKE_DONE, but I'm also not sure it will matter much in practice. It also adds an exception to the current rule that PTO is armed when the client isn't sure the server has completed address validation or there are bytes in flight. If the Handshake data has been acknowledged but the 1-RTT data hasn't, you'd have bytes in flight but not arm the PTO, if I understand correctly.

Given there's still a SHOULD about sending packets from multiple PN spaces on PTO, including Handshake + 1RTT, I think this change will perform well, but I do have apprehension about seemingly small but subtle changes like this late in the process.

> Regarding ack_delay, I'm not sure I fully follow. But I think I agree that it would be reasonable for an endpoint to send a Handshake ACK with a non-zero ack delay, if it buffered a Handshake packet that was undecryptable at the moment it was received, and later processed it. That is a natural outcome of changing the definition of ack delay to the sum of the undecryptable-buffering delay and the acknowledgement delay.

> As Initial packets would always be decryptable, I think the "SHOULD set ack delay to 0" rule can remain for Initial packets.

Agreed, the potential for delay is only for the first Handshake and 1-RTT ACKs.

My summary: The issue is that ACKs for Handshake and 1-RTT packets could be delayed because the peer doesn't have the keys (either because it takes time to derive them, or because it's missing the crypto data needed to derive the next keys).

This creates two potential issues (neither of which affects correctness, but both could be suboptimal):

  1. Packets may be acknowledged later than expected, causing the peer to spuriously PTO.
  2. The first RTT sample in the Handshake and 1-RTT packet number spaces could include substantial delay in excess of the "no ack delay" rule for Handshake or max_ack_delay for 1-RTT, which inflates SRTT and RTTVar.

Changing when to arm PTO for 1-RTT data mitigates one case of the first issue (the server being slow to make 1-RTT receive keys available), but I don't believe it helps otherwise, such as with delays in generating Handshake keys. Are there any other design changes (besides complete->confirmed) being considered or suggested for the first issue that I'm missing?

If there's no loss, the second issue fixes itself without harm, which is nice. And it's not as bad for ApplicationData, since that will have a non-zero max_ack_delay, which may be enough to absorb the endpoint's delay generating keys or doing cert verification, though it typically won't be enough to absorb a packet being lost, the data being PTO'd, and the retransmission then making keys available. I think Jana's suggestion that ack_delay include the time packets were buffered is sensible, and ignoring max_ack_delay for the first sample in a PN space seems relatively safe given at least one prior sample. But what if there are no Initial ACKs received and the ACK with the large ack_delay is the first ACK of the connection? I guess you're stuck with one bad sample and hope the next is better?

The one change we've found that mitigates this issue entirely is coalescing packets from the next PN space (Handshake or 1-RTT) and then processing the full coalesced datagram before processing all buffered packets and sending an ACK. That ensures there's one new packet being acknowledged, so you get a non-delayed RTT sample. It doesn't seem onerous to me to add a "SHOULD process all packets in the current datagram and all buffered packets before sending an ACK, to avoid generating multiple inflated RTT samples".
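The ordering described can be sketched as follows (hypothetical API; only the ordering matters: datagram first, then buffered packets, then a single ACK):

```python
def process_datagram(datagram_packets, buffered, keys_available, send_ack):
    """Process every coalesced packet in the arriving datagram first
    (this may derive keys for the next level), then drain previously
    buffered packets, and only then send one ACK, so the ACK covers a
    freshly arrived packet and yields a clean RTT sample."""
    for pkt in datagram_packets:
        if keys_available(pkt.level):
            pkt.process()                    # may make new keys available
        else:
            buffered.append(pkt)
    remaining = []
    for pkt in buffered:                     # keys may now cover these
        if keys_available(pkt.level):
            pkt.process()
        else:
            remaining.append(pkt)
    buffered[:] = remaining
    send_ack()                               # a single ACK after all processing
```

Sending the ACK only after both passes means the acknowledgement always covers the just-arrived coalesced packet, which is what avoids the inflated samples.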

janaiyengar commented 4 years ago

@ianswett: I've yet to process the messages above -- but this issue was covered in this email thread on the list.

kazuho commented 4 years ago

@ianswett

> Specifically, is the concern that the client will have its Handshake data acknowledged and will spuriously PTO 1-RTT packets a few times over while waiting for the server to make 1-RTT receive keys available? Is this a problem that's been observed? It seems like it would in practice require client certs, since verifying only a CFIN should be very fast.

I think the issue is about principles.

I tend to agree that it would be fine for an endpoint to arm the PTO timer for 1-RTT packets when it knows that the peer is going to be able to handle them. However, if we are going to talk about such cases, then I might point out there are other similar cases missing from the draft that might be as important as the case you point out:

My view is that all these are "optimizations" that an endpoint might attempt, based on the knowledge of the form of the TLS handshake being used. But they are just optimizations.

I think it would be fine to talk about such optimizations, but at the same time I think we should provide safe and principled rules as a baseline, so that endpoints following the recommendations do not end up sending a bunch of unnecessary packets.

> Given there's still a SHOULD about sending packets from multiple PN spaces on PTO, including Handshake + 1-RTT, I think this change will perform well, but I do have apprehension about seemingly small but subtle changes like this late in the process.

Yeah, I tend to think of this as a bug fix. The existing PTO rules seem to assume that the client will spend time processing the Handshake packets but the server will not. The reality is more nuanced, as shown by the client-certificate case as well as the examples above. And where the assumption is too aggressive (the client-cert case), we might see a number of unnecessary packets being sent.

All that said, I do think that we should try to minimize the impact of the changes we introduce to the drafts. That's why I think something along the lines of "SHOULD NOT arm the PTO timer for ApplicationData packets until the handshake is confirmed, unless the sender can assume that the peer will be capable of processing ApplicationData packets when they arrive" would work. Clients arming PTO for 1-RTT packets when client certificates are not used would fall into this exception, so the design change would not affect existing QUIC stacks, assuming they do not use client certificates. Stacks that do use client certificates would have to apply this fix.
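The proposed baseline rule plus its exception reduces to a one-line predicate. A sketch (hypothetical function name; the inputs mirror the conditions named above):

```python
def may_arm_app_data_pto(handshake_confirmed, peer_can_process_1rtt):
    """Sketch of the proposed rule: SHOULD NOT arm the ApplicationData
    PTO timer until the handshake is confirmed, unless the sender can
    assume the peer is already able to process 1-RTT packets (e.g. a
    client in a handshake that does not use client certificates)."""
    return handshake_confirmed or peer_can_process_1rtt
```

Under this framing, today's client behavior in the common no-client-cert case is unchanged (the exception applies), while a client that sent a certificate would hold off arming the ApplicationData PTO until the handshake is confirmed.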

kazuho commented 4 years ago

@ianswett

> But what if no Initial ACKs are received and an ACK with a large ack_delay is the first ACK of the connection? I guess you're stuck with one bad sample and hope the next is better?

That's correct. The resolution proposed by @janaiyengar does not mitigate that issue. By including the queuing delay of undecryptable packets in ack_delay, the proposed resolution improves the quality of RTT samples. But, as you say, it does not help when that is the first sample.

> It doesn't seem onerous to add a "SHOULD process all packets in the current datagram and all buffered packets before sending an ACK to avoid generating multiple inflated RTT samples" to me?

I'm not sure that doing that mitigates the problem. The proposed resolution recommends that endpoints send an ACK for a handshake message before spending time processing it. For example, a client would ack the Handshake packets before starting to verify the certificate chain. After that, it would send a Handshake packet (containing ClientFinished only; no ack would be included) and a 1-RTT packet acking the 0.5-RTT data sent by the server.

If the Handshake ack reaches the server, the server will have a good min_rtt estimate. If it does not, the next sample the server receives is the ack for the 0.5-RTT packet(s). As servers do not arm the PTO timer for 0.5-RTT data, it is hard to have a protocol mechanism that prevents the first 1-RTT ACKs sent by the client from carrying an ack_delay field that reflects the time spent verifying the certificate chain.
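Why a bad first sample is sticky: in the RFC 9002-style scheme, min_rtt tracks the raw latest_rtt with no ack_delay adjustment at all, so a first sample inflated by certificate verification stays in min_rtt until a genuinely smaller sample arrives. A minimal sketch:

```python
def update_min_rtt(min_rtt, latest_rtt):
    """min_rtt is the minimum of raw latest_rtt values, with no
    ack_delay subtraction (RFC 9002-style). A first sample inflated
    by cert-verification time therefore sticks until a smaller raw
    sample is observed."""
    return latest_rtt if min_rtt is None else min(min_rtt, latest_rtt)
```

So if the first sample is 150 ms (40 ms of network RTT plus ~110 ms of verification), min_rtt stays at 150 ms until a later ack yields a smaller raw measurement, and in the meantime ack_delay subtraction on subsequent samples is limited by this inflated floor.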

ianswett commented 4 years ago

I created a separate issue (#3980) and PR (#3981) for the inflated RTT problem, since the solution seems fairly disjoint.

ianswett commented 4 years ago

> @ianswett
>
>> But what if no Initial ACKs are received and an ACK with a large ack_delay is the first ACK of the connection? I guess you're stuck with one bad sample and hope the next is better?
>
> That's correct. The resolution proposed by @janaiyengar does not mitigate that issue. By including the queuing delay of undecryptable packets in ack_delay, the proposed resolution improves the quality of RTT samples. But, as you say, it does not help when that is the first sample.
>
>> It doesn't seem onerous to add a "SHOULD process all packets in the current datagram and all buffered packets before sending an ACK to avoid generating multiple inflated RTT samples" to me?
>
> I'm not sure that doing that mitigates the problem. The proposed resolution recommends that endpoints send an ACK for a handshake message before spending time processing it. For example, a client would ack the Handshake packets before starting to verify the certificate chain. After that, it would send a Handshake packet (containing ClientFinished only; no ack would be included) and a 1-RTT packet acking the 0.5-RTT data sent by the server.

I agree that if you have both sets of keys, the transport should send an ACK and not wait for all asynchronous crypto processing to complete. This is similar to a proxy receiving a GET request: the proxy acknowledges the GET immediately even if it has to go to a backend to respond. But that doesn't prevent greatly inflated RTT samples when the keys aren't immediately available.

ianswett commented 4 years ago

> @ianswett : I have yet to process the messages above -- but this issue was covered in this email thread on the list.

It's also above, but the inflight PR only attempts to address part of the problem, not the inflated RTT issue that I mentioned halfway through the issue.

Also, I'm finding it confusing to track which problems we're trying to solve and which are already largely solved, so I was hoping that restating my understanding would let you correct me if I'm misunderstanding something.

martinthomson commented 4 years ago

I think we should flag this as design and run the process. It's hard to pin down exactly what changes substantively here, but it's worth getting eyes on it.