private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License
561 stars 165 forks source link

picoquic ignores ACK #1734

Closed jmuecke closed 2 months ago

jmuecke commented 2 months ago

I used picoquic as QUIC client to communicate with a modified quic-go server, which responds with an acknowledgment without coalesced ServerHello in response to the ClientHello. The acknowledgment seems to be ignored by picoquic, as it does not react with probe packets.

This issue is most noticeable when the server is slow to complete the handshake. To simulate this, I introduced a delay after the ACK. I used the picoquic interop-runner docker image for testing. A pcap file is attached, with the ACK visible in Wireshark as packet No. 6.

The ACK should allow the client to accurately estimate the RTT, program the PTO, and send probe packets upon PTO expiry. Currently, picoquic does not react to the acknowledgment by sending probe packets.

I suppose the issue is in picoquic_update_path_rtt, but wasn't able to track down the exact line causing this. Most relevant seem to be the handling of the first RTT estimate, but this and the checks in L285-287 look correct to me: https://github.com/private-octopus/picoquic/blob/e78af222dab8eb1816d1b505da20dffe557aa3e1/picoquic/timing.c#L293-L296

Do you have any ideas on this?

picoquic-ignores-instant-ack.pcapng.gz

huitema commented 2 months ago

Do you happen to have a qlog trace? I need to correctly reinstall wireshark on my new laptop, and that will take some time.

jmuecke commented 2 months ago

Yes, this and the terminal output.

test_log.txt 8b77bebc8f7bd8d6.client.qlog.txt

huitema commented 2 months ago

Looking at the qlog, I see this initial exchange:

image

I am not sure what you feel is wrong there. The picoquic client sent an Initial message that was acked. There is no need to retransmit anything, since the message was acked. The client waits until the next message from the server, which arrives 200ms later. The handshake proceeds normally after that.

jmuecke commented 2 months ago

Yes, there is no need to retransmit data, but a probe packet should be sent based on the new RTT estimate.

Assume that instead of being slow to continue the handshake, the remaining server packets are lost and the server is blocked by the anti-amplification limit.

In that case, we want picoquic to send probe packets to unblock the server after PTO expiry. Picoquic will wait until its default PTO (1s) expires to send the probe packet. However, upon receiven the acknowledgment, the RTT estimate of the client should be updated to a much lower value, resulting in a lower PTO and lower time to recovery from this supposed packet loss.
Following the above example, the RTT is ~16 ms, and since this is the first RTT sample the PTO will be ~48 ms. Since no probe packet is sent within that time, it appears that the acknowledgment is not handled correctly.

What do you think? Did I miss anything?

I think below are the relevant paragraphs from the RFCs:

According to RFC9000 the client MUST send a probe packet during this phase in the handshake.

In this case, when the client has no reason to send additional packets, the server will be unable to send more data because it has not validated the client's address. To prevent this deadlock, clients MUST send a packet on a Probe Timeout (PTO); see Section 6.2 of [QUIC-RECOVERY]. Specifically, the client MUST send an Initial packet in a UDP datagram that contains at least 1200 bytes if it does not have Handshake keys, and otherwise send a Handshake packet.

And the timer SHOULD be restarted every time an ack-eliciting packet is sent or acknowledged RFC9002:

A sender SHOULD restart its PTO timer every time an ack-eliciting packet is sent or acknowledged, or when Initial or Handshake keys are discarded (Section 4.9 of [QUIC-TLS]).

huitema commented 2 months ago

You had me a bit confused with the reference to QUIC-TLS. I think you meant QUIC-RECOVERY. But yes, the correct text is the anti deadlock specification in section 8.1 of RFC 9000.

Deadlock prevention is handled by the code in function picoquic_prepare_packet_client_init, lines 1960-1972 in sender.c. The send code in sender.c is supposedly always executed immediately after packets have been received, and it will set the Initial PTO based on the most recent timer. So we should have seen a repeat code for deadlock prevention after a short timer.

Obviously, that did not work in your case. It is worth investigating, but I need first to build a test that reproduces the issue.

huitema commented 2 months ago

Turns out that the repro is not easy. The repro step are:

  1. client sends initial packet
  2. server sends immediate ACK in a packet by itself
  3. server sends the beginning of the first flight, but all packets are lost. Also the first flight is not complete, and no packets are coalesced.
  4. Verify that client wakes up after PTO and sends a probe in a short delay

I do have tests of the amplification limit, but none of them start with "server sends immediate ACK in a packet by itself". The initial ACK are always bundled with an initial packet containing the server hello. That's a different behavior. And it is not too easy to change.

In fact, if your implementation send the Initial ACK in the same packet at the server hello, we probably would not see the issue.

jmuecke commented 2 months ago

Deadlock prevention is handled by the code in function picoquic_prepare_packet_client_init, lines 1960-1972 in sender.c. The send code in sender.c is supposedly always executed immediately after packets have been received, and it will set the Initial PTO based on the most recent timer. So we should have seen a repeat code for deadlock prevention after a short timer.

Which send code are you referring to exactly? Looking at picoquic_prepare_packet_client_init and functions calling it, I only found it to be called in the context of sending packets, which would have it work with the default PTO in the example.

Following the RFC9002 reference above, the PTO timer needs to be restarted on reception of an acknowledgement that newly acknowledges an ack-eliciting packet.

Regarding the reproduction, I agree with the given steps. But sending instant ACKs (ACK without ServerHello) happens quite often with current deployments. Maybe one could only construct the ACK without simulating the entire server and check if the client sends an additional packet before the default PTO?

huitema commented 2 months ago

Picoquic uses just one global timer, set to the earliest of all timed events. The sending code is executed when that timer fires, and runs a "what do I have to do" process to execute whatever actions are pending. Obviously, a step is missed in the initial state, and that process does not execute properly.

And yes, it is probably simpler to treat this as a unit test, as you describe.

huitema commented 2 months ago

My initial repro does not find an issue:

image

You see in the test a PTO probe being sent 30 ms after the ACK was received, which is the expected behavior. I noticed the error message after the log of the incoming ACK packet in the previous example, "Received unpadded initial, length=51", so I changed the test to send short ACKs, and get a trace much similar to yours:

image

So, no repro. I think that you observed the issue on an old version of the code, and that the bug was fixed in PR #1731, Fix RTT computation on first sample, specifically the lines:

image

Fixing the closing parenthesis ensured that the retransmission timer would be updated after the first RTT measurement. A version before that would have kept the initial retransmission timer until the 2nd measurement, which would cause exactly the bug that you are reporting.

I am adding the test code in PR #1739, "Add test of initial PTO for retransmissions." I will make sure that the Docker image also gets updated.

huitema commented 2 months ago

I think this is fixed. The interop runner image has been updated.