quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0
3.84k stars 394 forks source link

Sent ACK state accumulates when acting as a pure receiver #1756

Closed honzasp closed 7 months ago

honzasp commented 9 months ago

I have a Quinn client and server, where the server sends a stream of datagrams to the client. I found that the client is leaking memory, and memory profiling with Heaptrack has shown that the culprit is the BTreeMap::insert() call in quinn_proto::connection::spaces::PacketSpace::sent(), called from PacketBuilder::finish_and_track(). Here is a screenshot from Heaptrack:

Screenshot from 2024-02-15 21-06-45

Here is a minimal reproducing example: https://github.com/honzasp/quinn_leak

It seems that packets are added to PacketSpace::sent_packets and then never removed. Perhaps because the data is sent using datagrams and not via streams?

Ralith commented 9 months ago

Thanks for the report! Packets are removed from sent_packets when they are acknowledged or deemed lost. This mechanism is not sensitive to the contents of packets, including whether they contain stream or datagram data. Have you verified that sent_packets.len() grows without bound?

honzasp commented 9 months ago

Thanks for your prompt response! I added a debug print of self.sent_packets.len() to PacketSpace::sent() and it grows without bound in the client (but not in the server).

honzasp commented 9 months ago

I also tried using Quinn from the latest main branch, with the same result.

Ralith commented 9 months ago

Interesting; thanks for confirming. I'll try to have a look this weekend.

Ralith commented 9 months ago

The SentPacket entries accumulating on the client side, i.e. on the receiver (I missed that at first), are ACKs. Because the client never sends anything ACK-eliciting (i.e. application data) itself, the sender is never technically required to transmit its own ACKs, without which the client does not discard state. We should transmit ACKs occasionally even when not required to. We used to, but broke it in https://github.com/quinn-rs/quinn/pull/1415. We should also defensively forget ACK-only packets that accumulate beyond a threshold.

As an immediate mitigation, you can arrange for an endpoint which is only receiving data at the application layer for long periods to occasionally send something. For example, in the repro, having the client send one datagram for every 1000 it receives prevents state accumulation.

Relevant RFC text:

Endpoints acknowledge all packets they receive and process. However, only ack-eliciting packets cause an ACK frame to be sent within the maximum ack delay. Packets that are not ack-eliciting are only acknowledged when an ACK frame is sent for other reasons.

When sending a packet for any reason, an endpoint SHOULD attempt to include an ACK frame if one has not been sent recently. Doing so helps with timely loss detection at the peer.

An endpoint that is only sending ACK frames will not receive acknowledgments from its peer unless those acknowledgments are included in packets with ack-eliciting frames. An endpoint SHOULD send an ACK frame with other frames when there are new ack-eliciting packets to acknowledge. When only non-ack-eliciting packets need to be acknowledged, an endpoint MAY choose not to send an ACK frame with outgoing frames until an ack-eliciting packet has been received.

An endpoint that is only sending non-ack-eliciting packets might choose to occasionally add an ack-eliciting frame to those packets to ensure that it receives an acknowledgment; see Section 13.2.4

A receiver that sends only non-ack-eliciting packets, such as ACK frames, might not receive an acknowledgment for a long period of time. This could cause the receiver to maintain state for a large number of ACK frames for a long period of time, and ACK frames it sends could be unnecessarily large. In such a case, a receiver could send a PING or other small ack-eliciting frame occasionally, such as once per round trip, to elicit an ACK from the peer.

Thanks for the minimal repro! That made diagnosing this very straightforward.

Ralith commented 9 months ago

Fix drafted in https://github.com/quinn-rs/quinn/pull/1761.

honzasp commented 9 months ago

Thank you very much for the quick diagnosis and the suggested workaround :pray:

Ralith commented 8 months ago

A big part of why we took so long to notice this is that if you're ever receiving stream data, you automatically send flow control updates occasionally that are themselves ACK-eliciting. State accumulation as demonstrated here can only occur when a connection is exclusively receiving application datagrams.

Ralith commented 7 months ago

This was fixed in #1761.