quinn-rs / quinn

Async-friendly QUIC implementation in Rust

Unbounded queues might lead to excessive memory usage #1131

Open Matthias247 opened 3 years ago

Matthias247 commented 3 years ago

The endpoint and per-connection logic run on individual tokio tasks. To communicate between those tasks, unbounded futures::channel::mpsc channels are used. Specifically, there are 2 channels: one from the endpoint task to each connection task, and one from the connection tasks back to the endpoint task.
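A minimal sketch of that structure (the type names and wiring here are illustrative, not quinn's actual internals):

```rust
use futures::channel::mpsc;

struct Datagram; // an incoming datagram the endpoint routes to a connection
struct Transmit; // outgoing datagram(s) a connection wants on the UDP socket

fn wire_up() {
    // Endpoint task -> connection task: received datagrams for this connection.
    let (_to_connection, _from_endpoint) = mpsc::unbounded::<Datagram>();
    // Connection task -> endpoint task: datagrams to transmit.
    let (_to_endpoint, _from_connections) = mpsc::unbounded::<Transmit>();
    // Neither channel ever applies backpressure, which is the concern below.
}
```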

The fact that these channels are unbounded raises the question of what exactly happens if one task produces data faster than the other can consume it. The question has probably become more important recently due to the addition of fairness between tasks: since the endpoint task now yields more often, the connection task has more chances to enqueue new data that can't be sent right away.

The usual problems with this design are unbounded memory growth, and that as soon as the queue reaches a certain size the components mostly work on outdated, no-longer-interesting data; both latency and reliability end up terrible.

Let's briefly look at each side individually:

Endpoint task -> Connection task

This queue is used for forwarding received datagrams. In the connection task, all received datagrams are currently processed immediately. As long as a single-threaded runtime is used, the queue is not really unbounded, since no new datagrams will be enqueued by the endpoint while the connection is working on the existing ones. If more datagrams arrive in the meantime, they queue up in the UDP socket and get dropped there.

I think this is ok for the moment.

Connection task -> Endpoint task

The connection produces datagrams whenever the congestion controller deems it ok, and enqueues them on the endpoint for transmission. I think this queue can get problematic in some situations, e.g. when the connection task gets scheduled often enough between endpoint task iterations that it produces more datagrams than the endpoint can send in one iteration, or when multiple connections all have data to send.

I don't really think that letting the endpoint task perform writes until the queue is drained would be the right solution - it would again lead to long times where the stack doesn't process either peer data (ACKs) or user data.

At the moment I can see a variety of ways to improve this:

  1. The simplest: Use a bounded channel between the components and try_send. If the queue is full because the endpoint can't keep up, the packet gets dropped and the congestion controller deals with it (see the sketch after this list). Downsides:
    1. We spend all the CPU time for producing a packet (it's expensive!), and then just drop it
    2. Might not scale well in a scenario where an endpoint is serving a high number of connections. There doesn't seem to be any fairness regarding which packets get dropped (although one could claim random == fair?).
  2. Build some reservation system where connections can only produce packets if endpoint capacity is available. Examples:
    1. Add a single async semaphore to the connection -> endpoint channel, and let the connection acquire MAX_TRANSMIT_DATAGRAMS permits before producing packets and submitting them. The permits get released once the endpoint is done transmitting the packets. If the semaphore is fair, then all connections should be able to produce batches of packets in order.
    2. Do the same, but with one async semaphore per connection. This essentially builds a per-virtual-channel limit even though there is only one real channel. It seems a bit easier to understand, since it simply limits the number of outgoing datagrams a single connection can have in-flight. But it doesn't limit the overall number of outgoing packets queued in the endpoint or channel (which matters more with a really high number of connections).
  3. Restructure the whole quinn module to merge the connection and endpoint tasks, and only call poll_transmit on connections if there is actually capacity left to transmit packets. That would also allow the endpoint module to implement fairness between connections. This might be computationally cheaper than everything else, but it would be a huge refactoring endeavour.
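For option 1, a minimal sketch assuming a tokio bounded channel (the Transmit type is illustrative):

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

struct Transmit; // a packet the connection just finished encoding

fn send_or_drop(tx: &mpsc::Sender<Transmit>, packet: Transmit) {
    match tx.try_send(packet) {
        Ok(()) => {}
        // Queue full: the endpoint can't keep up. Drop the packet and let
        // loss detection / the congestion controller recover. This is
        // downside 1.i: the CPU spent encoding the packet is wasted.
        Err(TrySendError::Full(_dropped)) => {}
        // Endpoint task is gone; the connection is shutting down anyway.
        Err(TrySendError::Closed(_)) => {}
    }
}
```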
djc commented 3 years ago

If we use a bounded channel and check it for capacity before we try to build a packet, we do some extra work in the case of a TOCTTOU failure but might otherwise be okay?
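As a sketch of that idea, assuming a tokio bounded channel (Transmit and encode_packet are stand-ins): tokio's try_reserve hands out a slot up front, which would sidestep the TOCTTOU window entirely, since the packet is only built once its slot is guaranteed.

```rust
use tokio::sync::mpsc;

struct Transmit;

// Stand-in for the expensive packet-building work (poll_transmit).
fn encode_packet() -> Transmit { Transmit }

fn produce_if_capacity(tx: &mpsc::Sender<Transmit>) {
    match tx.try_reserve() {
        // Slot secured before we pay for encoding; the send cannot fail.
        Ok(permit) => permit.send(encode_packet()),
        // No capacity: skip building the packet at all.
        Err(_) => {}
    }
}
```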

Merging connection and endpoint tasks seems like it'd substantially limit parallelism?

I guess "excessive memory usage" and "fairness" are perhaps different issues.

Matthias247 commented 3 years ago

we do some extra work in the case of a TOCTTOU failure but might otherwise be okay?

Would you want to drop the built packet then, or wait until capacity is available (buffer locally - or use poll_send)?

Keep in mind that under high load every connection will encounter the problem, and everyone starting to drop packets would have a severe perf impact - so I wouldn't recommend that.

Storing the transmit locally first, trying to send it via poll_send, and not producing new datagrams until that is done should work, and is conceptually kind of the same as 2.i described above. The fairness then depends on the channel implementation that is used, and whether it fairly queues requests from senders that have started calling poll_send. I know that is the case for the futures-intrusive channel, and it might also be for the tokio one.
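A sketch of that pattern, assuming a tokio bounded channel whose async send parks the task until a slot frees up (used here as a stand-in for poll_send; poll_transmit stands in for the connection's packet builder):

```rust
use tokio::sync::mpsc;

struct Transmit;

// Stand-in: produce the next packet, if any. A real implementation would
// await more work instead of returning when this yields None.
fn poll_transmit() -> Option<Transmit> { None }

async fn connection_loop(tx: mpsc::Sender<Transmit>) {
    // Hold at most one produced packet at a time: don't build the next one
    // until the endpoint channel has accepted the current one, so no
    // unbounded queue can build up.
    while let Some(packet) = poll_transmit() {
        if tx.send(packet).await.is_err() {
            return; // endpoint task is gone
        }
    }
}
```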

A small benefit of the semaphore is that you can wait on a different unit, like datagrams, even though the send operation later on uses a Transmit (potentially spanning multiple datagrams). The downside is that you now have 2 synchronization primitives in the chain (the semaphore and the channel), instead of just one.
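A sketch of the semaphore variant (2.i), assuming tokio's FIFO-fair Semaphore; MAX_TRANSMIT_DATAGRAMS and the types are illustrative:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct Transmit;

const MAX_TRANSMIT_DATAGRAMS: u32 = 10;

async fn produce_batch(sem: &Arc<Semaphore>) -> Vec<Transmit> {
    // Wait until the endpoint has room for a whole batch before doing any
    // encoding work. Because the semaphore queues waiters fairly,
    // connections get to produce their batches in order.
    let permits = sem
        .acquire_many(MAX_TRANSMIT_DATAGRAMS)
        .await
        .expect("semaphore closed");
    // Hand ownership of the capacity to the endpoint rather than releasing
    // it when `permits` goes out of scope.
    permits.forget();
    (0..MAX_TRANSMIT_DATAGRAMS).map(|_| Transmit).collect()
}

// Endpoint side: release the capacity once `n` datagrams are on the wire.
fn on_transmitted(sem: &Semaphore, n: usize) {
    sem.add_permits(n);
}
```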

Merging connection and endpoint tasks seems like it'd substantially limit parallelism?

Kind of, but I'm less worried about that since the library performs on par or even worse with the multi-threaded tokio runtime anyway: the synchronization and context-switch overhead drowns out the parallelism gains. At the moment the best way to improve parallelism is to spin up an independent endpoint, but that requires more application-level changes. Anyway, that is kind of a separate discussion from this one.

I guess "excessive memory usage" and "fairness" are perhaps different issues.

They are for sure, but the impact on the other issue should be considered.

djc commented 3 years ago

So using a bounded channel and "caching" any packets produced if the channel is full seems like a fairly straightforward change that would move this forward, then?

Ralith commented 3 years ago

Kind of, but I'm less worried about that since the library performs on par or even worse with the multi-threaded tokio runtime anyway: the synchronization and context-switch overhead drowns out the parallelism gains.

Have we actually benchmarked this with multiple connections? If so, it's a strong argument in favor of merging the tasks for simplicity. I don't think the refactoring would be too bad.

Ralith commented 6 months ago

The connection produces datagrams whenever the congestion controller deems it ok, and enqueues them on the endpoint for transmission.

This is bounded in the worst case by the sum of all connections' congestion windows, so I think the threat here is limited in practice. Getting rid of the intermediate queue, e.g. by transmitting to the UDP socket directly, might still be a significant constant-factor improvement in worst-case memory use.

Ralith commented 6 months ago

https://github.com/quinn-rs/quinn/pull/1729 fixes the biggest risk here by no longer buffering datagrams in userspace at all.