Limit memory consumption when large windows are used

huitema commented 1 year ago

See issue #1499. In case of losses, copies of lost packets are kept in the "loss confirmation" queue (pkt_ctx->retransmitted_newest, pkt_ctx->retransmitted_oldest), so that they can be processed if a late acknowledgement arrives. In case of large losses, this causes lots of memory to be frozen for a long time. The actual "spurious retransmissions" happen, but they are quite rare. It should be possible to limit the size of this queue, and its memory allocation.

huitema commented 1 year ago

The loss confirmation queue is visited during ACK processing, with the following goals:

maintain variables such as "max spurious rtt", "max reorder delay", "max reorder gap", which could be used to tune the loss detection logic -- but are not in fact used, these are just statistics that are available for a potential observer.
detect "spurious losses" and signal them to the congestion control algorithm
process the "ack of ack" logic, marking ACK frames, stream frames and other control frames as acknowledged, thus avoiding further repetition.

This logic works well, and reuses the existing ack of ack code. However, it is used rarely, as spurious loss detection is rather rare. If the packet loss rate is low, it does not use much resource, but if the packet loss rate is very high it can result in high memory requirements, as seen in issue #1499. There are three possible fixes:

1- Remove the loss confirmation queue entirely. This is easy to implement, but will cause drop in performance in cases of spurious loss detection, because the congestion control will not be able to compensate for the spurious losses. 2- Rewrite the loss confirmation code to not keep the full packets. For example, if the packet contained a stream data frame, it is sufficient to keep track of stream-id, offset, length and maybe FIN status. Add required headers, and the whole data could be contained in less than 100 bytes instead of more than 1500. This could result in significant gains, at the cost of extra processing... and a fair amount of new code. 3- Cap the size of the loss confirmation queue, for example to a fraction of the CWIN, or maybe a fraction of the max window. Once the queue reaches that size, do not add any more packets to it. This will mostly maintain the current behavior for the common case, but the cap will prevent widely increasing memory consumption in high loss cases.

Of the 3, the third appears the safest, so we will try it first. But first, we have to measure the size of the loss confirmation queue, so we can write tests to verify its behavior.

huitema commented 1 year ago

Attempts to just limit queuing have two consequences:

1) The CPU consumption generally increase. 2) The simulated execution time also increases, albeit by a small fraction.

The increase in CPU consumption is tied to "sending more packets". That's because spurious loss detection works less well. When a spurious loss is detected, data sent in the packet is marked as "already received", and will not be sent again. But if the packet is not it the queue, this marking does not happen, and duplicate versions of the data are resent.

Looking at the code in details shows that ACK-only packets are never placed in the queue, which means that the ack-of-ack logic if not used much after spurious loss of ACK frames. That's probably fine, since there are other ways to contain the size of ACKs.

This first attempt shows that the simple limitation of the queue size is probably not desirable.

huitema commented 1 year ago

Tried a simple fix: accept up to 1024 packets in the loss detection queue. All the tests are passing. The test with the largest queue size was "satellite_cubic_loss", in which the queue size was up to 7872 packets. Limiting the queue size did not affect performance, probably because when that many packets are deemed lost, these are "real" losses, not spurious. However, some interesting observation: running under debugger, the maximum memory allocation for the test process was 216MB, regardless of whether the queue size is limited or not. In theory, the size of the queue should have a visible effect, dropping from 13MB (16647872), rather than 1.7MB. In practice it does not. Moreover, the peak memory utilization happens before* the loss detection queue reaches its maximum.

The likely suspect here is the "data node". There is a list of 32768 "data nodes" allocated in the QUIC server context, for a total size of over 50MB. Do we need that many?

huitema commented 1 year ago

Analysis of memory consumption on the server/sender:

packets queued for retransmission: O(W)
packets queued for duplicate loss detection: O(Wx), with error rate X <= 1
data nodes used for retransmission: O(Wx)

Analysis of memory consumption on the server/sender:

data nodes queued for reordering: O(W), or more depending on server strategy.

How do we protect the server and the client? On the client, potentially cap the "MAXDATA" credits. On server, set cap to W. Worst case scenario requires 3W packets for retransmission, loss detection, and data queues. Maybe do not keep so many packets in the quic server pre-allocation queues. But the "max window" limit is per connection. If we consider the whole server side, the memory occupation grows as O(WC), with C the number of connections. Can we devise some server wide mechanism?

huitema commented 1 year ago

First question, what happens if the application runs out of memory? Memory allocation include a lot of small allocations, contexts for connections or streams, parsing of H3 messages, etc. Failure there typically results in connections closed with "local error".

The bulk of the memory is in three big stashes:

picoquic_stream_data_node_alloc: get stream data node from pool or through malloc
picoquic_create_packet: create a packet
picoquic_add_to_stream_with_ctx: allocate buffer for next data to send on stream (if not using just in time transmission)
picoquic_queue_stream_frame_for_retransmit: create "datagram" frames that hold data to retransmit -- just like datagram frames or misc frames.

Packets and Nodes are managed through pools. Failure to allocate a packet results in a MEMORY_ERROR, closing the connection, or possibly the whole application. Failure to allocate a node results in the incoming packet being dropped.

Misc frames are allocated through malloc. Failure to allocate a MISC frame results in a memory error .. which is then ignored! This will most likely result in a protocol error down the line, as missing stream data is never sent.

It seems we should have two goals:

1) proactively manage buffer pools so they never grow above a specified size for a specific QUIC context. 2) Make sure that failure to allocate buffers result in something sane.

The deployment guidance would thus be to first set a reasonable limit to how many packet buffers can be allocated, and make sure that this leaves enough memory so that all the small allocations can be correctly managed. This requires developing some kind of memory manager with two levers:

1) Manage sender memory by dynamically tuning the "cwin max" parameter that applies to all connections. Start decreasing the cwin_max if the size of the packet pool goes above some preset size. Resume increasing it if the packet pool falls below a "low water mark".

2) manage receiver memory by tuning the "max data" parameter. That part will require some further studies, but shrinking max data effectively forces the eer to correct packet losses before feeding new data, which will progressively decrease the size of the "packet nodes" pool.

In that context, the management of retransmitted packets is a bit secondary -- we can probably just keep the code as is, but maybe also keep track of the queued data.

huitema commented 1 year ago

Added a test to limit the MAX Data in a satellite run "with losses". Results:

The MAX Data is fixed to the theoretical BDP, which implies about 8100 packets
The max number of nodes allocated drop from 47K to 28K
The max number of packets drops from 22.6K to 13.7K
The transport time increases from 8 seconds to 11.5.

This does show a memory reduction, but not 28K is still about 3.5 times more than the BDP, which is not great. The increase in execution time roughly corresponds to the strategy of only increasing the BDP if it is half used. Trying a variant with a larger MAX Data shows that the relation between MAX Data and performance is not obvious:

The MAX Data is fixed to twice the theoretical BDP, which implies about 16000 packets
The max number of nodes allocated drop from 47K to 22K
The max number of packets drops from 22.6K to 16.5K
The transport time increases from 8 seconds to 10.

There are probably a combination of several effects. Limiting the BDP limits the amount of data in flight, which reduces the number of packet losses, which in tuner may well reduce node allocation somewhat. Nodes are allocated to incoming packets, regardless of their size, but the data frames may not always carry a full frame. In fact, error retransmission, as implemented, causes fragmentation of the data frames, which may explain why the number of nodes is more than twice what we expect.

dongsf commented 1 year ago

Thank you very much for your continued attention. I have tested these submissions, but they were merged on an old branch. Memory usage is already very low, around 30MB. However, when I conducted a stress test on the downstream on the client side, there was still an abnormal increase in memory usage, especially evident when network packet loss occurred. My device resources are very limited and cannot store so many qlogs. However, I was still able to track the entire memory allocation during the testing process through other means. It seems that there is some memory allocated by nodes that has not been recycled into the memory pool, resulting in a difference between nb_data_nodes_allocated and nb_data_nodes_in_pool .I don't know where this node memory went (because there are no longer streams sending a large amount of data). Of course, I will also test the new code branch to see if the same problem exists. I will update the test results in a timely manner and try to send the qlog when the anomaly occurs.

huitema commented 1 year ago

The new code is meant to be use while still limiting the maximum window size. The changes should only be visible on the server side.

On the client side, a large loss event will cause allocation of "nodes" to match the packets delivered out of order when the losses occur. These packets will be counted in nb_data_nodes_allocated, but will only be recycled to the pool when new packets are received that close the gaps, and the data is delivered through the stream API. That will only happen a couple RTT after the event.

huitema commented 7 months ago

Closing this issue, as memory consumption is very much per spec.

private-octopus / picoquic

Limit memory consumption when large windows are used #1505