private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License

Did picoquic have any memory consumption exceptions during my test? #1499

Closed dongsf closed 1 year ago

dongsf commented 1 year ago

It seems that the client side has a lot of memory leaks, and the stream read callback is not triggered properly. The issue is occasional, and I can't find the root cause. My picoquic version is well behind the current branch; I will test again with a newer one.

dongsf commented 1 year ago

ed83cb8a37546ed9.server.qlog.zip

dongsf commented 1 year ago

ed83cb8a37546ed9.client.qlog.zip

huitema commented 1 year ago

The memory consumption of the library is tested by running the test suite with the "sanitizer" options. See for example the github action "CILeaktests", .../.github/workflows/ci-asan-ubsan.yml.

But there is a test gap. All the tests run using network simulations. This provides good coverage for the library, but it does not cover the demo application or the socket code. For now, we have to test those manually. My immediate suggestion would be to run your test with the sanitizers enabled. That would detect the source of the leaks.

Of course, we could add tests to verify the demo application as well as the library. I am not sure how to do that in a way that works as a github action, but a PR adding this test would be very much welcome!

dongsf commented 1 year ago

In the absence of a sanitizer on my client side, I tested by adding qlog. Through this, I discovered that the memory consumption is a result of data loss in the stream queue. Additionally, I noticed that the server side is experiencing a high number of retransmits, which unfortunately exacerbates the data loss problem.

dongsf commented 1 year ago

10ecae863449daee.zip 10ecae863449daee.server.qlog.zip

huitema commented 1 year ago

Thanks for the detailed traces. I think I understand what is happening. This is not a memory leak per se, but for you the effect is probably the same. The amount of memory required grows a lot when many packets are lost. This may well cause adverse effects, such as slowing down the server and causing further losses. I think we should study ways to reduce or limit the memory consumption.

The normal path for packet processing is:

In case of transmission error, this becomes:

So, at that stage, we have almost doubled the memory allocation per packet: the packet copy is kept in the "confirmation of loss" queue, while a copy of the content is kept in the "stream data that should be resent" queue. In your example, there are very many packet losses, so the queue of data to be resent will grow substantially.

This is not a "memory leak" in the classical sense. The code still keeps track of these data. The allocated memory will eventually be freed, as follows:

In most applications, this behavior is not a problem, but your tests show an interesting "failure mode": a very large number of packets are lost in transmission, and the device has limited memory. I am not sure, but I suspect that the increased memory causes some side effect, maybe page swaps or page loads in the virtual memory manager.

You have an interesting application: a rather small device with limited resources. For that application, we need to find ways to limit the amount of memory that the server or clients can use.

The quick fix is to limit the maximum amount of data "in flight". This is easy to implement by placing a cap on the congestion window (CWIN); the memory allocation would then be capped at most at 3 CWIN per connection. This is not ideal, because it might limit performance, and also because memory consumption would still grow if there are many connections.

Another fix would be to rework the way we copy stream data for retransmission, so as to not "double the memory" in case of losses. This would be a bit more complex than the current solution, requiring some fancier memory management, but it seems doable. I opened issue #1504 to track that.

Yet another fix would be to re-examine the "confirmation of loss" queue. We need some way to track spurious losses, but the current way uses a lot of memory. There might be ways to limit this, maybe by looking at time in queue, etc. I opened issue #1505 for that.

Then we can consider some kind of limit "per server" -- maybe limiting the number of packets in flight at the server level. This is probably the best solution for your application, but also the hardest to implement properly.

huitema commented 1 year ago

@dongsf I just prepared PR #1506, which adds an API to limit the maximum value of CWIN. Can you test it? If you are using the test server picoquicdemo, the limit can be set by using the option "-W", as in:

picoquicdemo -W 2500000 -p 4433

If you are using your own server code, you will have to call the API picoquic_set_cwin_max(quic, bdp), setting the BDP value to something compatible with the server's memory resources and the desired data rate. The value 25000000 would work for a connection at 2 Gbps with an RTT of 2 ms.
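For reference, here is a minimal sketch of that call on the server side, assuming the QUIC context has already been created with picoquic_create(); the 2.5 MB cap mirrors the -W 2500000 example above and is only illustrative:

#include <stdint.h>
#include "picoquic.h"

/* Sketch only: cap the congestion window of every connection in this
 * QUIC context, which in turn bounds the per-connection memory used
 * for packets in flight. */
static void limit_server_window(picoquic_quic_t* quic)
{
    const uint64_t bdp = 2500000; /* bytes; size to the memory budget and rate x RTT */
    picoquic_set_cwin_max(quic, bdp);
}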

dongsf commented 1 year ago

PR #1506 is able to prevent unlimited memory consumption. In cases of spurious losses, it still occupies about 100MB of memory, but that is acceptable.

huitema commented 1 year ago

On the 100MB issue, I think fixing #1504 and #1505 will help. We also need to devise an algorithm to limit the overall memory consumption across multiple connections.

huitema commented 1 year ago

@dongsf Issues #1504 and #1505 are now closed. Most of the work was done in PR #1514, which in theory should reduce the peak of memory allocation in your scenario. If you have time, it would be nice to update picoquic, repeat the experiment, and see whether the memory consumption remains reasonable.

dongsf commented 1 year ago

Thank you for your continued attention. I have tested two scenarios:

  1. Merging the modified code onto the old branch;
  2. Directly testing using the new branch.

Both test methods used only a small amount of memory (less than 30MB). However, during long-term testing, I discovered inconsistencies between memory allocation and deallocation. I monitored nb_data_nodes_allocated and nb_data_nodes_in_pool in real time and found that nb_data_nodes_allocated had many more nodes (thousands) than nb_data_nodes_in_pool. These nodes were not released even when I stopped streaming downloads. Because the entire testing process was very long, the device was unable to save qlogs, and the server's qlog was very large. How can I locate these "missing" nodes?

dongsf commented 1 year ago

When the issue occurred, the memory usage also increased. At this point, I can only confirm that some data nodes were not recycled. When I tested continuously for over an hour, nb_data_nodes_allocated was 8000 nodes more than nb_data_nodes_in_pool.

huitema commented 1 year ago

In the presence of packet losses, there is a queue of packets building up at the receiver. The nb_data_nodes_allocated will grow to match the bandwidth-delay product. At the peak, nb_data_nodes_in_pool could be almost zero, while nb_data_nodes_allocated would be several thousand. That part is not surprising.

However, when the transmission stops, the queue of packets at the receiver should gradually drain, while the last packet losses are being corrected. Most will be corrected after one RTT, but there could be secondary losses, etc. All should be corrected by the time the stream is completely received.

The error condition that you are looking at on the client would be:

1) The client has completely received all the data, and 2) there is still a queue of packets, i.e., nb_data_nodes_allocated > nb_data_nodes_in_pool.

You can try modifying some of the tests in the test suite to verify that -- select tests that send lots of data with packet losses, and add a check after the end of the transmission. If the check triggers the condition, we have a repro!

huitema commented 1 year ago

I am puzzled. I added a test of the node allocation in the test suite, so that most tests check the value of the variables:


#if 1
        /* Fail the test if either endpoint still holds data nodes outside the free pool. */
        if (ret == 0 && test_ctx->qclient->nb_data_nodes_allocated > test_ctx->qclient->nb_data_nodes_in_pool) {
            ret = -1;
        } else if (ret == 0 && test_ctx->qserver->nb_data_nodes_allocated > test_ctx->qserver->nb_data_nodes_in_pool) {
            ret = -1;
        }
#endif

All tests pass.

I wonder whether there is something different in the way you test. For example, how many streams do you open, and are these streams closed (FIN received and FIN sent) when you measure the excess memory allocation?

dongsf commented 1 year ago

My stream processing flow is as follows: the client sends a STOP_SENDING message to the server (the stream is a unidirectional stream initiated by the server), and the server resets the stream when it receives the message. After receiving the RESET message, the client does not release the resources of the stream immediately. I looked at the operation of picoquic_delete_stream_if_closed, which requires that picoquic_is_stream_acked be satisfied before releasing. Currently, even after the server has reset the stream, a large amount of stream data is still arriving. I found that the release of input stream nodes mainly relies on active release after the stream data callbacks are triggered. When the client receives a reset but the transfer is not complete, it needs to wait for the stream to complete the transfer and trigger the callback before releasing. In the abnormal case, it is probably because the data after the stream reset was not transmitted correctly. I have not yet fully understood this implementation. The simplest solution would be to delete the stream resources directly after receiving RESET, without checking whether picoquic_is_stream_acked is satisfied.

huitema commented 1 year ago

I will make sure to add that specific scenario in the test suite!

huitema commented 1 year ago

I think that this is fixed in PR #1520. @dongsf, can you check please?

huitema commented 1 year ago

PR #1520 adds a series of fixes. The main one is to properly delete stream contexts when they are no longer needed. This required:

huitema commented 1 year ago

PR #1520 fixes the way stream contexts are deleted. It fixes the test of whether streams are acknowledged, and fixes a regression in the handling of acknowledgement. It also adds an explicit "reset acked" frame, so that if a reset is sent, the stream context is kept alive until the reset is acknowledged -- which makes the transmission of "reset stream" frames robust in the presence of large error rates.

The "stop sending" and "discard" test were modified by increasing the simulated latency and the simulated packet loss rate, which reproduced the issue described here. It also evidenced an issue with repetition of reset stream frames. All these issues are now fixed, and the modified tests pass.

dongsf commented 1 year ago

Sorry for the late reply. I have tested the new code in my own application. There are still some nodes that are not released even after "stop sending" and "reset" have been exchanged. It also results in a connection error with protocol error code 5 in my test.

huitema commented 1 year ago

Can you get a qlog or some other trace? That would help me understand what is going on.

In theory, nodes are released after the stream is closed, such as after a reset. But something may be preventing that. Is the stream closed in both directions?

I fixed an error code 5 bug recently. It was tied to enabling the "preemptive repeat" option. Do you know whether that option was enabled in your build? Also, which version of picoquic is in your build?

dongsf commented 1 year ago

I tested an early version 1.04b and the latest version 1.1.9.3, and both have similar issues. I even added a forced release of stream content outside of picoquic and counted the number of stream nodes and packets, which were normal. So the problem may be happening elsewhere. I captured a log at an abnormal time point, which occurred during the transmission of stream 319, around 269 to 275 seconds. At this time, there was a significant amount of memory in use while the client received the stream 319 data, and it was not released afterwards. My client-side device cannot store many qlogs, so I only have the server-side log here. 54b90253503eebf0.server.qlog.zip

huitema commented 1 year ago

Thanks for the trace. 290MB! That's big indeed.

What strikes me is the delay graph: [image]

At some point, we have an RTT of over 8 seconds, and a huge number of packet losses. There are no internal queues in picoquic, so these delays have to happen somewhere in the network. Probably some large buffers in a simulator. That's a very unusual configuration.

I would expect these long delays and this large number of packet losses to significantly increase the memory usage, both on the server with packet queues (elements of type picoquic_packet_t) and on the client with node queues. On the client side, the node queues should be freed after the reset is sent by the server and received by the client. But somehow this does not happen.

I see you mention stream number 319. That's a unidirectional stream created by the server. There may be an issue there, as my tests mostly use bidirectional streams. I will verify that code path, make sure that the RESET is handled correctly.

I did fix an "error code 5" issue in version 1.1.9.4. The latest release is 1.1.10.0, with support for the emerging Multipath QUIC specification.

It is late now, I will investigate further tomorrow.

huitema commented 1 year ago

I am looking at the trace, and I am noticing that you are using multipath. The endpoints negotiate multipath, the client sends acknowledgements on both paths, but client and server apparently send data on only one path. It seems that the server did not perform the connectivity check on the second path. This is not related to our issue, but it may be an issue in the multipath implementation, or maybe in the multipath usage at the client.

The application protocol involves exchanging data on two streams: a bidirectional stream started by the client, probably to send some kind of request message, and a unidirectional stream started by the server, probably to send the requested amount of data. Only the client sends data on the bidirectional stream. That's unusual; similar applications just use one bidirectional stream started by the client, with the server sending its responses on that same stream. That should not be an issue, except that unidirectional streams are less well tested than bidirectional streams.

The test protocol seems to alternate between short streams, a few kB long, and long streams -- I see 5 streams with final offset between 10MB and 100MB, versus a couple hundred short streams. This means the application will often make very little demand on the network, and increase it to a high value on rare occasions. That may explain the drop in the sending rate that we observe. BBR maintains a sliding window of the data rates observed in the last 6 RTTs, and picks the largest of those. If the data rate remains small for more than 6 RTTs, it may end up selecting a low bit rate. That would be a bug in picoquic's BBR implementation, because the spec says to take the largest of the last 6 RTTs in which the bandwidth was fully used. I will need to verify that.
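As a rough illustration of that mechanism (a conceptual sketch, not picoquic's actual BBR code), the bandwidth filter keeps one delivery-rate sample per round trip and reports the maximum of the last six; per the remark above, the spec would additionally require skipping rounds in which the bandwidth was not fully used:

#include <stdint.h>

#define BW_FILTER_LEN 6  /* number of round trips remembered */

typedef struct {
    uint64_t samples[BW_FILTER_LEN]; /* delivery rate per round trip, bytes/sec */
    int next_index;
} bw_filter_t;

/* Record the delivery rate measured over the round trip that just ended.
 * The spec would skip this step when the round was application-limited. */
static void bw_filter_on_round_trip(bw_filter_t* f, uint64_t delivery_rate)
{
    f->samples[f->next_index] = delivery_rate;
    f->next_index = (f->next_index + 1) % BW_FILTER_LEN;
}

/* Bandwidth estimate: the largest sample seen in the last 6 round trips. */
static uint64_t bw_filter_estimate(const bw_filter_t* f)
{
    uint64_t max_rate = 0;
    for (int i = 0; i < BW_FILTER_LEN; i++) {
        if (f->samples[i] > max_rate) {
            max_rate = f->samples[i];
        }
    }
    return max_rate;
}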

The RESET_STREAM frames include a final offset parameter, which indicates the largest amount of data sent on the stream. For stream 319, that value is 34483370. I will check that this doesn't cause the server to keep sending copies of old frames with a lower offset. (Update: it doesn't. There is no frame with stream-id = 319 sent after that reset. This is not an issue.)

The existing tools do not deal well with large logs. I will have to write a couple of Python scripts to analyze the data in detail, maybe looking at exactly what happens with stream 319.

huitema commented 1 year ago

After parsing the qlog file, I see that the "reset_stream" frame with "stream_id = 319" was sent once at T=281935147, was not acknowledged by the client, and was not repeated by the server. That means the client never received it, and thus never closed stream 319 and never released the nodes. I suppose that's the root cause of what you are seeing.

dongsf commented 1 year ago

I didn't merge the reset_ack code into my test code (1.04b). To ensure that the nodes in the stream are properly released, I forcefully delete the stream a few seconds after sending stop_sending. I also added some printf logs to monitor the overall number of nodes, packets, and nodes in the memory pool of the quic context in real time, and these values can return to normal (8912), so the memory leak may occur elsewhere. I will continue testing on the new version and send the client's binary log over the network to work around the client being unable to store a large number of qlogs. I will sync with you later on the new test results.

dongsf commented 1 year ago

Regarding your observation that data is sent on only one path during multipath transmission: the client's configuration should be normal. When I tested before, there was data on both paths.

huitema commented 1 year ago

The reset ack code is pretty important in practice, because without it we can have one end believing the stream is still there and the other believing it is gone. For multipath, we have an interop test tomorrow before the IETF, so I will see whether the situation repeats.

I added a test that the reset is properly repeated, and it passes. But that depends on up-to-date code...

dongsf commented 1 year ago

I have tested with the latest code, and here is a binary log. The issue occurred during the transmission of stream 411, around 565 seconds, where the callbacks for the stream received by the client suddenly stopped triggering and a large amount of memory was being used. Even after stream 411 was released, the system still showed the memory as in use. I also added some printf logs, and the number of allocated nodes was 8192. There are some logs that I added myself to track memory usage, which can be ignored. b15cfbfcf60ad4a6.zip

dongsf commented 1 year ago

I noticed that when stream 411 on the server side received stop_sending, the server kept transmitting data for a few seconds, which is abnormal.

huitema commented 1 year ago

I am busy attending the IETF this week -- remotely, online, but still busy. I will try again to reproduce your scenario. Yes, the retransmissions should have stopped immediately. In the previous log that you sent, they did, but maybe there is some internal delay happening. I will try to understand why.

dongsf commented 1 year ago

Thank you for your continued attention. The CPU performance of my client is very limited, and it needs to handle the reception and transmission of QUIC and other tasks within a single thread. I have also not enabled GSO (the client does not support it). In this case, it feels like there are some distortions in the RTT calculation and flow control, which often leads to the client caching tens of thousands of stream nodes (some early packets cannot be received). When this happens, there is often some unreleased memory, even though the nodes have already been released (which I can tell by printing the number of nodes). This is a very strange issue, because I have not found any other places where memory is being allocated.

huitema commented 1 year ago

I think that's the root cause of the problem: the client is limited, the server overpowers it, and we end up exploring a lot of rarely used code paths. In the short term, the client should limit the server by using flow control. This requires the client-side equivalent of the "max window" parameter, to set the max flow control window. I will try to do that this week.

In the long term, we should verify that picoquic is robust and make sure that all those "rarely used code paths" are well tested. This requires adding a simulation of a performance-limited client, writing a series of tests that explore its behavior, and probably finding or fixing bugs. I don't know how many, probably half a dozen given the size of the code and past experience. I want to start that soon, because it is important that picoquic runs well on small devices, but it needs a couple of weeks to complete.

For the simulation, it might be good to specify in which way the client is limited. Is it CPU, limiting for example the maximum number of messages per second, or is it memory size?

huitema commented 1 year ago

Actually, the "flow control" API is already there in picoquic.h:

/* picoquic_set_max_data_control:
 * set a maximum value for the "max data" option, thus limiting the
 * amount of data that the peer will be able to send before data is
 * acknowledged.
 * This option can be used to control the amount of memory that the
 * receiver will use to reorder received data frames. This control is
 * indirect: the receiver always allocates a full packet size for incoming
 * packets, even if they are small. In tests, we see the allocated memory
 * per connection vary between 2 and 3 times the max data value.
 * Setting the value to 0 (default) means that the "max data" limit
 * will rapidly increase to let transmission proceed quickly.
 */
void picoquic_set_max_data_control(picoquic_quic_t* quic, uint64_t max_data);

Could you try setting that on your client? I understand that defining the exact correct value is difficult, but "max packets per second times max packet size" should be a reasonable starting point, with "max packets per second" corresponding to what the client can reasonably process.
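For illustration, a minimal sketch of that suggestion on the client side; the packets-per-second figure below is hypothetical and should be replaced with a value measured on the actual device:

#include <stdint.h>
#include "picoquic.h"

/* Sketch only: limit the "max data" flow control credit on the client's
 * QUIC context so the server cannot send faster than the client can
 * drain its receive queue. */
static void limit_client_flow_control(picoquic_quic_t* quic)
{
    const uint64_t max_packets_per_second = 500; /* example figure; measure on the device */
    const uint64_t max_packet_size = 1500;       /* typical MTU-sized packet */
    uint64_t max_data = max_packets_per_second * max_packet_size;

    picoquic_set_max_data_control(quic, max_data);
}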

huitema commented 1 year ago

I am writing a simulation of the "limited client", to watch how different congestion control algorithms manage a slow client. The first issue that I see in these simulations is "batching". Since the server is sending faster than the client can read, the client's UDP socket queue is always full. The original simulation code followed a naive strategy of "read everything that's available, then see if there is something to send". The client can spend a very long time reading a batch of packets before sending the ACK that will inform the server of packet losses. The simulation was fixed to break up those batches. This batching issue was already fixed in the default packet loop, sockloop.c, used in picoquicdemo with the API described in picoquic_packet_loop.h.
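For illustration, here is a conceptual sketch of the "break the batch" strategy (not the actual sockloop.c code; the helper functions are placeholders):

#include <stdbool.h>

/* Placeholder helpers standing in for the real socket and QUIC calls. */
static bool recv_one_packet(void) { return false; }
static void feed_packet_to_quic(void) { }
static void send_pending_packets(void) { }

#define MAX_READ_BURST 16

static void limited_client_loop_iteration(void)
{
    int nb_read = 0;

    /* Read at most MAX_READ_BURST packets from the socket... */
    while (nb_read < MAX_READ_BURST && recv_one_packet()) {
        feed_packet_to_quic();
        nb_read++;
    }

    /* ...then let the stack send ACKs and retransmissions before reading
     * more, so loss feedback to the server is not delayed by a long read batch. */
    send_pending_packets();
}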

There is probably a second issue in your tests: the client is sending a series of small requests, and the responses are processed immediately. The queues are cleared quickly, so the server ends up believing all is fine. But I think there may be a bug in the implementation of Hystart, causing the congestion window to keep growing and reach very large values. When the client finally sends a large request, the extra-large window could allow the server to send a very big batch of packets, flooding the client and causing a big congestion event. I will try to also simulate that.

huitema commented 1 year ago

I have now completed a test with the following parameters:

1) Bandwidth set to 100 Mbps
2) Latency set to 50 ms per link (RTT = 100 ms)
3) Server congestion control set to BBR
4) Client programmed to spend 2 ms processing each packet that it sends or repeats

The log shows nominal behavior. This is a visualization of the QLOG traces of the server: [image]

BBR properly discovers that the client is limited, and stabilizes at a packet rate that matches the processing rate of the client. I do not observe any memory leak.

That said, there is still an issue of packet losses, in particular when using Cubic. A simple test of downloading 2MB on this 100Mbps network with Cubic shows 121 packet losses, compared to 31 for Reno and 46 for BBR. If we want to reduce these losses, we probably need to implement a stricter form of flow control.

huitema commented 1 year ago

Flow control can limit the number of packet losses caused by the limited client, but setting the flow control parameter correctly is tricky. In one test, we set the max_data parameter at the client to 57344 bytes.

The server is running Cubic. In the same conditions, the download without flow control completed in 4.1 seconds, but we saw 121 packet losses. The following table shows how packet losses and download time vary with different settings:

Max data      Losses   Download time (s)
not limited   121      4.07
131072        70       4.05
98304         21       4.06
65536         4        4.6
57344         0        5.1
49152         0        5.8
32678         0        8

The relation between max data, packet losses, and download efficiency depends on variables like the maximum size of the UDP socket queue, the RTT, the underlying data rate, and the time it takes to process a packet. These variables are well known when setting up a simulation, not so much when running on a real network. The packet processing time could be measured in the lab, but the other values are variable.
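As a rough back-of-the-envelope check of these numbers (my own calculation, assuming 1500-byte packets and the simulation parameters above: 2 ms of client processing per packet and a 100 ms RTT), the client-limited bandwidth-delay product lands close to the 57344-98304 range where losses drop off:

#include <stdio.h>

int main(void)
{
    const double packet_processing_s = 0.002; /* client spends 2 ms per packet */
    const double packet_size_bytes = 1500.0;  /* assumed MTU-sized packets */
    const double rtt_s = 0.1;                 /* 100 ms round-trip time */

    /* The client can absorb about 750 kB/s... */
    double client_rate = packet_size_bytes / packet_processing_s;
    /* ...so the effective bandwidth-delay product is about 75 kB. */
    double effective_bdp = client_rate * rtt_s;

    printf("client-limited rate: %.0f bytes/s, effective BDP: %.0f bytes\n",
           client_rate, effective_bdp);
    return 0;
}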

While running these simulations, we applied two fixes to the code:

1) When the max_data parameter is set, advance the flow control credit in increments of 1/4 of the maximum value, instead of 1/2. This is in order to deal with server implementations that could send the whole increment as a batch of packets and create instant congestion in the receiver queue. The value 1/4 is arbitrary; we may need an even smaller value when max_data is very large.

2) Make sure that setting max_data also sets the initial value in the default transport parameter of the QUIC context and in the transport parameter of all the client connections that are not yet started.

dongsf commented 1 year ago

Cubic may have a bug in my scenario. It caused stream_max_data to become much larger than max_data. I have added a limit on the stream_max_data frame, and the memory usage goes down. But there are some other issues while testing the new code. This issue should be closed.