sccn / liblsl

C++ lsl library for multi-modal time-synched data transmission over the local network

Prevent memory leak at high throughput #71

Closed jfrey-xx closed 2 years ago

jfrey-xx commented 4 years ago

So, during my various tests in #70 I found out that in some situations the memory of the program pushing samples would grow more and more, consuming all the RAM until the process was killed by the kernel to prevent a total collapse of the system (still talking about Linux platform here).

This issue occurred when samples were being pushed as fast as possible (the "unlimited" situation described in that other pull request). One could argue that this is a limit case, that the software is not supposed to push that many samples, but I do not think a library should let misuse provoke crashes.

At first I thought there was a memory leak, some memory not freed, but as it happens this behavior is to be expected. From my understanding (which is still fuzzy, I'm sure, despite the many hours spent diving into the code), the factory class (from sample.h) handles the buffer that is used to create new samples when data is pushed. There is a fixed (circular) buffer which is pre-allocated (storage), whose size depends on the sampling rate and on the configured buffer size. However, if too many samples are created, the buffer is likely to be full and new samples are then created by allocating new memory, outside of the storage (e.g. in factory::new_sample()). The problem is that once these samples are "consumed" (they have been sent/popped in consumer_queue), the memory is not freed (delete is not called). Instead, samples are "reclaimed" and added back to the factory's circular buffer (this occurs when the corresponding smart pointer intrusive_ptr has no more references to the sample). The advantage is that the more samples there are, the less likely the buffer is to run empty again; the drawback is that in a situation where samples keep piling up (as with a naive program sending as fast as possible), memory keeps increasing until the process is killed.
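To make the mechanism concrete, here is a minimal sketch of that recycling path, with simplified, made-up names -- it is not the actual liblsl code, just the behavior as I understand it:

```cpp
// Simplified sketch of the recycling behavior described above; NOT the real
// liblsl code, all names and types here are illustrative only.
#include <atomic>
#include <deque>

struct sample;

struct factory {
    std::deque<sample *> free_list; // stands in for the preallocated ring buffer
    // Recycle the sample into the pool so new_sample() can hand it out again.
    // Note: no delete, even for samples that were heap-allocated because the
    // preallocated storage was exhausted -- which is why memory never shrinks.
    void reclaim_sample(sample *s) { free_list.push_back(s); }
};

struct sample {
    factory *owner = nullptr;
    std::atomic<int> refcount{0};
};

// Called by the intrusive smart pointer when a reference is added/dropped.
inline void intrusive_ptr_add_ref(sample *s) {
    s->refcount.fetch_add(1, std::memory_order_relaxed);
}
inline void intrusive_ptr_release(sample *s) {
    if (s->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        s->owner->reclaim_sample(s); // recycle into the pool, don't delete
}
```

The point being that the release path never calls delete, so any memory grabbed beyond the preallocated storage stays attached to the factory for the lifetime of the outlet.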

You know all of that, and better, but I had to write it down to organize my thoughts a bit, as it was at times difficult to wrap my head around the code, even though in the end it made sense. It might help others to grasp the logic, and it is also a way to highlight that there is indeed a behavior that was not anticipated, since the explicit delete is never called (or at least I did not find in which situation it would be).

So, naively, I thought that all it took was to actually call this delete in case the sample was allocated outside of the factory's storage; for instance, to call delete in sample::intrusive_ptr_release (the method called each time there is one less smart pointer pointing to the sample). While it worked for some time, ranging from seconds to minutes, with a high throughput the program would crash due to a "double free" error. Digging deeper, it turns out that the spsc_queue used to implement consumer_queue's buffer is not thread safe: as per the doc, two concurrent threads should not pop elements from it. As a matter of fact, in LSL there is a race condition when too many samples are being pushed/pulled: at some point the code in consumer_queue::push_sample() that is supposed to dispose of the oldest sample when the buffer is full collides with another thread calling consumer_queue::pop_sample(). As a result, the exact same sample can be popped twice at the same time, temporarily creating two smart pointers intrusive_ptr pointing to it in two different threads; those two pointers can go out of scope at the same time, reaching a situation where refcount_ goes from 1 to 0 twice upon release in intrusive_ptr_release, hence two deletes and other messy things. What a climax!
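To illustrate the contract violation (my own toy example, not liblsl code): boost::lockfree::spsc_queue allows exactly one pushing thread and one popping thread, but the pattern below has the pushing thread also popping when the queue is full, i.e. a second consumer:

```cpp
// Toy illustration only (not liblsl code): two threads pop from a
// boost::lockfree::spsc_queue, which the documentation forbids (single
// producer, single consumer). With sample pointers, two concurrent pops of
// the same element can lead to two releases and hence a double free.
#include <atomic>
#include <boost/lockfree/spsc_queue.hpp>
#include <thread>

int main() {
    boost::lockfree::spsc_queue<int> q(1024);
    std::atomic<bool> done{false};

    // Producer thread: pushes, and pops the oldest element whenever the
    // queue is full -- this makes it a second consumer.
    std::thread producer([&] {
        for (int i = 0; i < 1000000; ++i)
            while (!q.push(i)) {
                int oldest;
                q.pop(oldest); // concurrent with the consumer's pop: UB
            }
        done = true;
    });

    // Consumer thread: drains the queue as fast as it can.
    std::thread consumer([&] {
        int v;
        while (!done.load()) q.pop(v);
    });

    producer.join();
    consumer.join();
}
```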

So not only is spsc_queue not thread safe, but it seems that intrusive_ptr is not totally thread safe either, even when I tried to use the ad-hoc intrusive_ref_counter class (with its over-optimistically named "thread_safe_counter" policy) to handle the "add" and "release" methods. I reckon I am being too specific here; just know that I tried quite a few things to avoid a mutex. I also considered switching the implementation from spsc_queue to queue, the latter being thread-safe according to the doc, but unfortunately its requirements cannot easily be fulfilled at the moment (e.g. it would be necessary to find an alternative to intrusive_ptr, and it seems difficult to have a "trivial" destructor with the way memory is currently allocated for sample).

In the very end, I went with a mutex at the level of consumer_queue, to ensure that the buffer cannot be popped twice at the same time and, overall, to guarantee buffer consistency. Problem solved, and the least intrusive solution. Note: it is true that this fix only prevents too much memory from being allocated beyond the factory's storage; if said storage is too big, for example due to a high declared sampling rate, the buffer might still try to allocate more memory than available. Not sure how to tackle this one, though, as ideally the buffer size should be modulated at run time. One step at a time.
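For reference, the shape of the fix is roughly the following (a simplified sketch; the actual patch differs in the details): the "discard oldest when full" path and the regular pop take the same lock, so the underlying spsc_queue never sees two concurrent consumers.

```cpp
// Rough shape of the fix (simplified sketch; the actual patch differs in
// details): a mutex serializes the two code paths that pop from the queue.
#include <boost/lockfree/spsc_queue.hpp>
#include <cstddef>
#include <mutex>

template <typename T> class locked_consumer_queue {
    boost::lockfree::spsc_queue<T> buffer_;
    std::mutex pop_mutex_;

public:
    explicit locked_consumer_queue(std::size_t capacity) : buffer_(capacity) {}

    void push_sample(const T &sample) {
        while (!buffer_.push(sample)) {
            // Buffer full: drop the oldest element. This pop must never run
            // concurrently with pop_sample(), hence the lock.
            std::lock_guard<std::mutex> lock(pop_mutex_);
            T oldest;
            buffer_.pop(oldest);
        }
    }

    bool pop_sample(T &result) {
        std::lock_guard<std::mutex> lock(pop_mutex_);
        return buffer_.pop(result);
    }
};
```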

Because, as a matter of fact, I also used a mutex in #70, this pull request is based on the same code. I hesitated to base the code on the current master branch, i.e. add the lock around the current polling method, but I thought that it would then make it more complicated to merge both pull requests. However, if for some reason you do not merge #70 but still accept the mutex to fix the "leak", I will change the code so it can be merged to master.

I did the same benchmark as in #70, comparing CPU consumption on different machines and sampling rates. I can share the results here if you want, but in a nutshell I did not observe any impact on performance between this code and the other pull request (there is, after all, only one additional "lock" when an old sample needs to be disposed of).

jfrey-xx commented 4 years ago

Note: this branch also resolves issues with the new benchmarks, e.g. lsl_bench_exported "pushpull - float" which provokes a SIGABRT with error double free or corruption.

jfrey-xx commented 3 years ago

Hello,

Is there any chance that this pull request could make it to master before 1.14 is released? I am now using this code in production, a use-case where I have on a Pi4 about 10 outlets and as many inlets at the same time, with very little strain on the CPU or on the memory, something impossible without the changes. Just in case, I synced with the latest master branch :)

As usual, if I can do something to ease the integration, I'll try to do my best!

chkothe commented 3 years ago

Hey Jeremy, thanks again for submitting this, much appreciated -- looks like very thorough work, but it'll take me a while to fully vet. As I'm sure you know, safe memory reclamation in a multi-threaded context (esp. lockfree, as it was) is difficult enough that you can get a paper accepted at ACM SIGACT. :)

Tbh I don't think we can (or should) do it for 1.14 since that release is already in the 4th beta, but the next thereafter could be a target to resolve this area.

jfrey-xx commented 3 years ago

Thanks for the update! I thought that the various unit tests would be enough to catch any rogue behavior, but even if not many lines changed I understand that you want to be extra cautious with this part of the code -- any regression here could be quite harmful indeed.

Actually to me the most important feature would be the reduction in CPU brought by the (tiny) changes in #70 (that are also included in this branch), so I'll still cross my fingers for that one ;)

tstenner commented 3 years ago

> Note: this branch also resolves issues with the new benchmarks, e.g. lsl_bench_exported "pushpull - float" which provokes a SIGABRT with error double free or corruption.

That's because the queue assumes a single consumer (the SC part in SPSC), but the producer removing samples until there's free space to enqueue new ones makes for a second consumer, so there's a race condition. This either results in a double free or an infinite loop upon free()ing the memory pool. I haven't heard about this happening outside of the (very synthetic) unit test, though.

> Is there any chance that this pull request could make it to master before 1.14 is released?

The only way I could see this happening is behind an #ifdef for the dev-speed branch. On the raspi it's a clear win, but at least on one of the computers I tested it on the throughput dropped quite a bit.

jfrey-xx commented 3 years ago

Memory: I first saw a problem when I was pushing samples without limiting the sampling rate (e.g. no "delay" in the loop dealing with the outlet). I guess it could happen to a few people out there, but it's indeed a limit (and improper?) use-case.

CPU: I would be very fine at the moment with a flag that I could set at compile time :) It would also make sense to add a condition so that if the declared sampling rate is >~ 1000 Hz the regular code is used, whereas if the sampling rate is less than that (especially below 100 Hz, or with marker streams), then the proposed code is used instead. More flexible, more cumbersome :man_shrugging:

chkothe commented 3 years ago

On that note, I happened to look at #70 yesterday and just put a comment in that thread.

chkothe commented 3 years ago

Hey @jfrey-xx so I finally got around to diving into this PR, sorry it took me so long! (busy times...) First of all, I'm glad someone caught the issue where the consumer_queue ceases to be thread safe when the buffer size limit is reached (actually surprised that no one apparently managed to trigger it before). Since the fix for that meshes so well with the timed wait to reduce CPU load and latency, I think it would be best to address that problem along with getting #70 merged.

Now on to the memory issue. Let me briefly recap how the sample memory management is supposed to work. LSL uses an arena allocator and has a preallocated memory pool from which allocation/deallocation are very fast. The size of that pool can be set by the user in the lsl_api.cfg, and it's by default quite small (5 seconds worth of data if you have regular sampling rate, or 128 samples if irregular). This is because some use cases involve giant samples (eg uncompressed video) and so we don't want to reserve unnecessary amounts of memory. Rather, the pool is sized to be large enough that it doesn't get exceeded in typical business-as-usual operation, even if there are some few-second stalls on the network or the receiving app.

Now, as you learned the hard way, the library can allocate (much) more memory than that initial buffer, and the limit for that can be configured by the sender (at outlet construction) and overridden by the receiver (at inlet construction), but very few applications do that. The default limit is 6 minutes of data if you have a regular sampling rate, or 36k samples if irregular (per outlet), but applications can choose to go as low as a few seconds of data if they want. Did you try adjusting that limit to see if that limits your memory use? Btw. the rationale for that default was that, when you're recording, you don't want to lose any data due to things like the WiFi router rebooting or other lengthy network interruptions, but it can also help when the receiving program is slowed down by a big background CPU load spike -- so the buffers need to be able to grow large enough to accommodate that kind of stall. Now, normally LSL should not allocate more samples than that (because the consumer_queue will discard the oldest samples if you have more than that), and we'll want to verify that that's the case (I'll get back to that).
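For concreteness, that limit corresponds to the max_buffered argument of the outlet constructor (and max_buflen on the inlet) in the public C++ API; here's a quick sketch of how an application could cap it at, say, 10 seconds (the stream name and uid below are just placeholders):

```cpp
// Sketch: capping the buffers at ~10 seconds of data instead of the default
// 360 seconds, using the C++ API (the stream name/uid are placeholders).
#include <lsl_cpp.h>
#include <vector>

void make_bounded_stream() {
    lsl::stream_info info("HighRate", "EEG", 8, 100000, lsl::cf_float32, "hr_uid");

    // Sender side: 3rd constructor argument is max_buffered, in seconds of
    // data for streams with a regular nominal rate.
    lsl::stream_outlet outlet(info, /*chunk_size=*/0, /*max_buffered=*/10);

    // Receiver side: the inlet can override it again via max_buflen.
    std::vector<lsl::stream_info> found = lsl::resolve_stream("name", "HighRate");
    lsl::stream_inlet inlet(found[0], /*max_buflen=*/10);
}
```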

Now what most likely happened on your end would be the following: if you push samples as fast as the thread can do it, most likely that will fill up that buffer to its default 6-minute limit, because the network transmission is going to be slower than you can construct and push in samples. This can easily exceed the memory on a Raspberry Pi if the samples are more than a few channels (what was your channel count, channel format, and declared nominal sampling rate again?). The result of the send rate exceeding the network transmit limit is that the data recipient will a) get samples that are a few minutes old (the tail of a very large buffer) and b) lose a significant number of samples, so that wouldn't make sense either for a real-time use case or for a recording use case, i.e., it's purely theoretical -- it's good that it revealed the crash bug though!

Last bit: let's go over the memory reclamation aspect of the PR now. So when you destroy the outlet object, it will deallocate all samples that it allocated, i.e., you should see that all your memory is freed properly. So there wouldn't be a memory leak in the usual sense (forgot to call delete), unless there's some unaccounted-for memory use (that's why I want to make sure we do calculate how much was accounted for by the buffer). The more interesting question is what should happen when the buffer has grown beyond the initial 5 seconds and samples are no longer needed and deallocated -- should they be freed to the OS (via delete) or should they be recycled into the pool for later use (until the outlet is destroyed)? Or should there be a flag to control it?

The current implementation recycles those samples into the pool, and that was for the following reasons: deallocating those excess samples to the general-purpose OS allocator is a lot slower than using the lockfree allocator, and if the buffer use stays for a while above that 5-second limit, or repeatedly spikes above it, new samples have to be allocated again repeatedly from the OS (again at much higher cost). So that would mean that the buffering would be a fair bit slower once you're hovering above the magic 5-second mark, and the max throughput would go down with it. Worst case, if the backlog was due to excessive load, that situation would be made worse by hitting the OS allocator twice with each sample, and it could end up in a regime where it can't catch up to real time anymore and will eventually hit the max buffer limit. So instead of doing it that way, the library assumes that "if the application/setup hits a certain amount of buffered backlog once, chances are that it'll need it again later" (whatever the cause) and thus it recycles the allocated samples to the pool, so that, if the network or application ever again stalls, allocating those samples from the reserved pool (and deallocating them to it once no longer used) is extremely fast and the throughput remains steady, i.e., only the initial load spike comes with more overhead. At the end, once the application no longer needs the outlet and destroys it, the whole buffer is deleted.

Now to be fair, some people's applications run for days, and if there was one massive stall once, the app will keep the memory reserved from then on until it's shut down. So that could be an argument in favor of having an lsl_api.cfg switch to control that (not sure if anyone really runs into that problem and needs the switch, but it could be useful and certainly makes it more controllable...).

Now with all that covered, there's still the possibility that there's a true memory leak somewhere else in the system, so maybe we should do some more benchmarks to probe the behavior at the throughput limit.

edit: I found the specs for your test case in the other PR, and yes, at 100000 Hz x 360 seconds x (8 channels x 4 bytes + 32 bytes of overhead IIRC), that comes out at a max limit of 2.3GB of memory for the buffer, so that'd fill up the Raspberry Pi several times over. I guess that would explain how the system ran out of memory.

jfrey-xx commented 3 years ago

Thanks for the very detailed explanation! I did not get that the buffer could be re-allocated somewhere else, for another outlet -- nor that there were different locations where the buffer size was set. Very useful to really get to know LSL's inner workings :)

In the issues I encountered, the "memory leak" occurred first on my desktop machine (i7-7600U), with 8GB of RAM. I don't recall how much I had free at the time, but more than half of it, so in theory enough to hold the 2GB+ buffer, yet the RAM allocated by the process kept growing and growing until the kernel killed it. Since with the fix in this branch the memory stops growing at some point, presumably when the 6-minute buffer is reached, I would say that there is definitely a leak with the original code.

In #70 the "bad alloc" error I got specifically on the Pi is related to something else, probably the default 5s buffer (or whatever buffer is first allocated) immediately filling the heap, hence why I had to reduce to 10 kHz for the example to even run (and then the same behavior: without the fix the memory keeps growing until the kernel steps in).

chkothe commented 3 years ago

I had an opportunity today to replicate your test case (besides a few others).

So I'm using 8 channels float32, 100KHz declared sampling rate, chunk size of 1000, a thread that pushes data into an outlet as fast as the outlet can take it, and another thread that reads from the inlet as fast as it can (that's on Ubuntu 20.04 on a laptop).
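For reference, here's a simplified sketch of that harness (not the actual benchmark code; the stream name and uid are placeholders):

```cpp
// Simplified sketch of the test setup described above (not the actual
// benchmark code; the stream name/uid are placeholders). Runs until killed.
#include <lsl_cpp.h>
#include <thread>
#include <vector>

int main() {
    lsl::stream_info info("bench", "EEG", 8, 100000, lsl::cf_float32, "bench_uid");
    lsl::stream_outlet outlet(info, /*chunk_size=*/1000);

    // Push thread: feed the outlet as fast as it will take samples.
    std::thread pusher([&] {
        std::vector<float> sample(8, 0.0f);
        for (;;) outlet.push_sample(sample);
    });

    // Pull thread: drain the matching inlet as fast as possible.
    std::thread puller([] {
        std::vector<lsl::stream_info> found = lsl::resolve_stream("name", "bench");
        lsl::stream_inlet inlet(found[0]);
        std::vector<float> sample(8);
        for (;;) inlet.pull_sample(sample);
    });

    pusher.join();
    puller.join();
}
```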

What I'm observing, using the thread-safe version of the consumer_queue, is that the memory grows linearly until about 3.8GB are reserved (and a total of about 4.5GB of virtual memory have been allocated), and then the memory no longer increases (at all).

In terms of data rates, the pull_sample thread can't keep up with the push_sample thread, and therefore buffers start filling up. Specifically, the buffer on the outlet ultimately hits its max capacity, which at 6 minutes of 100KHz (declared) sampling rate comes out at 36 million samples or 2.9GB (I had mis-calculated it originally). It then remains at that size and doesn't grow any further. Now the reason why the actual usage at the end stops at 3.8GB is that in addition to the allocated samples, there are also the pointers to those samples, which is what's stored in the consumer_queue (a fixed-size ring buffer of the max buffered length, i.e., in your case 36M pointers at 8 bytes each), and if you have both inlet and outlet on the same machine, that gives two of these large consumer queues, which comes out at an additional 576MB, so the sum total is then ~3.4GB. Note that on the inlet side, the buffer never fills up because the receiving thread pulls out the data faster than the network can bring new data in. Someone would have to go search where the other 400MB comes from, but some of it is probably the 5-second buffer that's pre-allocated on the inlet and maybe some more buffers used by asio, the C++ runtime, and so on. One other phenomenon besides the send buffer overflowing is that the ring buffers are initially allocated only in virtual memory, and only once the read/write cursors have made it across the whole buffer will it all be paged into the process (so that's when the memory growth stops).

Note that this is without the extra delete call in this commit.

So in a nutshell, that does explain why your laptop ran out of RAM, but it must have been a fairly close call. It also shows that the 6-minute default, while neat in theory, can cause some quite unexpected behavior at high data rates (it's basically the equivalent of holding a 6-minute recording in RAM, plus some overheads) and that may be a reason to lower that default in a future version, or at least cap it at a fixed MB limit that can be set in the API config file.

In terms of what kind of data rate can actually be sustained, on my machine the fastest rate at which I can send and receive 8-channel float data without the inlet falling behind is ~3MHz with the lock-based queue (and 4.2MHz with the sleep-based one). In that case, there's never more data allocated than the initial 5-second sample pool (plus the 360-second ring buffers of pointers). In contrast, the max rate at which push_sample can be called was ~4.8MHz with the lock-based queue (and ~8.2MHz with the sleep-based one), but as we saw, the library can't transmit data that fast.

jfrey-xx commented 2 years ago

[I realize only now that this is still a pending PR, even though the relevant code is either already merged or outdated. I'm closing that]

cboulay commented 2 years ago

@jfrey-xx , thanks very much for the cleanup. BTW, as someone who is interested in high throughput, I'd love to get your thoughts on the synchronous-push effort. Some discussion here and here, but the latest version of the source is in @tstenner's branch: https://github.com/sccn/liblsl/compare/master...tstenner/outlet_sync (edit to fix last link)

jfrey-xx commented 2 years ago

Actually I did follow these discussions, but I'll admit that they got a bit too low-level for me to really grasp the pros and cons. I'll try to think of some use-cases I could have to test that properly (especially on low-end systems) and I'll poke you there :)