kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

Potential memory leak #75

Closed: vladar closed this issue 2 years ago

vladar commented 2 years ago

I am investigating a memory leak in one project and suspect that it might be caused by lmdb-store somehow.

The leak is not happening in V8 - it occurs somewhere in native C++ code (I see large RSS values with a normal total V8 heap).

Also, I realize that when dealing with LMDB we should look at "dirty" pages only, so I do not rely on RSS readings alone; I use pmap -x to monitor the actual "dirty" memory of the Node process.

So, I am sampling memory in Node and see the following results:

heap total: 903 MB; external: 923 MB (buffers: 920 MB)

Up to a point, heapTotal + external matches the dirty value returned by pmap pretty closely. But then at some point I see this:

heap total: 1.0 GB; external: 173 MB (buffers: 166 MB)

So essentially Node releases about 750 MB of buffers, but the dirty value from pmap doesn't change, and neither does RSS. Eventually the process OOMs.
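For reference, the numbers above come from a sampling loop roughly like the sketch below (simplified, not the exact code; the fields come from process.memoryUsage(), and the pmap invocation mirrors what I run by hand):

```js
const { execSync } = require('child_process');

// Periodically log V8/external memory and compare it with the "dirty"
// total reported by `pmap -x` for this process.
function sample() {
  const { heapTotal, external, arrayBuffers, rss } = process.memoryUsage();
  const mb = (n) => Math.round(n / 1024 / 1024);
  console.log(
    `heap total: ${mb(heapTotal)} MB; external: ${mb(external)} MB ` +
      `(buffers: ${mb(arrayBuffers)} MB); rss: ${mb(rss)} MB`
  );
  // The last line of `pmap -x` output is the totals row: Kbytes, RSS, Dirty.
  const totals = execSync(`pmap -x ${process.pid} | tail -n 1`).toString().trim();
  console.log(`pmap totals (Kbytes / RSS / Dirty): ${totals}`);
}

setInterval(sample, 30000);
```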

My best guess is that those allocated buffers are somehow retained by lmdb-store, but I may be completely missing something. I haven't investigated memory leaks in native code before, so I am not sure what the next step would be to confirm or rule out this hypothesis.

Do you think it is possible?

P.S. Some context: the code causing these allocations does long-lived cursor iterations (which may traverse 100,000+ items) with corresponding gets. This mostly happens synchronously (sadly, parts of the code are in user land and we don't have control over them).
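To illustrate the shape of that access pattern (the path, db names, and keys below are made up; only the getRange()/get() usage mirrors what our code does):

```js
const { open } = require('lmdb-store');

// Illustrative only: the path, db names, and key prefix are hypothetical.
const store = open({ path: './cache' });
const nodesDb = store.openDB('nodes');
const indexDb = store.openDB('index', { dupSort: true });

// Long-lived cursor iteration (may traverse 100,000+ items) with
// synchronous get() calls issued from inside the loop.
for (const { value: nodeId } of indexDb.getRange({ start: 'type:Foo' })) {
  // user-land code we don't control runs synchronously here
  const node = nodesDb.get(nodeId);
  // ... process node ...
}
```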

kriszyp commented 2 years ago

Hmm, I'm not sure, here are some thoughts/questions I have:

In general, almost all of the memory allocation involving lmdb-store happens during writes, not reads: serialization requires buffer allocations and LMDB has to allocate pages, and none of that is needed during reads.

vladar commented 2 years ago

Are you using getBinary to retrieve data?

No, it is the default msgpackr. (BTW, I notice much higher memory usage when switching to json encoding, although that is not related to the OOM; just a side note.)
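(If it helps to rule out deserialization, I could compare the two read paths with something like this sketch; the path and key are hypothetical, and getBinary should return the raw stored bytes without msgpackr decoding:)

```js
const { open } = require('lmdb-store');
const db = open({ path: './cache' }); // hypothetical path

// Read the same (hypothetical) key both ways to compare allocation behavior.
const decoded = db.get('some-key');   // default path: msgpackr deserialization
const raw = db.getBinary('some-key'); // raw stored bytes, no deserialization
console.log(typeof decoded, raw && raw.length);
```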

Is this related at all to using the temporary noSync+useWritemap database?

Nope, not using it yet. Just regular dbs and a couple of dupSort dbs

Do you think this is a new memory leak in v1.6?

I don't think so. I tried it with lmdb-store@1.5.5 and see the same effect. (I am not even sure it is related to lmdb-store, but the problem doesn't occur with our in-memory store.)

Of course cursors and their corresponding read-txns have some small amount of memory associated with them. Is it possible that there are enough open cursors to lead to memory accumulation?

That's my main suspect too. What would be a good way to track the number of open cursors / read transactions?

When the process OOMs, is it the V8 CALL_AND_RETRY_LAST error (and is it affected by --max_old_space_size), or is it somehow running out of swap space, or something else?

It actually gets killed by the OS OOM killer. I'll try setting vm.overcommit_memory to 2 the next time I run it, and will try playing with --max_old_space_size too.

I will likely get back to this by the end of the week and will try to come up with a repro. I was hoping the fact that the drop in external memory doesn't translate into a drop in RSS/dirty memory would give some hints.

P.S. Not sure, but maybe we are actually seeing this issue: https://github.com/nodejs/node/issues/38300 (Edit: probably not, as I am testing with Node 14.17.3, which should contain the fix for that issue.)

kriszyp commented 2 years ago

That's my main suspect too. What would be a good way to track the number of open cursors / read transactions?

For read txns, you can check with store.readerList(), which should return a string listing the open read txns. However, by default LMDB has a maximum of 126 open read txns (unless you have changed it; lmdb-store doesn't override LMDB's default), which seems unlikely to take more than about 1 MB of memory even if fully maxed out. There is also a store.readerCheck() method that requests that LMDB clean up any orphaned read txns from processes that have died/ended.
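For example, something like this run periodically while your workload is going would show whether read txns are accumulating (a rough sketch; the interval is arbitrary and `store` is your opened lmdb-store instance):

```js
// Periodically dump the reader table and ask LMDB to reap orphaned
// readers; watch whether the list keeps growing over time.
setInterval(() => {
  console.log(store.readerList()); // string listing the open read txns
  store.readerCheck();             // clean up read txns from dead processes
}, 10000);
```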

There is no public API for the number of cursors, although you could set a breakpoint in lmdb-store/index.js to look at the cursorTxns variable that holds an array of transactions dedicated solely to cursors.

Our lmdb-store server/app certainly exercises cursors heavily: when we run reindexing operations we make millions of getRange()/cursor traversal calls, and I think we typically don't see more than about 100 MB of external buffer memory usage (whereas our server process will easily consume 40 GB of RSS for the shared memory maps, but none of that is dirty, and the OS doesn't complain at all, merrily reclaiming any memory needed for other processes). However, we rarely have any overlapping cursor traversals (they are almost all directly converted to arrays), and for any long-lived cursor traversal I use snapshot: false in order to prevent a long-lived read txn from blocking free-space reuse in the LMDB database (that doesn't cause OOM, but it can cause your databases to get unnecessarily big). I could do some load testing of overlapping cursor traversals to see if there is a bug/leak there (based on your Unref() issue, it sounds like you do a lot more overlapping traversals than we do).
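For reference, opting out of the snapshot on a long traversal looks roughly like this (the path and key range are illustrative):

```js
const { open } = require('lmdb-store');
const store = open({ path: './data' }); // illustrative path

// snapshot: false lets LMDB reuse freed pages while this long traversal
// is still running, at the cost of a stable snapshot view of the data.
for (const { key, value } of store.getRange({ start: 'a', end: 'z', snapshot: false })) {
  // ... process each entry ...
}
```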

vladar commented 2 years ago

I couldn't repro this with lmdb-store alone, so I'm closing this, as I don't see anything actionable here.

Also, Node.js has several similar issues; some of them are linked at the bottom of this thread: https://github.com/nodejs/node/issues/31641 (for anyone experiencing similar symptoms).

pieh commented 2 years ago

For anyone reading: don't take this as advice; I'm not fully aware of the consequences.

Continuing this: in my search for an answer to this question I stumbled on https://github.com/nodejs/help/issues/1518#issuecomment-991798619, where trying a different allocator was suggested.

On a test run I did try jemalloc (instead of the default glibc malloc), and it seemed to help: RSS kept growing in general, but it was actually released periodically. (Though I have to say I had some forced gc() calls in place alongside the process.memoryUsage() logs, which were part of my ongoing work to make sure Node's heap doesn't grow endlessly, i.e. just finding and fixing leaks in my JS code.)

This suggests to me that there is no measurable leak here, and that the problem is that the default allocator just "performs worse" for the type of work we are doing (?).