Closed: vladar closed this 2 years ago
Hmm, I'm not sure, here are some thoughts/questions I have:

- Are you using `getBinary` to retrieve data? That does involve a buffer allocation, although it should be a relatively standard NodeJS buffer allocation.
- Is this related at all to using the temporary `noSync` + `useWritemap` database? I would think the data in those databases, and their corresponding mapped memory, would generally be "dirty" pages, since the whole idea is that they are not proactively written to disk.
- When the process OOMs, is it the V8 `CALL_AND_RETRY_LAST` error (and is it affected by `--max_old_space_size`), or is it somehow running out of swap space, or something else?

In general, almost all of the memory allocations involving lmdb-store happen during writes, not reads. Serializations require buffer allocations, and LMDB has to allocate pages, and none of that is needed during reads.
> Are you using `getBinary` to retrieve data?
No, it is the default `msgpackr` encoding. (BTW, I notice much higher memory usage when switching to the `json` encoding, although it is not related to OOM, just a side note.)
> Is this related at all to using the temporary `noSync` + `useWritemap` database?
Nope, not using it yet. Just regular dbs and a couple of `dupSort` dbs.
> Do you think this is a new memory leak in v1.6?
I don't think so. I tried it with `lmdb-store@1.5.5` and see the same effect. (I am not even sure if it is related to `lmdb-store` at all, but this problem doesn't occur with our in-memory store.)
> Of course cursors and their corresponding read-txns have some small amount of memory associated with them. Is it possible that there are enough open cursors to lead to memory accumulation?
That's my main suspect too. What would be a good way to track the number of open cursors / read transactions?
> When the process OOMs, is it the V8 `CALL_AND_RETRY_LAST` error (and is it affected by `--max_old_space_size`) or somehow running out of swap space or something else?
It actually gets killed by the OS OOM killer. I'll try to set `vm.overcommit_memory` to `2` the next time I try it. Will try playing with `--max_old_space_size` too.
I will likely get back to this by the end of the week and will try to come up with some repro. I was hoping the fact that the drop in `external` memory doesn't translate to a drop in RSS/dirty memory would give some hints.

P.S. Not sure, maybe we are actually seeing this issue here: https://github.com/nodejs/node/issues/38300 (Edit: probably not, as I am testing with Node 14.17.3, which should contain the fix for this issue)
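A minimal sketch of the kind of sampling involved, assuming Node is optionally started with `--expose-gc` so a collection can be forced between samples (without the flag the `gc()` call is simply skipped):

```javascript
// Sample Node's own memory accounting to watch for the pattern where
// `external` drops but RSS/dirty does not. The gc() call only runs when
// Node was started with --expose-gc; otherwise it is skipped.

function sampleNodeMemory() {
  if (typeof global.gc === 'function') global.gc(); // only with --expose-gc
  const { rss, heapTotal, external } = process.memoryUsage();
  const toMb = (bytes) => Math.round(bytes / 2 ** 20);
  return { rssMb: toMb(rss), heapTotalMb: toMb(heapTotal), externalMb: toMb(external) };
}

// Example: a large Buffer shows up under `external`; after dropping the
// reference (and a GC), `external` should fall - the question is whether
// RSS/dirty pages follow.
let buf = Buffer.alloc(64 * 2 ** 20).fill(1); // 64 MB, touched so pages are resident
const before = sampleNodeMemory();
buf = null;
const after = sampleNodeMemory();
console.log({ before, after });
```

Comparing these numbers against the `dirty` total from `pmap -x` on the same PID is what surfaces the divergence discussed in this thread.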
> That's my main suspect too. What would be a good way to track the number of open cursors / read transactions?
For read txns, you can check with `store.readerList()`, which returns a string listing the open read txns. However, by default LMDB has a maximum of 126 open read txns (unless you have changed it; lmdb-store doesn't change this default), which seems unlikely to take more than about 1 MB of memory even if fully maxed out. There is also a `store.readerCheck()` method that requests that LMDB clean up any orphaned read txns from any processes that have died/ended.
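As an illustration of those two calls, a small helper could periodically log the reader table. The `fakeStore` below is a stand-in stub so the sketch is self-contained; a real store would come from lmdb-store's `open()`, and its actual `readerList()` output is LMDB's reader-table dump, whose format differs from this invented one:

```javascript
// Periodically inspect LMDB reader state via the two lmdb-store methods
// described above. `fakeStore` stands in for a real lmdb-store instance;
// its readerList() output format is invented for illustration.

function logReaderState(store) {
  const list = store.readerList(); // string dump of the reader lock table
  const entries = list.split('\n').filter((line) => line.trim().length > 0);
  const reaped = store.readerCheck(); // reap read txns left by dead processes
  console.log(`reader-table lines: ${entries.length}, reaped: ${reaped}`);
  return { lines: entries.length, reaped };
}

// Stand-in store (assumption: a header line plus one line per open read txn).
const fakeStore = {
  readerList: () => 'pid thread txnid\n1234 7f0a 101\n1234 7f0b 102\n',
  readerCheck: () => 0,
};
const state = logReaderState(fakeStore);
```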
There is no public API for the number of cursors, although you could set a breakpoint in lmdb-store/index.js to look at the `cursorTxns` variable, which holds an array of transactions dedicated solely to cursors.
Our lmdb-store server/app certainly tests cursor usage heavily. When we run reindexing operations, we make millions of getRange()/cursor traversal calls, and I think we typically don't see any more than 100MB of external buffer memory usage (whereas our server process will easily consume 40GB of RSS for the shared memory maps, but none of that is dirty, and the OS doesn't complain at all, merrily reclaiming any memory needed for other processes). However, we rarely have any overlapping cursor traversals (they are almost all directly converted to arrays), and for any long-lived cursor traversals I use `snapshot: false` in order to prevent a long-lived read txn from preventing free-space reuse in the LMDB database (that doesn't cause OOM, but can cause your databases to get unnecessarily big). I could do some load testing of overlapping cursor traversals to see if there is a bug/leak there (based on your Unref() issue, it sounds like you do a lot more overlapping traversals than we do).
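For reference, a long-lived traversal passing that option might look like the sketch below. `stubStore` is a self-contained stand-in whose `getRange()` just returns a fixed array; a real store would come from lmdb-store's `open()`:

```javascript
// Long-lived range traversal passing `snapshot: false`, as described
// above, so the cursor does not hold a single long-lived read txn open
// for the whole walk. `stubStore` is a stand-in for a real lmdb-store.

function* traverseRange(store, options = {}) {
  // lmdb-store's getRange() yields { key, value } entries.
  for (const entry of store.getRange({ ...options, snapshot: false })) {
    yield entry;
  }
}

// Stand-in store: getRange() ignores its options and returns a fixed iterable.
const stubStore = {
  getRange: (_options) => [
    { key: 'a', value: 1 },
    { key: 'b', value: 2 },
  ],
};
const entries = Array.from(traverseRange(stubStore));
console.log(entries.length); // 2
```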
I couldn't repro this with `lmdb-store` alone, so closing as I don't see anything actionable here.
Also, NodeJS has several similar issues, some of them linked at the bottom of this thread: https://github.com/nodejs/node/issues/31641 (for anyone experiencing similar symptoms)
For anyone reading: don't take this as advice, I'm not fully aware of the consequences.
Continuing this: in my search for an answer to this question I stumbled on https://github.com/nodejs/help/issues/1518#issuecomment-991798619, where trying a different allocator was suggested.

On a test run I did try jemalloc (over the default glibc malloc) ... and it seemed to help: RSS kept growing in general, but it was actually released periodically (though I have to say that I had some forced `gc()` calls in there together with `process.memoryUsage()` logs, which was related to part of my ongoing work to make sure Node's heap doesn't grow endlessly, i.e. just finding and fixing leaks in my JS code).

This suggests to me that there is no measurable leak here and that the problem is the default allocator just "performing worse" for the type of work we are doing (?).
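Swapping the allocator needs no code changes; on Linux it can be done with `LD_PRELOAD`, roughly like this (the library path and `app.js` entry point are examples, not from this thread):

```shell
# Example only: the jemalloc path varies by distro; find yours with
#   ldconfig -p | grep jemalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 node app.js
```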
I am investigating a memory leak in one project and suspect that it might somehow be caused by `lmdb-store`. The leak is not happening in V8 - it occurs somewhere in native C++ code (I see big RSS values with normal Total V8 Heap values).

Also, I realize that when dealing with LMDB we should check for "dirty" pages only. I do not rely on RSS readings alone - I use `pmap -x` to monitor the actual "dirty" memory of the Node process. So, I am sampling memory in Node and see the following results:

Up to a point, `heapTotal + external` matches pretty closely the `dirty` value returned by `pmap`. But then at some point, I see this: essentially, Node releases 750MB of buffers, but the `dirty` value from `pmap` doesn't change. RSS doesn't change either. Eventually the process OOMs.

My best guess is that those allocated buffers are somehow retained in `lmdb-store`, but I may be completely missing something. I didn't investigate memory leaks in native code before, so I am not sure what the next step would be to confirm or discard this hypothesis. Do you think it is possible?
P.S. Some context: the code causing those allocations does long-living cursor iterations (which may traverse 100,000+ items) with corresponding `get`s. Also, this mostly happens in a sync manner (sadly, parts of the code are in user-land and we don't have control over them).