kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

pointer being freed was not allocated #48

Closed · devongovett closed this issue 3 years ago

devongovett commented 3 years ago

When using lmdb-store from node worker_threads, I sometimes get crashes with a message like

node(12601,0x172e3f000) malloc: *** error for object 0x120408000: pointer being freed was not allocated

I did a debug build of lmdb-store, and managed to get the process to stop in lldb. Here's the backtrace:

  * frame #0: 0x0000000190ac4e78 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000190af6b98 libsystem_pthread.dylib`pthread_kill + 292
    frame #2: 0x0000000190a3f460 libsystem_c.dylib`abort + 104
    frame #3: 0x000000019092804c libsystem_malloc.dylib`malloc_vreport + 560
    frame #4: 0x000000019092b61c libsystem_malloc.dylib`malloc_report + 64
    frame #5: 0x000000019091a78c libsystem_malloc.dylib`free + 516
    frame #6: 0x0000000104b7ddf0 lmdb-store.node`mdb_midl_free(ids=0x0000000120408008) at midl.c:117:3
    frame #7: 0x0000000104b69c04 lmdb-store.node`mdb_txn_end(txn=0x00000001058a3c00, mode=0) at mdb.c:3610:3
    frame #8: 0x0000000104b6ac1c lmdb-store.node`mdb_txn_commit(txn=0x00000001058a3c00) at mdb.c:4462:2
    frame #9: 0x0000000104bca6d0 lmdb-store.node`BatchWorker::Execute(this=0x0000000104e57990, executionProgress=0x0000000172e3ef30) at env.cpp:365:18
    frame #10: 0x0000000104bc7e74 lmdb-store.node`Nan::AsyncBareProgressWorker<char>::Execute(this=0x0000000104e57990) at nan.h:2107:5
    frame #11: 0x0000000104bc9be0 lmdb-store.node`Nan::AsyncExecute(req=0x0000000104e57998) at nan.h:2280:11
    frame #12: 0x00000001008ebca4 node`worker(arg=0x0000000000000000) at threadpool.c:122:5 [opt]
    frame #13: 0x0000000190af6fd4 libsystem_pthread.dylib`_pthread_start + 320

We verified that this only happens when running with worker threads, so I'm guessing that somehow a transaction is getting committed/ended twice on different threads? Is lmdb-store meant to be thread safe?

I'm still trying to come up with a minimal reproduction and debug this. But any pointers you have would be helpful. Thanks!

kriszyp commented 3 years ago

Yes, it is definitely meant to be thread-safe. Admittedly, we use lmdb-store with child processes (rather than worker threads), so it gets more heavily tested in that scenario, but I certainly want to ensure it works well with threads. A few questions: Are you on the latest version of lmdb-store (it looks like it from the line numbers)? Are you writing data with the new transactionAsync, or using put/remove? Any non-default options when starting the database/store? Anyway, I will see if I can reproduce locally.

devongovett commented 3 years ago

Yes, I built locally from master. Currently using put. I haven't tried transactionAsync yet. The options I'm using look like this:

const lmdb = require('lmdb-store');

lmdb.open(cacheDir, {
  name: 'parcel-cache',
  encoding: 'binary',
});

Thanks!

kriszyp commented 3 years ago

Also, I have been using the benchmark tests as my main means of pressure testing threads (and I just updated them to make sure they are fully running multi-threaded tests), which is run with npm run benchmark. However, I haven't been able to reproduce any double frees or any other types of crashes with it. If you have any ideas for operations to add for testing, I'd be happy to add them.

devongovett commented 3 years ago

Well, I've put many hours into debugging this so far. A preliminary solution that seems to work is to revert this commit: https://github.com/LMDB/lmdb/commit/cb256f409bb53efeda4ac69ee8165a0b4fc1a277. That commit appears to have been meant to address a leak: https://bugs.openldap.org/show_bug.cgi?id=9155.

One thing I have noticed is that there are a significant number of differences between the mdb.c in this repo and upstream. I'm not sure if that's just because it's outdated or whether more changes have been introduced here, but perhaps one of these differences is related? It might also be a bug in LMDB itself, but I find that less likely, as I imagine many people use it in multithreaded environments. I did try replacing mdb.c in lmdb-store with the upstream version, but that caused other issues.

I'm not really very familiar with the code, and I'm still wrapping my head around it. But it seems to me that there is a single pre-allocated write transaction that is reused here. mt_spill_pgs is allocated on demand here, but only when it is NULL. When it is freed here, it is not also reset to NULL. Is it possible that this pre-allocated transaction gets reused later, and the freed pointer gets reused as well, even though it now points to freed memory? I see it does get reset to NULL when a new transaction begins here, but only when there is a parent, which seems to line up with the if statement in mdb_txn_end that was changed in the commit I reverted.
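
To make the suspected failure mode concrete, here is a minimal standalone sketch (a toy program, not the actual mdb.c code; the spill field only loosely models mt_spill_pgs on LMDB's reused write transaction) of how freeing a lazily allocated list without resetting the pointer can turn struct reuse into a double free:

/* Toy illustration only, not the actual mdb.c code: a struct that is
 * reused across "transactions" carries a lazily allocated list that is
 * freed at the end of each use but never reset to NULL. */
#include <stdlib.h>

typedef struct {
    int *spill;                 /* loosely models txn->mt_spill_pgs */
} toy_txn;

static void toy_begin(toy_txn *txn)
{
    if (!txn->spill)            /* allocated on demand, only when NULL */
        txn->spill = malloc(64 * sizeof(int));
}

static void toy_end(toy_txn *txn)
{
    free(txn->spill);
    /* missing: txn->spill = NULL;
     * the reused struct now carries a dangling pointer, so the next
     * toy_begin() skips the allocation and the next toy_end() frees
     * the same block a second time */
}

int main(void)
{
    toy_txn reused = { 0 };     /* one pre-allocated, reused "write txn" */
    toy_begin(&reused);
    toy_end(&reused);
    toy_begin(&reused);         /* spill still points at freed memory */
    toy_end(&reused);           /* double free: malloc typically aborts here */
    return 0;
}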

Sorry for the brain dump and if this was really a question for the upstream project.

kriszyp commented 3 years ago

Thank you for the great research! I also have been trying to reproduce this with no luck, but this gives some valuable direction. My initial hypothesis is that the mdb_midl_free(txn->mt_spill_pgs) call on line 3610 actually takes place after the transaction mutex is unlocked (line 3602), which could produce a race condition (and would line up with this only occurring with threading): another thread could start and set txn->mt_spill_pgs to a new instance before the free call takes place, and the first thread would then end up calling mdb_midl_free on the same spill-pages list that the new transaction in the second thread is using. At first glance it seems very unlikely for the second thread to actually win that race (it has to go through a whole transaction and start spilling to create a new instance, while the first thread merely has to run the next line of code), but maybe there is some OS variation in yielding thread execution after a mutex unlock, or something similar, that triggers this in rare cases.
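
To illustrate this hypothesis, here is another standalone toy with pthreads (again, not the actual mdb.c code): commit_buggy models the unlock-then-free ordering, commit_fixed models freeing before the unlock, and the usleep is an artificial delay to widen the race window.

/* Toy race illustration with pthreads, not the actual mdb.c code: two
 * writer threads share one reused "write transaction" guarded by a
 * writer mutex. The buggy commit path releases the mutex BEFORE freeing
 * the shared spill list, so the other writer can install a new list
 * that then gets freed out from under it.
 * Build with: cc -pthread race_toy.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t wmutex = PTHREAD_MUTEX_INITIALIZER;
static struct { int *spill; } txn0;     /* the single reused write txn */

static void begin_write(void)
{
    pthread_mutex_lock(&wmutex);
    txn0.spill = malloc(64);            /* fresh spill list for this txn */
}

static void commit_buggy(void)
{
    pthread_mutex_unlock(&wmutex);      /* other writers may start now ... */
    usleep(50);                         /* artificial delay to widen the window */
    free(txn0.spill);                   /* ... so this may free THEIR list */
}

static void commit_fixed(void)
{
    free(txn0.spill);                   /* free before releasing the mutex */
    txn0.spill = NULL;
    pthread_mutex_unlock(&wmutex);      /* no window for another writer to race */
}

static void *writer(void *arg)
{
    int fixed = *(const int *)arg;
    for (int i = 0; i < 5000; i++) {
        begin_write();
        if (fixed)
            commit_fixed();
        else
            commit_buggy();
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    int fixed = 0;                      /* set to 1 for the safe ordering */
    pthread_create(&a, NULL, writer, &fixed);
    pthread_create(&b, NULL, writer, &fixed);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("finished (with fixed = 0, expect a malloc abort well before this)");
    return 0;
}

In this toy, the buggy ordering lets the same pointer reach free() from two threads, which is the same class of abort as the "pointer being freed was not allocated" message in the original report; freeing while still holding the mutex removes the window entirely.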

Anyway, this hypothesis might be completely wrong. I am going to try to exploit it with some intentional delays to see if I can reproduce this issue though and get back to you. Thanks again for the help!

Also, maybe I can put together a document of the changes I have made to mdb.c. Some of them are minor bug fixes I intend to submit upstream (I just haven't gotten around to it yet). Some of the other changes were me muddling through differences between C and C++ to experiment with some things. The most significant difference, though, is the on-the-fly resizing during transactions that lmdb-store implements. This does have threading implications, since new memory maps can be allocated on the fly in any thread; basically this is handled by simply never un-mapping previous memory maps. I don't think that is related to this issue at all, though, at least if my hypothesis is anywhere close to correct.

kriszyp commented 3 years ago

If this hypothesis is correct, the solution may simply be to move the mdb_midl_free(txn->mt_spill_pgs) call to line 3589, I think (if you want to give that a try).

kriszyp commented 3 years ago

Ok, I believe I have successfully reproduced this issue by introducing some delay/sleeps in between the unlock and the free. This does seem to be a real race condition, and I can verify that a double free can and does occur under the right conditions. And moving the free call, as mentioned above, does indeed seem to fix it as well. I will get a version published with a fix soon.

BTW, I am curious whether there are any peculiarities of your system that cause this issue to manifest. As mentioned, with both threads running, it seems rather extraordinary for the second thread to win the race and change the mt_spill_pgs pointer before the first thread's free call, and I assume that is why this has never been noticed in other multi-threaded LMDB applications. I wonder if there is maybe some limiting/throttling of CPU cores on your system that makes it more likely for the first thread to yield execution immediately after unlocking the mutex? Regardless, it is a legit bug, and I will get it fixed and submit a patch upstream to LMDB itself. Thanks for the great report and in-depth investigation!

devongovett commented 3 years ago

That's great! In my brief testing so far this appears to fix the issue for me. 🥳

I'm running on an M1 MacBook Air, which may be somewhat peculiar. I did notice that I am able to reproduce the issue much more easily if I start the process and then switch to another application and do something somewhat intensive, e.g. refresh a browser tab. This may support your theory of CPU throttling, or at least more aggressive time slicing.

@wbinnssmith also saw this issue on an Intel MacBook Pro, though it sounded like it might have happened after a significant number of writes. Maybe it's macOS-related in general?

For context, we are looking to switch from a filesystem-based cache in Parcel to LMDB. We're seeing some pretty impressive performance wins from doing that (~20% faster builds).

Thanks so much for your help with this issue!

kriszyp commented 3 years ago

Fascinating. I was actually just wondering whether it would be possible to find a computer with M1 silicon to test lmdb-store on and see if it has any issues. Thanks for doing that for me; it looks like we may have found one! (I do actually test with an old MBP that I have, and never saw this with the multi-threaded benchmark tests, but my MBP is probably from circa 2008, so it may not be that close to modern Macs in terms of subtle thread timing.)

wbinnssmith commented 3 years ago

I just ran lmdb-store@c346d2c with Parcel on an Intel Mac, and unfortunately the "pointer being freed was not allocated" crash still occurs. It does take some time to reproduce, running across multiple threads and sometimes for many tens of seconds, but it does happen. @kriszyp would an automated reproduction of the issue help even if it's in the much larger context of Parcel?

kriszyp commented 3 years ago

Sure, I would be happy to check out any reproduction of the issue, even in a larger context.

wbinnssmith commented 3 years ago

Okay, looks like c346d2c actually does indeed fix it 🎉.

It seems I made a mistake symlinking in my build of lmdb-store. Sorry for the false alarm! Thanks so much for diving into this @kriszyp 🙇

kriszyp commented 3 years ago

Fix published in v1.5.2.

devongovett commented 3 years ago

Thanks again @kriszyp!