kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

Intermittently with workers / high concurrency, lmdb will fail to retrieve a just-added key until the next pass of the event loop #250

Closed by marcins 12 months ago

marcins commented 12 months ago

This is a bug that we've found in Parcel (https://github.com/parcel-bundler/parcel/issues/9121) that happens very intermittently. I managed to reproduce it in a minimal environment in this repo: https://github.com/marcins/lmdb-testbed/tree/lmdb-only - the lmdb-only branch has been modified from the Parcel repro to use LMDB directly, avoiding @parcel/cache, which wraps LMDB.

The scenario is that in a worker thread, we do an await cache.put, and return the key to the main thread. In the main thread we do a cache.get. Intermittently this cache.get will fail. Retrying after a setTimeout(..., 0) always succeeds. The worker and the main thread have separate cache instances (which are just created with db.open(...), modelled on the Parcel implementation in https://github.com/parcel-bundler/parcel/blob/v2/packages/core/cache/src/LMDBCache.js).
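For reference, here's a minimal sketch of the failing pattern, assuming a hypothetical ./cache.db path and a module worker; the real setup is in the linked repro:

```js
// repro-sketch.mjs (hypothetical file name)
import { Worker, isMainThread, parentPort } from 'node:worker_threads';
import { open } from 'lmdb';

// Each thread opens its own handle to the same database.
const db = open({ path: './cache.db' });

if (isMainThread) {
  const worker = new Worker(new URL(import.meta.url));
  worker.on('message', (key) => {
    // Intermittently returns undefined, even though the worker's
    // put has already been committed...
    console.log('immediate read:', db.get(key));
    // ...while retrying on the next pass of the event loop succeeds.
    setTimeout(() => console.log('deferred read:', db.get(key)), 0);
  });
} else {
  const key = 'some-key';
  await db.put(key, { value: 42 }); // resolves once the write is committed
  parentPort.postMessage(key);      // hand the key back to the main thread
}
```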

I haven't had a chance to play with LMDB options to see if any of them help. I've also tried things like await cache.fulfilled, but that doesn't help. Interestingly, removing some of the console.log calls from the repro makes the issue go away, or at least happen less frequently.

I haven't dug into the LMDB side yet to try to identify the issue; I thought I'd raise it with what I know and the minimal repro, which can hopefully be used to identify the root cause (which I believe is some sort of sync/timing issue in the lmdb library).

kriszyp commented 12 months ago

Without digging into the code too much, I would guess the issue is that lmdb-js caches read transactions: a read transaction is reused from the time it is first used until a setImmediate callback, to avoid excessive read transaction resets. However, when receiving a message/notification from another thread/process, you should manually reset the read transaction to ensure reads use the latest view of the database. You can do that by just calling db.resetReadTxn() (after receiving a notification/message from another thread that a write has finished, and before attempting to read that data).
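As a minimal sketch, reusing the worker and db handles from the repro sketch above, the fix on the main thread would look like:

```js
// After being notified that the worker's write has committed:
worker.on('message', (key) => {
  db.resetReadTxn();        // discard the cached read txn, pick up the latest view
  console.log(db.get(key)); // now reliably sees the worker's committed write
});
```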

I can certainly dig into it more if this doesn't help.

lettertwo commented 12 months ago

I did a quick test by calling db.resetReadTxn() right before every read, and it does solve this for Parcel. I suppose we'll need to work out how to avoid doing this before every read, though...

kriszyp commented 12 months ago

@lettertwo generally you do this after receiving some notification message of a change on another thread or when a thread completes, so hopefully that is less frequent than every read?

marcins commented 12 months ago

It's complicated a bit by our "Cache" abstraction, where the LMDB-backed cache is one implementation of the cache interface. The main concern is avoiding making the abstraction too leaky for this case, but I think we're leaning towards hiding it a bit by adding a sync/refresh method to our Cache interface (which would no-op for other implementations) and calling it before reads in scenarios where write/read pairs happen close together (see the sketch below). The other complication in our case is that while there's a coordinating thread, the read and write could happen in two different threads, neither of which is the coordinating thread (each thread/worker has its own instance of LMDBCache / db connection).
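Something along these lines, perhaps; the refresh method name and the class shapes here are illustrative guesses, not Parcel's actual code:

```js
// LMDB-backed implementation: refresh resets the cached read transaction
// so subsequent reads see writes committed by other threads.
class LMDBCache {
  constructor(db) {
    this.db = db; // an lmdb-js database handle from open(...)
  }
  refresh() {
    this.db.resetReadTxn();
  }
  get(key) {
    return this.db.get(key);
  }
  async put(key, value) {
    await this.db.put(key, value);
  }
}

// Non-LMDB implementation: refresh is a no-op, keeping the interface uniform.
class MemoryCache {
  constructor() {
    this.map = new Map();
  }
  refresh() {} // nothing to reset for an in-process map
  get(key) {
    return this.map.get(key);
  }
  async put(key, value) {
    this.map.set(key, value);
  }
}
```

Call sites where a write on one thread is followed closely by a read on another would then call cache.refresh() before the read, without knowing which implementation is behind the interface.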

In any case, thanks for pointing us to that method - it certainly appears that was our issue. I did look through the docs, but I missed that method as being relevant to our situation. I'll close this issue out.