materialsproject / maggma

MongoDB aggregation machine
https://materialsproject.github.io/maggma/

Enhancement: Locking mechanism for file-based stores #832

Open Andrew-S-Rosen opened 1 year ago

Andrew-S-Rosen commented 1 year ago

As discussed in #828, most file-based database packages (including MontyDB in the already-implemented MontyStore) do not have any built-in protection against multiple Python processes (or threads) reading from or writing to the same database at the same time. This makes them useful only for serial calculations and less suitable for high-throughput settings, where the odds of a collision are very high.

Rather than relying on each external package to implement a file-locking system, we should introduce a file-locking mechanism within maggma that can be applied to all file-based data stores. py-filelock and portalocker are both good platform-agnostic options, with the former perhaps being slightly more actively maintained. There are built-in locking features in the MP monty package, but in my opinion we are better off using a battle-tested solution, since these libraries are usually light on dependencies anyway (and the lock mechanism used in fireworks often caused headaches...).
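A minimal sketch of the idea, assuming py-filelock and a store that is just a JSON list on disk. `locked_json` and `append_doc` are hypothetical names for illustration, not existing maggma API, and a real store would likely wrap its own read/write methods instead:

```python
import json
from contextlib import contextmanager
from pathlib import Path

from filelock import FileLock  # third-party: pip install filelock


@contextmanager
def locked_json(path, timeout=30.0):
    """Hold an exclusive lock while a JSON-list file is read and rewritten.

    Hypothetical helper: the lock lives in a sidecar "<path>.lock" file so the
    data file itself is only touched while the lock is held.
    """
    path = Path(path)
    lock = FileLock(str(path) + ".lock", timeout=timeout)
    with lock:  # raises filelock.Timeout if not acquired within `timeout` seconds
        docs = json.loads(path.read_text()) if path.exists() else []
        yield docs
        # Written back only if the caller's block finished without an exception.
        path.write_text(json.dumps(docs))


def append_doc(path, doc):
    """Append one document under the lock (a read-modify-write cycle)."""
    with locked_json(path) as docs:
        docs.append(doc)
```

Because every writer goes through the same sidecar lock file, concurrent calls to `append_doc` from separate processes or threads queue up rather than overwriting each other's changes. The same pattern should carry over to portalocker with only the lock object swapped out.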

I'm jotting this down so that I don't forget. I don't have plans to work on this right now, but I will likely need to implement it one day in the future.

munrojm commented 1 year ago

I like this idea

Andrew-S-Rosen commented 9 months ago

FYI: Here is what happens when two processes try to write to a MontyStore at the same time. It looks like MontyDB has a locking mechanism, but it doesn't support concurrent processes.

rkingsbury commented 9 months ago

I had started some work to replace mongomock with actual MongoDB in MemoryStore (see #846). Since JSONStore is backed by MemoryStore, I wonder whether doing this could also address the locking issue?

We have had success using JSONStore to run atomate2 workflows in low throughput, but I'm sure we would encounter a similar problem in high throughput.