kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

Take a look to libmdbx #24

Open erthink opened 3 years ago

erthink commented 3 years ago

https://github.com/erthink/libmdbx

Regards.

kriszyp commented 3 years ago

@erthink I took a stab at getting libmdbx to run with lmdb-store. It certainly looks like there are some useful features there. However, I did run into some issues. We primarily use lmdb-store on our Windows servers, where LMDB has excellent performance, so some of my questions and concerns from my initial prototyping are related to using it on Windows:

Anyway, thanks for the pointer and any thoughts you have on this.

erthink commented 3 years ago

I apologize for my long silence. I postponed the answer several times, because I had to finish something or fix it.

I ran into some compilation errors, specifically mo_AcquireRelease and atomic_load32 in mdbx_suspend_threads_before_remap in lck-windows.c were undefined

It was a bug, and it was fixed quite a while ago.

Is the on-disk data format of libmdbx compatible with LMDB?

No, it is not compatible with any version of LMDB. Once upon a time the format was the same, but it was changed in 2016-2017 and then frozen in 2017 (MDBX_DATA_VERSION == 2 since https://github.com/erthink/libmdbx/commit/61a3766e23673f662f288644ab457e17bd306e72). LMDB's mdb.master3 branch appeared much later.

As mentioned, performance is obviously one of the main reasons for using LMDB (specifically on Windows for us), but I noticed that libmdbx appears to be using FlushViewOfFile/FlushFileBuffers for syncing data on Windows which scales very poorly. On LMDB, this has been replaced with using write-through write (with overlapping).

OK, I will try to do this as part of https://github.com/erthink/libmdbx/issues/224.

Or is there another way to be notified when data has been written to disk for a given transaction (an asynchronous notification would be great as well)?

mdbx_env_sync_ex() can be used for polling or waiting. But there are no plans to implement full-fledged asynchronous notification, since there is no portable, robust and clear way to do this without an additional thread. And with such a thread we would only add overhead compared to the currently available approach of calling mdbx_env_sync_ex().

I noticed that when resizing occurs, open read transactions can block the resizing. For cursors that are open and connected to a read transaction, do they need to all be aborted, or can they just be reset, and renewed?

Briefly: all of this is handled reasonably and correctly. More details:

With database geometry, does size_upper actually define the map size or does the map size grow? I noticed that if I set the size_upper to 1TB for about 100 databases, I get an error: "The paging file is too small for this operation to complete." which seems to suggest that it is fully mapping the size_upper rather than progressively resizing, and maybe I will still have to vary map sizes for different size databases if I have a few hundred databases with some that are hundreds of GBs?

size_upper defines the maximum database size that must be handled transparently after the environment is opened, i.e. without re-opening it. It therefore also defines the mapping size, i.e. the address space reserved for possible growth of the database. When an environment is opened, the corresponding memory mapping is created, which reserves the necessary number of PTEs (page table entries) inside the OS kernel. A huge size_upper needs many PTEs, which themselves require some RAM. So if you open many huge DBs, your system may run out of memory simply because it reserves a huge number of PTEs.

Nonetheless, I am not sure that the Windows kernel (which is a nightmare, since it follows a bug-as-feature approach) does not mistakenly try to reserve space in the swap file for mappings with expandable, but still small, sections.
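
To put rough numbers on the page-table cost described above, here is a back-of-the-envelope sketch in Python. It assumes 4 KiB pages and 8-byte page-table entries (typical for x86-64) and that the kernel accounts for one entry per page of the reserved mapping. Those figures and the helper name pte_overhead_bytes are assumptions for illustration, not measurements of libmdbx or Windows.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
PTE_SIZE = 8       # assumed size of one page-table entry in bytes

TIB = 1 << 40
GIB = 1 << 30


def pte_overhead_bytes(size_upper: int) -> int:
    """Bytes of page-table entries needed to cover a mapping of size_upper bytes."""
    pages = -(-size_upper // PAGE_SIZE)  # ceiling division
    return pages * PTE_SIZE


# The scenario from the question: size_upper = 1 TiB for about 100 databases.
per_db = pte_overhead_bytes(TIB)
total = 100 * per_db
print(f"per database: {per_db / GIB:.0f} GiB, 100 databases: {total / GIB:.0f} GiB")
# prints "per database: 2 GiB, 100 databases: 200 GiB"
```

Under these assumptions, a size_upper chosen closer to each database's realistic maximum shrinks the reservation proportionally, which is consistent with the suggestion in the question to vary map sizes per database.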

kriszyp commented 2 years ago

@erthink A couple more questions: I assume that when using the safe-nosync mode and making an env-sync call, once finished it will update the last known safely persisted txn (and update a meta page?) so that prior free pages can be reclaimed? One could do an env-sync call after every commit, in a separate thread, as a way to asynchronously determine when commits are safely persisted, without blocking the next txn, and maximize freed page reuse?

According to the documentation, it looks like write operations will return a thread mismatch error if performed on a different thread than the txn was started. Is that constraint necessary? In lmdb (contrary to its documentation), write operations can be performed on a different thread as long as they are synchronized and the commit takes place on the same thread as the txn is started.

erthink commented 2 years ago

I assume that when using the safe-nosync mode and making an env-sync call, once finished it will update the last known safely persisted txn (and update a meta page?) so that prior free pages can be reclaimed? One could do an env-sync call after every commit, in a separate thread, as a way to asynchronously determine when commits are safely persisted, without blocking the next txn, and maximize freed page reuse?

Yes, yes, yes. Moreover, mdbx_env_sync_poll() is useful for such scenarios.
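
The pattern confirmed above (commit without waiting, then sync from a separate thread and treat the completed sync as the durability signal) can be sketched with a toy model. This is not the libmdbx API: ToyEnv, commit, sync and sync_worker are hypothetical names, and the "flush" is simulated. In a real binding the worker would call the library's sync function instead.

```python
import queue
import threading


class ToyEnv:
    """Toy stand-in for an environment opened in a safe-nosync-like mode."""

    def __init__(self):
        self._lock = threading.Lock()
        self.last_committed = 0  # newest txn id committed, maybe not yet on disk
        self.last_durable = 0    # newest txn id known to be safely persisted

    def commit(self) -> int:
        """Commit a write txn without flushing; returns its txn id."""
        with self._lock:
            self.last_committed += 1
            return self.last_committed

    def sync(self) -> int:
        """Simulated flush: everything committed before the call becomes durable."""
        with self._lock:
            target = self.last_committed
        # A real flush to disk would happen here, outside the lock.
        with self._lock:
            self.last_durable = max(self.last_durable, target)
            return self.last_durable


def sync_worker(env, requests, on_durable):
    """Runs in its own thread: one sync per request, then reports progress."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: stop the worker
            break
        on_durable(env.sync())


env = ToyEnv()
requests = queue.Queue()
durable_log = []  # durable txn ids as reported by the worker
worker = threading.Thread(target=sync_worker, args=(env, requests, durable_log.append))
worker.start()

for _ in range(3):
    txn_id = env.commit()
    requests.put(txn_id)  # the writer moves on to the next txn immediately

requests.put(None)
worker.join()
```

In this model the writer never blocks on a flush, and each value passed to on_durable tells the application which commits are now safe, which is also the point at which older free pages could be reclaimed.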

According to the documentation, it looks like write operations will return a thread mismatch error if performed on a different thread than the txn was started. Is that constraint necessary? In lmdb (contrary to its documentation), write operations can be performed on a different thread as long as they are synchronized and the commit takes place on the same thread as the txn is started.

This constraint was introduced to allow for more strict control, but there are no technical obstacles, just like in LMDB. See https://github.com/erthink/libmdbx/issues/200.
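
As an illustration of what such a check amounts to, here is a toy sketch of an owner-thread check that is a matter of policy rather than necessity. ToyWriteTxn, ThreadMismatch and the strict flag are hypothetical names for this example only and do not correspond to libmdbx or LMDB identifiers.

```python
import threading


class ThreadMismatch(Exception):
    """Raised when a strict txn is used from a thread other than its owner."""


class ToyWriteTxn:
    def __init__(self, strict: bool = True):
        self.owner = threading.get_ident()  # thread that started the txn
        self.strict = strict
        self._lock = threading.Lock()       # serializes writes in relaxed mode
        self.data = {}

    def put(self, key, value):
        if self.strict and threading.get_ident() != self.owner:
            raise ThreadMismatch("write from a thread that does not own the txn")
        with self._lock:
            self.data[key] = value


def try_put_from_other_thread(txn) -> str:
    """Attempt one write from a second thread and report what happened."""
    result = []

    def work():
        try:
            txn.put("k", "v")
            result.append("ok")
        except ThreadMismatch:
            result.append("mismatch")

    t = threading.Thread(target=work)
    t.start()
    t.join()
    return result[0]


print(try_put_from_other_thread(ToyWriteTxn(strict=True)))   # prints "mismatch"
print(try_put_from_other_thread(ToyWriteTxn(strict=False)))  # prints "ok"
```

With the check disabled, correctness depends entirely on the caller synchronizing access, which matches the condition stated in the question.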