Take a look to libmdbx - Githubissues

erthink commented 3 years ago

Regards.

kriszyp commented 3 years ago

@erthink I took a stab at getting libmdbx to run with lmdb-store. It certainly looks like there are some useful features there. However, I did run into some issues. We primarily use lmdb-store with our Windows servers, where LMDB has excellent performance, so some of my questions and concerns from my initial prototyping are related to using it on Windows:

First, I ran into some compilation errors, specifically mo_AcquireRelease and atomic_load32 in mdbx_suspend_threads_before_remap in lck-windows.c were undefined. I am not sure if I was doing something wrong with my building, but it seemed like a legitimate error (I copied code from core.c to get it to compile for testing).
Is the on-disk data format of libmdbx compatible with LMDB? And if so, which version (LMDB 0.9 has a different data format than the newer version on the mdb.master3 branch which is intended to be v1.0, I believe)?
As mentioned, performance is obviously one of the main reasons for using LMDB (specifically on Windows for us), but I noticed that libmdbx appears to be using FlushViewOfFile/FlushFileBuffers for syncing data on Windows which scales very poorly. On LMDB, this has been replaced with using write-through write (with overlapping) as it performs vastly better: https://git.openldap.org/openldap/openldap/-/commit/dfb3bbed656132456001c5aaca246fd4430e5ef5 Is this something that could be ported to libmdbx? Or is there another way to be notified of when data has been written to disk for a given transaction (an asynchronous notification would be great as well).
I noticed that when resizing occurs, open read transactions can block the resizing. For cursors that are open and connected to a read transaction, do they need to all be aborted, or can they just be reset, and renewed?
With database geometry, does size_upper actually define the map size or does the map size grow? I noticed that if I set the size_upper to 1TB for about 100 databases, I get an error: "The paging file is too small for this operation to complete." which seems to suggest that it is fully mapping the size_upper rather than progressively resizing, and maybe I will still have to vary map sizes for different size databases if I have a few hundred databases with some that are hundreds of GBs?

Anyway, thanks for the pointer and any thoughts you have on this.

erthink commented 3 years ago

I apologize for my long silence. I postponed the answer several times because I had to finish or fix something first.

I ran into some compilation errors, specifically mo_AcquireRelease and atomic_load32 in mdbx_suspend_threads_before_remap in lck-windows.c were undefined

It was a bug, and it was fixed quite some time ago.

Is the on-disk data format of libmdbx compatible with LMDB?

No, it is not compatible with any version of LMDB. Once upon a time, the format was the same, but it was changed in 2016-2017, and then frozen in 2017 (MDBX_DATA_VERSION == 2 since https://github.com/erthink/libmdbx/commit/61a3766e23673f662f288644ab457e17bd306e72). The LMDB's mdb.master3 branch appeared much later.

As mentioned, performance is obviously one of the main reasons for using LMDB (specifically on Windows for us), but I noticed that libmdbx appears to be using FlushViewOfFile/FlushFileBuffers for syncing data on Windows which scales very poorly. On LMDB, this has been replaced with using write-through write (with overlapping).

OK, I will try to do this within https://github.com/erthink/libmdbx/issues/224.

Or is there another way to be notified of when data has been written to disk for a given transaction (an asynchronous notification would be great as well).

The mdbx_env_sync_ex() could be used for polling or waiting. But there are no plans to implement a full-fledged asynchronous notification, since there is no portable, robust and clear way to do this without using an additional thread. And with such a thread we would only add overhead compared to the currently available way of using mdbx_env_sync_ex().

I noticed that when resizing occurs, open read transactions can block the resizing. For cursors that are open and connected to a read transaction, do they need to all be aborted, or can they just be reset, and renewed?

Briefly: all of this is handled in a reasonable way. In more detail:

A database file never shrinks to less than the largest MVCC snapshot used by any read transaction.
Until the upper size limit is changed, extending a database file does not affect readers, in particular by using NtExtendSection() on Windows.
Thus, until the "geometry" is changed explicitly by the user, a resize (which in this case may be performed automatically according to the specified "geometry") does not affect readers.
The resize can occur at the request of the user, or automatically in accordance with the specified parameters of the "geometry" of the database.
If the user explicitly changes the geometry by calling mdbx_env_set_geometry(), the library tries to minimize the impact on readers, including suspending threads while the memory mapping is re-created to change its size (which is required on Windows). Unfortunately, it is impossible to guarantee that such a change will succeed in every process that works with the database (for instance, the virtual memory map of some process may be too full). In the worst case, the closed memory mapping cannot be restored because the freed address space has already been occupied by another thread, and you will get an unexpected SIGSEGV/ACCESS_VIOLATION in the reader thread(s). This is not considered an issue since it is excluded on modern OSes that support mremap().

With database geometry, does size_upper actually define the map size or does the map size grow? I noticed that if I set the size_upper to 1TB for about 100 databases, I get an error: "The paging file is too small for this operation to complete." which seems to suggest that it is fully mapping the size_upper rather than progressively resizing, and maybe I will still have to vary map sizes for different size databases if I have a few hundred databases with some that are hundreds of GBs?

The size_upper defines the maximal database size that must be handled transparently after the environment is opened, i.e. without re-opening it, etc. So size_upper also defines the mapping size, i.e. the address space reserved for possible growth of the database. When an environment is opened, the corresponding memory mapping is created, which reserves the necessary number of PTEs inside the OS kernel. A huge size_upper requires many PTEs, which in turn requires some RAM. So if you open many huge DBs, your system may run out of memory simply because it reserves a huge number of PTEs.

Nonetheless, I am not sure that the Windows kernel (which is a nightmare, since it adheres to a bug-as-feature approach) does not mistakenly try to reserve space in the swap file for mappings with expandable, but still small, sections.
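
To get a feel for the numbers behind the answer above, here is a back-of-the-envelope estimate in Python. It is only a sketch under assumptions that are not stated in the thread: 4 KiB pages, 8-byte page-table entries (typical for x86-64), only last-level entries counted, and a kernel that accounts for every entry of the reserved range up front. "1TB" is read as 1 TiB.

```python
# Rough estimate of page-table overhead for large reserved mappings.
# Assumptions (not from the thread): 4 KiB pages, 8-byte PTEs, and only
# last-level page-table entries are counted.

PAGE_SIZE = 4 * 1024   # bytes per page (assumed)
PTE_SIZE = 8           # bytes per page-table entry (assumed, x86-64)

TIB = 1024 ** 4
GIB = 1024 ** 3


def pte_bytes(mapping_size: int) -> int:
    """Bytes of last-level PTEs needed to cover mapping_size bytes."""
    pages = -(-mapping_size // PAGE_SIZE)  # ceiling division
    return pages * PTE_SIZE


per_db = pte_bytes(1 * TIB)   # one database with size_upper = 1 TiB
total = 100 * per_db          # about 100 such databases

print(f"per database:  {per_db / GIB:.1f} GiB")
print(f"100 databases: {total / GIB:.1f} GiB")
```

Under these assumptions each 1 TiB reservation costs about 2 GiB of page-table entries, so a hundred of them would need roughly 200 GiB. Whether Windows really charges all of that at mapping time is exactly the open question raised above, but the estimate suggests why choosing size_upper per database, according to its expected growth, may still be necessary.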

kriszyp commented 2 years ago

@erthink A couple more questions: I assume that when using the safe-nosync mode and making an env-sync call, once it has finished it will update the last known safely persisted txn (and update a meta page?) so that prior free pages can be reclaimed? One could do an env-sync call after every commit, in a separate thread, as a way to asynchronously determine when commits are safely persisted, without blocking the next txn, and maximize freed page reuse?

According to the documentation, it looks like write operations will return a thread mismatch error if performed on a different thread than the txn was started. Is that constraint necessary? In lmdb (contrary to its documentation), write operations can be performed on a different thread as long as they are synchronized and the commit takes place on the same thread as the txn is started.

erthink commented 2 years ago

I assume that when using the safe-nosync mode and making an env-sync call, once it has finished it will update the last known safely persisted txn (and update a meta page?) so that prior free pages can be reclaimed? One could do an env-sync call after every commit, in a separate thread, as a way to asynchronously determine when commits are safely persisted, without blocking the next txn, and maximize freed page reuse?

Yes, yes, yes. Moreover, mdbx_env_sync_poll() is useful for such scenarios.
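
The pattern confirmed above (commit without waiting, sync from another thread, then learn which commits are durable) can be modelled without libmdbx at all. The sketch below is a toy simulation in Python: `ToyEnv`, `commit()`, `sync()` and `syncer()` are hypothetical stand-ins, not libmdbx bindings, and the real library keeps this state in its meta pages rather than in two integers.

```python
import queue
import threading


class ToyEnv:
    """Toy stand-in for an environment opened in a safe-nosync-like mode."""

    def __init__(self):
        self._lock = threading.Lock()
        self.last_committed = 0  # id of the newest committed txn
        self.last_durable = 0    # id of the newest txn known to be on disk

    def commit(self) -> int:
        """Commit a write txn without waiting for the disk."""
        with self._lock:
            self.last_committed += 1
            return self.last_committed

    def sync(self) -> int:
        """Pretend to flush; everything committed so far becomes durable."""
        with self._lock:
            self.last_durable = self.last_committed
            return self.last_durable


def syncer(env, requests, durable_log):
    """Background thread: one sync per commit notification."""
    while True:
        txn_id = requests.get()
        if txn_id is None:      # sentinel: the writer is done
            break
        durable_log.append(env.sync())


env = ToyEnv()
requests = queue.Queue()
durable_log = []

worker = threading.Thread(target=syncer, args=(env, requests, durable_log))
worker.start()

for _ in range(5):
    # The writer never waits for the disk; it only notifies the syncer.
    requests.put(env.commit())

requests.put(None)
worker.join()

print("last durable txn:", env.last_durable)
```

Each value appended to `durable_log` plays the role of the asynchronous "this txn is now persisted" notification. In libmdbx the calls named in this thread for the sync step are mdbx_env_sync_ex() and mdbx_env_sync_poll(); how they map onto this toy model is an assumption of the sketch.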

According to the documentation, it looks like write operations will return a thread mismatch error if performed on a different thread than the txn was started. Is that constraint necessary? In lmdb (contrary to its documentation), write operations can be performed on a different thread as long as they are synchronized and the commit takes place on the same thread as the txn is started.

This constraint was introduced to allow for more strict control, but there are no technical obstacles, just like in LMDB. See https://github.com/erthink/libmdbx/issues/200.

kriszyp / lmdb-js

Take a look to libmdbx #24