LMDB write scalability problem

jubax commented 7 years ago

I wrote a benchmark (based on my my real-world demands) which is doing the following:

Start with an empty LMDB database
Create two databases (one for MD5 hash to integer and one for integer to string (less than 20 characters) mappings)
Load data of different sizes (each object will add one key/value pair to both databases)
The whole load is executed in a single write transaction.

At first I'm getting good results, but when I come to larger numbers of data the performance deteriorates significantly.

Time to load the data:

1M 12s 934ms
2M 23s 962ms
4M 48s 579ms
8M 1m 55s 623ms
16M 6m 57s 487ms
32M 17m 8s 151ms
64M 40m 36s 137ms
128M 1h 36m 309ms
256M 5h 4m 56s 973ms
512M >12h and still running

You see the numbers always double, but the time to load gets worse as the size of the data increases.

The benchmark is running on Linux 4.10.0-35-generic with 64GB RAM and a HDD.

The same benchmark running on a macOS laptop (with a SSD) has better numbers but also gets significantly slower with a larger amount of data.

So my question is: Is this an expected behavior of LMDB? I'm just wondering if I can use LMDB for multiple billions of key/value pairs or if I'm reaching the limits (maybe the main memory is not sufficient)?

Note that I did not do further analysis, e.g.

if the behavior is reproducible with a single database instead of two databases
if committing after (say) 1M key/value pairs will improve load performance
...

Any ideas?

krisskross commented 7 years ago

We have a similar use case where I load several 100M entries. I would suggest you try asynchronous writes that doesn't fsync after each commit: EnvFlags.MDB_MAPASYNC, EnvFlags.MDB_NOSYNC, EnvFlags.MDB_NOMETASYNC. It pretty easy to find appropriate commit sizes - just double the batch size until you find where throughput doesn't increase. And remember to do a manual fsync when done writing Env.sync(true).

You can also get a quite significant boost by sorting the entries before insertion and use MDB_APPEND for appending records at the end of the database.

jubax commented 7 years ago

Thanks for your answer. I will experiment a bit with the flags you mentioned. However, I am not sure this will actually help:

1) Each dataset will be added in a single write transaction. E.g. the 256M dataset will add 256 million key/value pairs to both databases in one transaction (this is also my use case, so I cannot split the large transaction in multiple smaller ones). So I assume no syncing is done while the transaction is running and the given flags have no effect in this case. Is this assumption correct?

2) Regarding MDB_APPEND: I got the impression (although this is never explicitly mentioned in the documentation) that I cannot use this option on a non-empty LMDB database because not all new values are sorted after the existing entries. Can I use this option when executing write operations on a database which already contains data?

benalexau commented 7 years ago

You mentioned 512M entries. Unless using MDB_APPEND, LMDB will allocate a 2,030 byte page per entry and in turn that's a database size around 967 GB. As such you're at about 15 times your machine's RAM and disk IO is likely to be the bottleneck.

Just on MDB_APPEND, this will suppress the default LMDB behaviour of reserving space between consecutive entries. As such it offers a smaller overall database size (should you have pre-sorted keys available in the first place) but the trade-off is slower future inserts if MDB_APPEND can no longer be used. I just note this optimisation here in case you try MDB_APPEND.

If we take your 256M entry point of 5 hours, that's a database of around 484 GB in circa 18,000 seconds. Or about 28 MB/second. While that's well below the sequential RW performance of enterprise-grade magnetic drives, seek time overhead would be material for the random IO that LMDB is requesting for unsorted entry insertion. It would be interesting to review how well an SSD drive operated for your benchmark to help understand how much of the performance degradation can be attributed to the drive IO.

More generally as a rule of thumb if your workload requires stronger write-time throughput you should consider an LSM-based database like RocksDB. If your workload requires stronger read-time throughput you're probably better off with a B+ tree database like LMDB. But there's no free lunch with either approach; the LSM will require tuning to get the optimal merge sizes and you'll have a lot of file handles in use plus write amplification in background threads. B+ trees will be slower to insert, but once you've paid that price you have very fast reads with just one file handle and zero background worker threads in use.

You also mentioned MD5, which got me thinking whether you could be plain old CPU or memory bound due to the hashing approach or GC (allocation, deallocation, generations etc). The simplest approach is load visualvm and take a peek at the GC overhead. If you find a problem, I suggest trying to use https://github.com/OpenHFT/zero-allocation-hashing for your hashing and maybe just a byte[] as your value. While it's unlikely to be the issue you're facing, GC destroys the throughput of numerous "high performance" libraries I've published benchmarks for here on GitHub (hashing, databases, networks). It'd be worth checking on your CPU and GC behaviour just in case.

jubax commented 7 years ago

Thanks for your detailed comment. I'm still analyzing my problem and I think part of the problem is that I'm using multiple databases. I just re-run my benchmark, but I only used one database (id-to-string) and I'm getting:

1M      9s          89.1MiB     [entries:1000000][depth:3][pageSize:4096][branchPages:80][leafPages:22728][overflowPages:0]
2M      17s         267.5MiB    [entries:2000000][depth:3][pageSize:4096][branchPages:158][leafPages:45455][overflowPages:0]
4M      34s         624.2MiB    [entries:4000000][depth:4][pageSize:4096][branchPages:317][leafPages:90910][overflowPages:0]
8M      1m8s        1.2GiB      [entries:8000000][depth:4][pageSize:4096][branchPages:631][leafPages:181819][overflowPages:0]    
16M     2m17s       2.4GiB      [entries:16000000][depth:4][pageSize:4096][branchPages:1260][leafPages:363637][overflowPages:0]
32M     4m31s       4.9GiB      [entries:32000000][depth:4][pageSize:4096][branchPages:2518][leafPages:727273][overflowPages:0]
64M     9m1s        9.8GiB      [entries:64000000][depth:4][pageSize:4096][branchPages:5035][leafPages:1454546][overflowPages:0]
128M    18m22s      19.5GiB     [entries:128000000][depth:4][pageSize:4096][branchPages:10068][leafPages:2909091][overflowPages:0]
256M    37m         39.0GiB     [entries:256000000][depth:4][pageSize:4096][branchPages:20134][leafPages:5818182][overflowPages:0]
512M    1h16m29s    78.0GiB     [entries:512000000][depth:4][pageSize:4096][branchPages:40266][leafPages:11636364][overflowPages:0]
1024M   2h34m43s    156.1GiB    [entries:1024000000][depth:4][pageSize:4096][branchPages:80529][leafPages:23272728][overflowPages:0

As you can see the write performance is pretty good. Also note the size of the database.

I will add another comment as soon as I understand more about the problem.

spyngas commented 6 years ago

Hi - Looking at the Performance Benchmark code, please correct me if I am wrong but the write performance achieved is when all the data updates are written as part of single transaction. If I understand correctly, what this means is any readers looking for the data being written cannot actually read the data until the transaction is set to complete by the writer i.e. only when all of the updates are written. This is not what you would like to see in a practical scenario as you would like the readers to be able to immediately see what has been written.

benalexau commented 6 years ago

@spyngas, you are correct in that all databases have been tested in a single transaction scenario. I felt this was a reasonable approach as it shows the maximum performance each database can deliver with non-stop workloads, as this forms a major focus of my use cases (algorithmic trading back-tests). There are other benchmarks around if you'd like to see a mixture of reads and writes (for example YCSB).

spyngas commented 6 years ago

Thanks for the response. If the benchmark code is changed such that a transaction is committed after every write, the LMDB write performance is not that great. So, would it be fair to say LMDB cannot be really used in scenarios where you need Fast write - Fast Read responses for small data messages.

jubax commented 6 years ago

In the meantime I executed further benchmarks (some of the text below is from by mails to the LMDB mailing list - still unclear how the benchmarks could be explained).

1) Windows/Linux differences

When I add 1 billion key/value pairs (16 byte MD5) to the LMDB database (in a single transaction (but I also get similar results when I add the same data in multiple transactions)) I get the following results:

Windows, without MDB_WRITEMAP: 46h Windows, with MDB_WRITEMAP: 6h (!) Linux (ext4), without MDB_WRITEMAP: 75h Linux (ext4), with MDB_WRITEMAP: 73h

The root cause for the Linux/Windows difference is still unclear. I suspected that sparse files on Linux are simply slower than pre-allocated files (LMDB/Windows currently supports only pre-allocated files), but the LMDB inventor Howard does not seem to think so.

I would still like to test it, but I'm not sure when I will find time for it.

2) Insert behavior Linux

I insert 1 billion key/value pairs (the key is again a MD5 hash).

In the beginning the data is inserted relatively fast:

1M/1000M took:3317 ms (3s 317ms)

Then the insert performance deteriorates gradually. After inserting 642M entries inserting 1M entries takes more than 5 minutes:

642M/1000M took:305734 ms (5m 5s 734ms)

At this time the database size (data.mdb) is about 27GiB. The filesystem buffer cache has about the same value, so I assume most pages are cached. Linux still reports 28G free memory.

A short analysis with perf seems to indicate that most time is spent in mdb_page_spill

Children   Self
96,45%     0,00%   lmdbjava-native-library.so  [.] mdb_cursor_put
96,45%     0,00%   lmdbjava-native-library.so  [.] mdb_put
96,45%     0,00%   jffi8421248145368054745.so (deleted) [.] 0xffff80428d388b3f
60,43%     2,61%   lmdbjava-native-library.so  [.] mdb_page_spill.isra.16
47,39%    47,39%   lmdbjava-native-library.so  [.] mdb_midl_sort
26,07%     0,24%   lmdbjava-native-library.so  [.] mdb_page_touch
26,07%     0,00%   lmdbjava-native-library.so  [.] mdb_cursor_touch
25,83%     0,00%   lmdbjava-native-library.so  [.] mdb_page_unspill
23,22%     0,24%   lmdbjava-native-library.so  [.] mdb_page_dirty
22,99%    22,27%   lmdbjava-native-library.so  [.] mdb_mid2l_insert
11,14%     0,24%   [kernel.kallsyms]           [k] entry_SYSCALL_64_fastpath
10,43%     0,47%   lmdbjava-native-library.so  [.] mdb_page_flush
 9,95%     0,00%   libpthread-2.24.so          [.] __GI___libc_pwrite
 9,72%     0,00%   [kernel.kallsyms]           [k] vfs_write
 9,72%     0,00%   [kernel.kallsyms]           [k] sys_pwrite64
 9,48%     0,00%   [kernel.kallsyms]           [k] generic_perform_write
 9,48%     0,00%   [kernel.kallsyms]           [k] __generic_file_write_iter
 9,48%     0,00%   [kernel.kallsyms]           [k] ext4_file_write_iter
 9,48%     0,00%   [kernel.kallsyms]           [k] new_sync_write
 9,48%     0,00%   [kernel.kallsyms]           [k] __vfs_write
 9,24%     0,00%   lmdbjava-native-library.so  [.] mdb_cursor_set
 8,06%     0,47%   lmdbjava-native-library.so  [.] mdb_page_search
 7,35%     0,95%   lmdbjava-native-library.so  [.] mdb_page_search_root
 4,98%     0,24%   lmdbjava-native-library.so  [.] mdb_page_get.isra.13

The documentation about mdb_page_spill says (as far as I understand) that this function is called to prevent MDB_TXN_FULL situations. It actually seems my transaction is simply too big. However, see 1) for an example where the same scenario is pretty fast under Windows.

Note that a similar benchmark with 4 byte integer keys took only 2h34m for 1000M entries (the integer keys were sorted, but I did not use MDB_APPEND).

Any ideas on these results?

I would love to see someone execute similar benchmarks to verify my results. Linux analysis and tuning tips are also welcome.

Anyway, my current conclusion is that you have to verify carefully that your scenario actually works with LMDB on your chosen hardware/OS combination.

Unfortunately there are not many other databases/kv stores which actually support ACID transactions and are still fast (e.g. RocksDb has a fairly limited transaction model).

benalexau commented 6 years ago

Sorry, I cannot explain the Windows vs Linux differences. They are surprising and I'd suggest you ask on the LMDB mailing list for suggestions.

Note that a similar benchmark with 4 byte integer keys took only 2h34m for 1000M entries (the integer keys were sorted, but I did not use MDB_APPEND).

I noticed something similar in my own tests. It's not just performance where you'll see a benefit from inserting in sequential order, but also in terms of storage used. @hyc described it as follows:

When we do inserts in sequential order, we bias the split so that the new page has more room than the old page, on the assumption that inserts will continue to be sequential. When we do sequential inserts using MDB_APPEND, there is no page split at all - we just allocate a new page and fill it, sequentially. These behaviors are both unique to LMDB.

For a random insert pattern, you'll be left with a lot of half-full pages. This is normal for all Btrees, and it's intentional. It leaves room for more future inserts, and minimizes the total number of page splits in a sustained random insert workload.

If you take a DB that was generated sequentially, with fully packed pages, and then start adding new records to it in random order, it will be a lot slower because so many inserts will require page splits.

More generally I'd say you have to know what your workload requirements are and make the necessary trade-offs between write performance, read performance and storage consumption. For example I've found it useful to run an end of day process that repacks the data from a "scratch" LMDB database into the permanent LMDB database. The repacking applies integer compression and also MDB_APPEND to save very significant amounts of space.

kiril-kirov commented 5 years ago

@jubax probably long forgotten issue, but still - any findings on the major performance difference between Linux and Windows? I just got into the same situation between Windows and MacOS, but I guess it's the same as with Linux, being both Unix based. I can also see that mdb_put takes really long time on MacOS, while it's super fast on Windows.

benalexau commented 4 years ago

They are surprising and I'd suggest you ask on the LMDB mailing list for suggestions.

I will close this ticket given it has now been 2.5 years since my above comment and I don't think there's anything specific to the LmdbJava Benchmarks or LmdbJava that would contribute to differences in write performance between the various operating systems. Additionally there have been multiple new versions of LMDB in that period (in turn bundled in LmdbJava) and these differences may no longer apply.

lmdbjava / benchmarks

LMDB write scalability problem #9