kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

Excessive storage overhead #194

Open joepie91 opened 2 years ago

joepie91 commented 2 years ago

Hi,

While trying to build a tool on lmdb for querying linear datasets, I observed excessively high storage overhead: the database size on disk could easily grow to several times the size of the original CSV data, despite all data being stored in arrays and using msgpack or CBOR encoding.

I don't have a simple reproducible testcase for that degree of overhead yet, but I did write a very simple reproducible case that yielded over 50% overhead (a 156.1 MiB database for 1 million records with 20-byte keys and 80-byte values). While this is significantly lower than what I was seeing in my project, it still seems to be much higher than the documented overhead from lmdb's own benchmarks, which works out to more like 1%.
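(For reference: 1,000,000 × (20 + 80) bytes comes to roughly 95.4 MiB of raw key/value data, so 156.1 MiB on disk works out to about 64% overhead.)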

Looking at the database file with a hex editor, there appear to be many large regions of null bytes. What is causing this amount of overhead? Is this an issue with lmdb-js specifically, or did something change in upstream lmdb that made the storage overhead much worse?

The testcase:

"use strict";

// Minimal reproduction: insert 1 million entries with random 20-byte keys
// and random 80-byte values into a freshly created database.
const lmdb = require("lmdb");
const crypto = require("crypto");
const tmp = require("tmp");

const ITERATIONS = 1_000_000;

// Binary key and value encoding, so the stored sizes match the raw byte sizes.
let db = lmdb.open({ path: tmp.tmpNameSync(), encoding: "binary", keyEncoding: "binary" });

for (let i = 0; i < ITERATIONS; i++) {
    // put() is asynchronous and not awaited here, so all of these writes are
    // queued within the same event turn (and thus batched into one transaction).
    db.put(crypto.randomBytes(20), crypto.randomBytes(80));

    if (i % 500 === 0) {
        console.log(i);
    }
}
joepie91 commented 2 years ago

So I have a ~1.3 GiB file that seemingly has much higher overhead than this testcase, and it contains many large regions of null bytes. It looks like some kind of 4K page structure where some pages are >90% null bytes, though I don't know enough about LMDB internals to know whether that's actually what I'm looking at.

I've uploaded the file here, as it only contains public GTFS data anyway: https://transfer.sh/lap4gq/data.mdb (the link will expire after 14 days though)

kriszyp commented 2 years ago

A few thoughts on this: First, I don't see where lmdb's benchmarks claimed only 1% storage overhead. The "load size" from the first table looks like it is a little over 8% by my calculations. And I believe this is an unusually compact storage for LMDB, since it almost certainly was done with sequential loading, leading to mostly "full" b-tree nodes. For random writes (like your example), b-tree nodes will be randomly splitting and growing, so they will also be progressively growing from 50% full (which would be 100% overhead compared to the used data) to 100% full (0% overhead), so I would expect an average/amortized overhead of 50%, plus overhead for headers and such. And I think this is roughly what we are both seeing for creating the table in a single transaction (because it is done in a single event turn). If you do this same thing with sequential keys, I get something more like 110MB with code like your test case.
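For reference, a sequential-key variant of the test case could look something like the sketch below; the zero-padded counter key is just one way to generate monotonically increasing 20-byte keys, not the exact code that was run:

"use strict";

// Sketch: same data volume as the random-key test case, but with
// monotonically increasing keys so b-tree pages fill up more densely.
const lmdb = require("lmdb");
const crypto = require("crypto");
const tmp = require("tmp");

const ITERATIONS = 1_000_000;

let db = lmdb.open({ path: tmp.tmpNameSync(), encoding: "binary", keyEncoding: "binary" });

for (let i = 0; i < ITERATIONS; i++) {
    // 20-byte key: the iteration counter, zero-padded so keys sort in insertion order.
    db.put(Buffer.from(String(i).padStart(20, "0")), crypto.randomBytes(80));
}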

Once you start dealing with multiple transactions, and potentially having read txns, things start getting a lot more complicated and less compact. This is at least partially because of how LMDB handles free space: it won't reclaim/reuse the free space left by prior transactions if a transaction is still potentially in use, or if there has not yet been confirmation that a prior transaction has been flushed to disk. I did notice one bug in this regard, which I have fixed with the referenced commit; it affects reusing free space after restarting lmdb on the first transaction.

Free-space preservation also increases with lmdb-js's newer overlapping sync functionality. This is intended to provide better performance, but it can result in more outstanding writes without flush confirmation, which leads to longer delays until free space is reclaimed. If you are more interested in storage than performance, you might try using overlappingSync: false.
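For example, the open call from the test case with overlapping sync disabled would look like this, trading some write throughput for quicker free-space reuse:

// Same open call as the test case, but with overlapping sync disabled so
// free pages can be reclaimed sooner at the cost of write throughput.
let db = lmdb.open({
    path: tmp.tmpNameSync(),
    encoding: "binary",
    keyEncoding: "binary",
    overlappingSync: false
});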

However, admittedly, some of the storage growth is puzzling to me too. It doesn't seem like LMDB is doing a very good job at reusing free space, even when it seems like it clearly should be. In such cases the free-space database clearly has a record of many free pages that could be used. I will try to investigate further why those aren't used more.

joepie91 commented 2 years ago

Thanks for the quick response!

First, I don't see where lmdb's benchmarks claimed only 1% storage overhead. The "load size" from the first table looks like it is a little over 8% by my calculations.

Ah, I was looking at the "Final Size" column, which is about 1%; I interpreted the "Load Size" to mean the total data written to disk (including data that was later overwritten), and the "Final Size" to be the size of the database after all was said and done. That interpretation may be wrong; I'm not sure.

And I believe this is an unusually compact storage for LMDB, since it almost certainly was done with sequential loading, leading to mostly "full" b-tree nodes. For random writes (like your example), b-tree nodes will be randomly splitting and growing, so they will also be progressively growing from 50% full (which would be 100% overhead compared to the used data) to 100% full (0% overhead), so I would expect an average/amortized overhead of 50%, plus overhead for headers and such. And I think this is roughly what we are both seeing for creating the table in a single transaction (because it is done in a single event turn). If you do this same thing with sequential keys, I get something more like 110MB with code like your test case.

This makes sense. Unfortunately, I am still seeing much worse overhead than this in my actual project (which is a write-once-then-read-only use case, albeit with a non-sequential key distribution), and more concerningly, the overhead is highly inconsistent.

I've now done three runs with the exact same dataset of around 1.2 GiB, as measured by tallying up the msgpack'd size of the key plus the msgpack'd size of the value, across all of the entries that were inserted. In each case, the data is inserted in the same order. The three runs resulted in database sizes of 3.0 GiB, 2.3 GiB, and 4.5 GiB(!) respectively, for what should be the exact same data. I can provide the database files and project code if it would be helpful.
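(For reference, the ~1.2 GiB figure was tallied roughly like the sketch below; msgpackr is used here for illustration, and the project's actual encoding code differs:)

// Rough sketch of the size tally: sum the msgpack-encoded size of every key
// and value alongside the actual put().
const { pack } = require("msgpackr");

let totalBytes = 0;
function measuredPut(db, key, value) {
    totalBytes += pack(key).length + pack(value).length;
    return db.put(key, value);
}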

Once you start dealing with multiple transactions, and potentially having read txns, things start getting a lot more complicated and less compact. This is at least partially because of how LMDB handles free space: it won't reclaim/reuse the free space left by prior transactions if a transaction is still potentially in use, or if there has not yet been confirmation that a prior transaction has been flushed to disk. I did notice one bug in this regard, which I have fixed with the referenced commit; it affects reusing free space after restarting lmdb on the first transaction.

Free-space preservation also increases with lmdb-js's newer overlapping sync functionality. This is intended to provide better performance, but it can result in more outstanding writes without flush confirmation, which leads to longer delays until free space is reclaimed. If you are more interested in storage than performance, you might try using overlappingSync: false.

These test results were with overlappingSync: true. Disabling overlappingSync unfortunately has such a severe performance impact that it wouldn't be usable for my case :( Also, in my use case, no (explicit) transactions are involved at any point, so I assume that those wouldn't explain the issue.

However, admittedly, some of the storage growth is puzzling to me too. It doesn't seem like LMDB is doing a very good job at reusing free space, even when it seems like it clearly should be. In such cases the free-space database clearly has a record of many free pages that could be used. I will try to investigate further why those aren't used more.

Thanks!

atomictag commented 2 years ago

Actually, what I found rather surprising is that not even drop() or clear() seems to reduce the size of the file at all. Also, when I restart my app and run rootDb.getStats(), the free values never seem to get reclaimed (if there are any).

kriszyp commented 2 years ago

not even drop() or clear() seems to reduce the size of the file at all.

That's right; LMDB does not have any ability to actually reduce the size of the database file. It can only mark pages of the file as free for future use.

run rootDb.getStats(), the free values never seem to get reclaimed

What were the free stats? The free-space database's records are lists of free page numbers. So if you have 100MB of free space, that is about 25K free pages (at 4KB per page); at 8 bytes per entry, recording them takes about 200KB, or roughly 50 pages. In other words, 100MB of free space is represented by only about 50 free-space pages.
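(If you want to check this yourself, something like the following prints the stats object that lmdb-js exposes; the path is a placeholder, and the exact fields reported can vary by version:)

// Open an existing database (placeholder path) and dump its statistics;
// the page and entry counts give a sense of how much space is in use.
const lmdb = require("lmdb");

let rootDb = lmdb.open({ path: "./my-data.mdb" });
console.log(rootDb.getStats());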

My current strategy is that I would like to implement a throttle on transactions when there is a certain number of outstanding previous transactions waiting to be flushed, so that free space can be reclaimed more promptly.

atomictag commented 2 years ago

My current strategy is that I would like to implement a throttle on transactions when there is a certain number of outstanding previous transactions waiting to be flushed, so that free space can be reclaimed more promptly.

That sounds cool. I guess, given how LMDB works, a "stop-the-world" utility such as db.collect() to programmatically reclaim unused free-space pages would not really be feasible, right? (I thought about that after seeing #198)

kriszyp commented 2 years ago

I guess, given how LMDB works, a "stop-the-world" utility such as db.collect() to programmatically reclaim unused free-space pages would not really be feasible, right?

Yes. It is conceivable to use the compact backups (assuming I get them fixed) to reclaim free space, but like you mention, that would require stopping the database, making the backup, and swapping it in.
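A rough sketch of what that workflow might look like is below; the compacting second argument to backup() and the placeholder paths are assumptions here, not a supported one-step API:

// Sketch of "stop, make a compacting copy, swap it in". The second argument
// to backup() requesting a compacted copy is an assumption, and the paths
// are placeholders.
const lmdb = require("lmdb");
const fs = require("fs");

async function compactInPlace(dbPath) {
    let db = lmdb.open({ path: dbPath });
    const compactedPath = dbPath + ".compacted";
    await db.backup(compactedPath, true);   // compacting copy of the live database
    await db.close();                       // stop the world: no more readers or writers
    fs.renameSync(compactedPath, dbPath);   // swap the compacted copy into place
}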

kriszyp commented 2 years ago

I would like to implement a throttle on transactions when there is a certain number of outstanding previous transactions waiting to be flushed

This is implemented in v2.7.0. It doesn't guarantee highly compact databases, but I believe it does improve free-space reclamation, and it seemed to result in smaller database sizes after multiple runs of the test code from this ticket.

atomictag commented 2 years ago

Just took 2.7.2 for a spin after #203 got solved, and I can definitely see a significant improvement in my storage tests. Thank you!