wbolster / plyvel

Plyvel, a fast and feature-rich Python interface to LevelDB
https://plyvel.readthedocs.io/

leveldb created from large batch writes temporarily grows too much from 'compacting': lesson - compact before moving to system with low storage #154

Closed qrdlgit closed 1 year ago

qrdlgit commented 1 year ago

I have a large leveldb created by plyvel using reasonably large batch writes.

I've copied it over to a new system for get-only purposes - no writes / puts.

I use code like this:

import json
import plyvel

# Open the existing database and iterate over every key/value pair.
db = plyvel.DB('./db.lvl', create_if_missing=True)
for enum_i, (k, v) in enumerate(db):
    kd = k.decode("utf-8")
    vd = json.loads(v.decode("utf-8"))

While iterating in this manner, the database has grown from 7 GB to over 14 GB, adding hundreds of new files in the process. Maybe this is 'compacting', but it's threatening to use up all of the limited space I have in that particular location, and it seems a bit unreasonable considering I'm only doing get calls.

Is it because I'm enumerating the data?

My workaround is to process the data in batches, deleting everything and copying the entire database over again after each batch (see the sketch below). Crazy, but I see no alternative.
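A sketch of what I mean - the paths, batch size, and process() helper here are placeholders, not my actual code:

import json
import shutil

import plyvel

PRISTINE_COPY = './db_pristine.lvl'   # untouched copy of the database
WORKING_COPY = './db.lvl'             # copy that gets opened and grows
BATCH_SIZE = 100_000                  # entries to process per pass


def process(key, value):
    # stand-in for the real per-record work
    json.loads(value.decode('utf-8'))


start_key = None
while True:
    db = plyvel.DB(WORKING_COPY)
    it = db.iterator(start=start_key)
    count = 0
    last_key = None
    for k, v in it:
        process(k, v)
        last_key = k
        count += 1
        if count >= BATCH_SIZE:
            break
    it.close()
    db.close()

    if count < BATCH_SIZE:
        break  # reached the end of the database

    # throw away the (possibly grown) working copy, restore the original,
    # and resume just past the last processed key
    shutil.rmtree(WORKING_COPY)
    shutil.copytree(PRISTINE_COPY, WORKING_COPY)
    start_key = last_key + b'\x00'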

wbolster commented 1 year ago

not sure why this happens.

try compacting the db perhaps?

this is not a plyvel behaviour btw, but a leveldb thing, so maybe you can find other avenues with more knowledgeable people… curious to learn more though 🙃

qrdlgit commented 1 year ago

>>> import plyvel
>>> db = plyvel.DB("./db_copy.lvl")
>>> db.compact_range()

Just keeps growing and growing... sigh. Already gone from 51 files to 705, and 'du' reports 7603876 -> 9256684

'compacting'... perhaps they should rename this process to make it clearer.

Odd that I can't find a lot of discussion about this problem. Does leveldb not get a lot of usage? It was easier to install than rocksdb.

qrdlgit commented 1 year ago

Note: I re-ran this on a separate system with more storage space, and at the end of the compaction it ended up OK (it did increase by 600 MB, but is probably much more performant). 3704 files, however.

Note that it blew up all my storage space on the limited-space system. I had 19 GB free on that system, and compaction temporarily used it all up, starting from a 7 GB database with 51 files.

So, lesson learned: make sure you compact before copying the database to a system with limited storage space.

My guess is that my massive batch writes were creating really suboptimal .ldb files: very fast to write, but not great for gets. Compaction fixes that, but it doesn't use temporary space in a very controllable or friendly way.
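For reference, the batch writes were roughly along these lines (a sketch only - the key/value generation and sizes are purely illustrative, not the real load code):

import json

import plyvel


def source_records():
    # stand-in for the real data source
    for i in range(1_000_000):
        yield f'key-{i:09d}', {'value': i}


db = plyvel.DB('./db.lvl', create_if_missing=True)

# many puts grouped into one large write batch; the batch is written
# atomically when the context manager exits
with db.write_batch() as wb:
    for key, record in source_records():
        wb.put(key.encode('utf-8'), json.dumps(record).encode('utf-8'))

db.close()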

wbolster commented 1 year ago

you could also try increasing the block size. see notes about that here, which mention bulk scans explicitly:

https://github.com/google/leveldb/blob/main/doc/index.md
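for example, something like this when opening the db - the exact value here is just a guess:

import plyvel

# open with a larger block size than the ~4 KiB default; the leveldb docs
# suggest experimenting with larger blocks for bulk scans
db = plyvel.DB('./db.lvl', block_size=256 * 1024)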