wbolster / plyvel

Plyvel, a fast and feature-rich Python interface to LevelDB
https://plyvel.readthedocs.io/

Duplicated values for one key still stored on disk #142

Closed mrx23dot closed 2 years ago

mrx23dot commented 2 years ago

When I store the same key-value pair 1 million times, why do I see more than one instance in the database file?

I understand why it happens with .log files. Can this be turned off? But why in .ldb files?

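import plyvel

db_dir = './testdb'  # assumed path; db_dir was not defined in the original snippet
db = plyvel.DB(db_dir, create_if_missing=True, compression=None, paranoid_checks=False)
# Write the same key/value pair one million times
for i in range(1000000):
    db.put(b'1'*20, b'2'*32)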

[Screenshot: 2022-04-06 10_36_29-000005 ldb]

It looks like only the value is stored multiple times, even though I don't use branching/versioning.

How can I make this more efficient? (no write when value is the same, overwrite when different)

plyvel==1.4.0, Python 3.7.0, Windows 10

wbolster commented 2 years ago

this is certainly not a plyvel issue.

this is how leveldb and log structured merge trees (lsm) work in general. ‘make more efficient’ implies that it is inefficient now but i'm not sure why that would be the case.

mrx23dot commented 2 years ago

By 'more efficient' I mean it could check at a lower level whether the key already has the same value, instead of me having to check it in slow Python for each key I want to insert.
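
For reference, the per-key check described here might look roughly like the sketch below. It assumes only that the database object exposes get(key) and put(key, value), as plyvel.DB does. DictStore and put_if_changed are hypothetical names introduced for illustration; DictStore is an in-memory stand-in so the snippet runs without LevelDB.

```python
class DictStore:
    """Hypothetical in-memory stand-in exposing get/put like plyvel.DB."""

    def __init__(self):
        self._data = {}
        self.writes = 0  # counts how many times put() was actually called

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value
        self.writes += 1


def put_if_changed(db, key, value):
    """Write only when the stored value differs; return True if written."""
    if db.get(key) == value:
        return False  # same value already stored, skip the write
    db.put(key, value)
    return True


store = DictStore()
key, value = b'1' * 20, b'2' * 32
# Attempt the same write many times; only the first one reaches put()
written = sum(put_if_changed(store, key, value) for _ in range(1000))
```

With a real database, the same helper could be called as put_if_changed(db, key, value) on a plyvel.DB instance. Each call then costs a read, so whether this is faster than simply writing depends on the workload.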

wbolster commented 2 years ago

i suggest you read up on how leveldb and lsm trees work. duplicate values will be compacted away. there are also various levels of in-memory caches involved, both in leveldb and (usually) the operating system
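
To illustrate the point about compaction, here is a deliberately simplified toy model of a log-structured store. It is not LevelDB's actual implementation; the compact function and the list-based log are assumptions made for illustration. Writes are appended rather than overwritten in place, so repeated puts of the same key pile up until a compaction step keeps only the newest entry per key.

```python
def compact(entries):
    """Keep only the newest value per key; return entries sorted by key."""
    latest = {}
    for key, value in entries:  # later entries overwrite earlier ones
        latest[key] = value
    return sorted(latest.items())


# Each write appends a record; nothing is overwritten in place.
log = []
for _ in range(5):
    log.append((b'1' * 20, b'2' * 32))
log.append((b'a', b'old'))
log.append((b'a', b'new'))

# After compaction, one entry per key remains, holding the latest value.
compacted = compact(log)
```

In LevelDB, compaction runs in the background on its own schedule. plyvel also exposes db.compact_range() to request a manual compaction, which may be useful when inspecting the files on disk.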