wbolster / plyvel

Plyvel, a fast and feature-rich Python interface to LevelDB
https://plyvel.readthedocs.io/
Other
531 stars 76 forks source link

Bloom filter #144

Closed mrx23dot closed 2 years ago

mrx23dot commented 2 years ago

Based on doc I interpret bloom_filter_bits to be applied once to entire DB. But when I do 128MB bloom/DB:

db = plyvel.DB(DB_DIR, bloom_filter_bits=128*(1024**2)*8, create_if_missing=True, compression=None, paranoid_checks=False)

it takes up +12GB RAM

bloom_filter_bits=1073741824 # same result bloom_filter_bits=1*(1024**2) # same result

doc say: bloom_filter_bits (int) – the number of bits to use for a bloom filter; the default of 0 means that no bloom filter will be used

So either it's not working or doc should say that's per record bases.

mrx23dot commented 2 years ago

I would be also nice to mention if it needs to be the same for writing the DB, or just reading it.

wbolster commented 2 years ago

i'm not convinced this is actually plyvel specific, since plyvel passes the number directly to NewBloomFilterPolicy:

options.filter_policy = NewBloomFilterPolicy(bloom_filter_bits)

...for use in the options passed to leveldb:

st = leveldb.DB_Open(self.options, fsname, &self._db)

so i guess you either misunderstand how it works (sorry, don't know myself), or leveldb is acting in an unexpected way.

wbolster commented 2 years ago

you're passing a huge value

>>> 128*(1024**2)*8
1073741824

while ‘a good value’ is a mere 10 bits. it's per key:

// Return a new filter policy that uses a bloom filter with approximately
// the specified number of bits per key.  A good value for bits_per_key
// is 10, which yields a filter with ~ 1% false positive rate.
//
// Callers must delete the result after any database that is using the
// result has been closed.
//
// Note: if you are using a custom comparator that ignores some parts
// of the keys being compared, you must not use NewBloomFilterPolicy()
// and must provide your own FilterPolicy that also ignores the
// corresponding parts of the keys.  For example, if the comparator
// ignores trailing spaces, it would be incorrect to use a
// FilterPolicy (like NewBloomFilterPolicy) that does not ignore
// trailing spaces in keys.
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);

see https://github.com/google/leveldb/blob/main/include/leveldb/filter_policy.h#L54-L68

mrx23dot commented 2 years ago

'per key' makes sense, thank you!

wbolster commented 2 years ago

cool, added it to the docs as well