Closed mrx23dot closed 2 years ago
I would be also nice to mention if it needs to be the same for writing the DB, or just reading it.
i'm not convinced this is actually plyvel specific, since plyvel passes the number directly to NewBloomFilterPolicy
:
options.filter_policy = NewBloomFilterPolicy(bloom_filter_bits)
...for use in the options passed to leveldb:
st = leveldb.DB_Open(self.options, fsname, &self._db)
so i guess you either misunderstand how it works (sorry, don't know myself), or leveldb is acting in an unexpected way.
you're passing a huge value
>>> 128*(1024**2)*8
1073741824
while ‘a good value’ is a mere 10 bits. it's per key:
// Return a new filter policy that uses a bloom filter with approximately
// the specified number of bits per key. A good value for bits_per_key
// is 10, which yields a filter with ~ 1% false positive rate.
//
// Callers must delete the result after any database that is using the
// result has been closed.
//
// Note: if you are using a custom comparator that ignores some parts
// of the keys being compared, you must not use NewBloomFilterPolicy()
// and must provide your own FilterPolicy that also ignores the
// corresponding parts of the keys. For example, if the comparator
// ignores trailing spaces, it would be incorrect to use a
// FilterPolicy (like NewBloomFilterPolicy) that does not ignore
// trailing spaces in keys.
LEVELDB_EXPORT const FilterPolicy* NewBloomFilterPolicy(int bits_per_key);
see https://github.com/google/leveldb/blob/main/include/leveldb/filter_policy.h#L54-L68
'per key' makes sense, thank you!
cool, added it to the docs as well
Based on doc I interpret bloom_filter_bits to be applied once to entire DB. But when I do 128MB bloom/DB:
db = plyvel.DB(DB_DIR, bloom_filter_bits=128*(1024**2)*8, create_if_missing=True, compression=None, paranoid_checks=False)
it takes up +12GB RAM
bloom_filter_bits=1073741824 # same result bloom_filter_bits=1*(1024**2) # same result
doc say: bloom_filter_bits (int) – the number of bits to use for a bloom filter; the default of 0 means that no bloom filter will be used
So either it's not working or doc should say that's per record bases.