jankotek / mapdb

MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap-memory. It is a fast and easy to use embedded Java database engine.
https://mapdb.org
Apache License 2.0

Using Bloom Filters to optimize overflowing HTreeMap reads #1032

Open Aravind-Suresh opened 9 months ago

Aravind-Suresh commented 9 months ago

Consider the case where we use HTreeMap with overflow, i.e. a top-level in-memory map (let's call it primary) and an underlying disk-based map (let's call it secondary), with keys periodically overflowing from primary to secondary. Let's call this combination a hybrid map.
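
For context, here is a minimal sketch of this kind of hybrid setup using MapDB's `expireOverflow`; the map names, size limit and file path are placeholders, not anything from the actual use case:

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class HybridMapExample {
    public static void main(String[] args) {
        // Disk-backed "secondary" map (file name is illustrative)
        DB dbDisk = DBMaker.fileDB("overflow.db").make();
        HTreeMap<String, String> secondary = dbDisk
                .hashMap("secondary", Serializer.STRING, Serializer.STRING)
                .createOrOpen();

        // In-memory "primary" map that overflows evicted entries into `secondary`
        DB dbMemory = DBMaker.memoryDB().make();
        HTreeMap<String, String> primary = dbMemory
                .hashMap("primary", Serializer.STRING, Serializer.STRING)
                .expireMaxSize(1_000_000)     // start evicting past 1M in-memory entries (illustrative)
                .expireAfterCreate()
                .expireOverflow(secondary)    // evicted entries are written to the disk map
                .createOrOpen();

        primary.put("k1", "v1");
        // A get first checks the in-memory map; on a miss it falls through to `secondary`.
        String value = primary.get("k1");

        dbMemory.close();
        dbDisk.close();
    }
}
```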

When we do hybrid.get(key), we first do a primary.get(key); if that returns null, we fall through to a secondary.get(key), which can be expensive when the key is not present at all.

So for use-cases where most operations insert new keys, this secondary.get(key) adds extra latency on most calls.

A potential optimisation here is to wrap the read calls (get, containsKey) with an in-memory bloom filter (a set would also work, but a bloom filter is far more memory-efficient) and return null early when the filter reports that the key is definitely absent.
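
To illustrate the wrapping logic (outside MapDB itself), here is a sketch of such a guard using Guava's BloomFilter; the class name, key type and sizing parameters are assumptions for the example:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical wrapper, not part of MapDB: guards reads on the overflowing
// HTreeMap with a bloom filter so that misses skip the disk lookup entirely.
class BloomGuardedMap {
    private final Map<String, String> hybrid;      // in-memory HTreeMap configured with expireOverflow
    private final BloomFilter<String> filter;

    BloomGuardedMap(Map<String, String> hybrid, long expectedKeys, double fpp) {
        this.hybrid = hybrid;
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), expectedKeys, fpp);
    }

    void put(String key, String value) {
        filter.put(key);              // record the key so future reads pass the guard
        hybrid.put(key, value);
    }

    String get(String key) {
        // "definitely not present" -> skip both the primary and the secondary lookup
        if (!filter.mightContain(key)) {
            return null;
        }
        return hybrid.get(key);       // may still miss: bloom filters allow false positives
    }

    boolean containsKey(String key) {
        return filter.mightContain(key) && hybrid.containsKey(key);
    }
}
```

One caveat with this shape: if the secondary map is not empty at startup, the filter has to be pre-populated with the existing keys, and removals need care since a plain bloom filter does not support deletion.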

We tried this wrapping logic and it worked well for our use-case, where more than 80% of operations inserted new keys into the map.

Happy to raise a PR for this once we have alignment -- the plan is to add a withBloomFilter(configs) method to the HashMapMaker class so the behaviour can be enabled selectively when required.
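
For concreteness, one possible call-site shape for the proposed option; the withBloomFilter parameters shown here are purely illustrative, not a settled signature:

```java
// Hypothetical API shape for the proposed HashMapMaker option -- not part of MapDB today.
HTreeMap<String, String> primary = dbMemory
        .hashMap("primary", Serializer.STRING, Serializer.STRING)
        .expireOverflow(secondary)
        .withBloomFilter(1_000_000, 0.01)   // expected insertions, target false-positive rate (illustrative)
        .createOrOpen();
```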