opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Tiered Caching] [Milestone 1] KeyLookup store for disk cache #10309

Open sgup432 opened 1 year ago

sgup432 commented 1 year ago

Is your feature request related to a problem? Please describe.
Refer to the main issues https://github.com/opensearch-project/OpenSearch/issues/10024 and https://github.com/opensearch-project/OpenSearch/issues/10870.

As part of the disk-based cache, we need to check whether a key is present on disk before extracting its value; otherwise we may perform an unnecessary disk seek for a key that is not there. We can avoid this by storing the disk cache keys in memory in an optimized way.

Details
As part of tiered caching milestone 1, we are looking to add a disk tier, where items evicted from the in-memory cache will be spilled to disk.

Whenever a request comes in, we need to check whether it is present in either cache, i.e. in-memory or disk. Doing a disk GET on every request just to find out whether the key is present can be expensive and adds to the overall search latency. To solve this, we maintain the disk keys in memory as well. Instead of storing the whole key in memory (which can be expensive), we store only the key's hash code (an integer) using this key store implementation. There may be false positives, but they should be few, and the actual keys are still stored on disk.
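For illustration, a minimal sketch of how such a gate in front of the disk tier could look. All names here are hypothetical, and a plain `HashSet` stands in for the optimized hash-code store discussed below:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal, self-contained sketch (hypothetical names): the in-memory store keeps only
// hash codes of on-disk keys; a miss there lets us skip the disk seek entirely.
class HashOnlyKeyStore {
    private final Set<Integer> hashes = new HashSet<>(); // stand-in for the bitset/roaring bitmap discussed below

    void add(int hash) { hashes.add(hash); }
    boolean mightContain(int hash) { return hashes.contains(hash); }
    void remove(int hash) { hashes.remove(hash); }
}

class DiskTierReader<K, V> {
    private final HashOnlyKeyStore keyStore;
    private final Map<K, V> diskCache; // stand-in for the actual on-disk cache

    DiskTierReader(HashOnlyKeyStore keyStore, Map<K, V> diskCache) {
        this.keyStore = keyStore;
        this.diskCache = diskCache;
    }

    V get(K key) {
        // Definitely not on disk: avoid the expensive disk GET altogether.
        if (!keyStore.mightContain(key.hashCode())) {
            return null;
        }
        // A false positive is possible here, so the disk lookup itself stays authoritative.
        return diskCache.get(key);
    }
}
```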

To implement this key store, we looked at different data structures such as a bitset array and a roaring bitmap. A plain bitset can be expensive memory-wise because of how it works: for example, if we have only one key whose hash code is a large number N, the bitset has to allocate enough memory to set the Nth bit, i.e. roughly N bits.
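As a rough illustration of that cost with `java.util.BitSet` (the 100M bit index is just an example value):

```java
import java.util.BitSet;

// Setting a single large bit index forces the backing array to cover that index,
// so one key can cost roughly N/8 bytes of memory.
public class BitSetCost {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(100_000_000);                          // a single key whose hash is ~100M
        System.out.println(bits.size() / 8 + " bytes"); // ~12.5 MB allocated for just one key
    }
}
```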

On the other hand, a roaring bitmap is essentially a compressed bitset that works better memory-wise but is relatively slower. Here too we need to be mindful of the maximum memory the roaring bitmap will use, so we have added some logic around that.
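A minimal sketch of what a capped roaring-bitmap-backed store could look like, assuming the `org.roaringbitmap` library is on the classpath; the cap value and the overflow behavior are assumptions for illustration, not the actual implementation:

```java
import org.roaringbitmap.RoaringBitmap;

// Sketch of a hash-code store backed by a roaring bitmap, with a crude memory cap.
public class RoaringKeyStore {
    private static final long MAX_BYTES = 1024 * 1024; // hypothetical 1 MB budget

    private final RoaringBitmap bitmap = new RoaringBitmap();
    private boolean full = false;

    public boolean add(int hash) {
        if (full) {
            return false; // stop tracking new keys once the budget is exhausted
        }
        bitmap.add(hash);
        if (bitmap.getLongSizeInBytes() > MAX_BYTES) {
            full = true;
        }
        return true;
    }

    public boolean contains(int hash) {
        // Once full, we can no longer say "definitely absent", so report a possible hit.
        return full || bitmap.contains(hash);
    }

    public void remove(int hash) {
        bitmap.remove(hash);
    }
}
```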


msfroh commented 9 months ago

Reading through the linked PR, I think we need to clarify what the interface for the key lookup should be before diving into the implementation.

It sounds like the key lookup is supposed to look like a set, with add, contains, and remove operations, right?
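If so, a minimal version of that interface might look like this (names are illustrative, not the actual OpenSearch API):

```java
// Hypothetical shape of the key lookup store: a set of key hash codes with
// probabilistic contains() semantics (false positives allowed, false negatives not).
public interface KeyLookupStore {
    boolean add(int keyHash);

    boolean contains(int keyHash);

    boolean remove(int keyHash);
}
```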

Taking into account that the design specifies that values get added to the cache in batches, and eviction can reasonably be deferred (since you allow for false positives), your add and remove operations could take a collection of items to avoid needing to repeatedly acquire a write lock to mutate the key store.
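For example, a batched variant could acquire the write lock once per batch rather than once per key; the roaring-bitmap backing and the names below are assumptions for illustration:

```java
import java.util.Collection;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.roaringbitmap.RoaringBitmap;

// Sketch of batched mutations: one write-lock acquisition per batch instead of per key.
public class BatchedKeyLookupStore {
    private final RoaringBitmap bitmap = new RoaringBitmap();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void addAll(Collection<Integer> keyHashes) {
        lock.writeLock().lock();
        try {
            for (int hash : keyHashes) {
                bitmap.add(hash);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void removeAll(Collection<Integer> keyHashes) {
        lock.writeLock().lock();
        try {
            for (int hash : keyHashes) {
                bitmap.remove(hash);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    public boolean contains(int keyHash) {
        lock.readLock().lock();
        try {
            return bitmap.contains(keyHash);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```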