[Proposal] Tiered caching - OpenSearch

sgup432 commented 1 year ago

Co-Author: @jainankitk

Problem Statement

OpenSearch relies heavily on various caches to speed up the data retrieval process and thereby providing significant improvement in search latencies. This is usually done by caching query specific data in memory, avoiding multiple disk seeks and thereby saving resources(CPU/JVM) by not processing the query again. But cache size is limited by the amount of memory available on a node. In cases where we are dealing with larger datasets which can potentially be cached, this causes a lot of cache evictions/misses and potentially impacting performance as we need to process the query again causing high resource consumption.

Solution

Considering the limitations of in memory caches(as described in above section), there could be a significant performance gain if we consider a tiered cache solution. Tiered caching is basically a multi level cache with each tier having it’s own characteristics and performance levels. This will help us in adding more caching tiers so that we have the capability to cache a much larger dataset which is not limited by available memory.

Tiered caching options:

Disk backed Cache: This will serve as an extension to in-memory cache. It will be able to store additional data which may not fit in the memory. We can imagine this as a simple key value store on disk with ability to expire/evict items. It can provide much faster access times for search queries by fetching the results directly from disk and avoid expensive query recomputation which also takes a heavy toll on resources like CPU/JVM.
Off heap Cache: We can also consider to cache data off-heap potentially using direct byte buffer.
Remote Cache: This is another tier which can serve as an extension. Remote cache can be useful to save cost of storing the cached data in comparison to storing it locally on disk. It can help to cache much larger set. It comes with it's own additional network latency cost but still provide a significant for non cheap queries.

Use cases

Tiered caching option is not meant for every use case. Like for example, where customer is using non cacheable queries or have a search use case(does frequent updates). In such cases this option may not be feasible. We try to list down the use cases/scenarios in which this might be feasible.

OpenSearch domains where we see a lot of cache evictions for desired type of caches(request/query). In such cases we can overflow to lower tiered cache(like disk) and provide a much larger cache size.
- Users can further analyze to see if there is a good cache hit ratio for their existing caches. A very low Cache hit counts/ratio will tell us that either there many unique queries or data is getting refreshed/updated pretty quickly.
Log analytics use case where the indices are rolled over and older indices become read only. (Considering queries are cacheable).
Read-only indices

Details around Existing caches

IndicesRequestCache: Caches the whole request on a shard level. Not all queries are cacheable and it is decided using this logic. Overall DFS type search queries, range queries with now() date, request with size >0 are not cacheable.
- Key used in cache: Uses a composite key which is a combination of (index shard + query itself + IndexReader.CacheKey). These three uniquely define a key. IndexReader.CacheKey changes whenever refreshes or segment merge happens.
- Invalidation: Key used here in never updated to replace its old with new value. Though it can become stale in case index shard is deleted or a refresh happens(when data has changed). Stale keys are deleted using a background job called CacheCleaner which runs every minute(can potentially reduce this for faster cleanup).
IndicesQueryCache: This caches DocIDSet on a segment level for a shard. Only works for non scoring sub queries like filter queries. As it can cache sub filter queries, this can be used across queries. It utilizes some caching policy to decide whether it needs to cache the query or not.
- Key used: Uses IndexReader.CacheKey as above. Becomes stale after segment merge.

High level design

Approach: Spillover evicted items to a lower tier cache

Below diagram takes disk based cache as an example. But can be extended for other tiers as well. Here OS in-memory cache refers to the existing OpenSearch caches we currently have like RequestCache/QueryCache.

overflow_to_disk_1 (1)

Key components/features:

KeyLookup store(A): For tiers like disk, checking whether the key is present or not can be an expensive/unnecessary operation in case key is not present on that tier. Here we will maintain all cache keys for this lower tier in memory for fast lookup and avoid an extra hop. In case we don’t maintain such keys in memory, we might end up hitting lower tier many times where the key was not present and thereby increasing overall search shard latencies.
- We can potentially use a key hashcode along with a data structure like roaring bitmap to achieve this.

This component in below diagram describes the point where the key was not present in both in-memory or disk based cache(as an example) and the value needs to be loaded using default mechanism. In this case loading value means running the query and fetching results. overflow_to_disk_1-Page-2 (1) (1) Total cache miss: In case there is a total cache miss for a request i.e. result is not present in either in-memory or disk based cache, we will cache this request inside in-memory cache like the way it happens now. Evicted items will be spilled over to a lower tier based cache.

Query cost: This can be for example time taken by the query when it was loaded from disk during cache miss. X(configurable) is used as a threshold to decide whether it makes sense to store the query result in lower tier based cache or not as certain queries might be fast enough without it. For an example, a simple term query might be fast enough to not store it on disk based cache.
1. Batch I/O operations: In case of a disk based/remote cache, we can consider to batch IO operations to avoid cost to do expensive writes to disk for example which can add extra latency to the search path. We can later flush all values in the background by using a separate thread. We can limit number of batched items on basis of entries count(A) or total size(bytes)(B) whichever is breached first. Will be kept configurable.
Pulling entry from lower tier cache to on-heap/in-memory cache(C): Once we retrieve result from lower tier like disk based cache, we can consider to pull the entry into in-memory cache. For a start we can make this decision based on the frequency ie more frequently accessed item can pulled into in-memory. We can also limit this using a setting to disallow too many items being pulled at once.

Cache eviction strategies

OpenSearch uses LRU based eviction policies for in-memory caches(Request/Query cache). LRU is a popular eviction policy due to its simplicity and provides decent cache hit rate in common scenarios. But it is known to be not optimal as it can evict items which might have been required in the near future due to occasional burst of other items.

There are other eviction policies like ARC(Adaptive Replacement Cache), Window TinyLfu etc which are proven to perform better than LRU. But for initial phases, we will stick to existing LRU based policy and eventually consider different policies in later phases.

Cache cleanup

Below is being talked from RequestCache perspective.

Whenever a invalidation happens, stale keys are left hanging in the cache. Currently OpenSearch has a background job which runs every 1 minute to clean the stale keys lying in the cache. It keeps collecting stale keys by listening to IndexReader close events. We can use similar mechanism for lower tier cache as well.

Milestones

Milestone 1:

As part of this milestone 1, we are planning to cover Request cache as it is more of a simplistic cache.
We will provide disk tier support. We will plan to use an opensource embedded caching library for this.
We will reuse existing OpenSearch in-memory cache with LRU based eviction policy. Evicted items will be spilled over to disk based cache.

Milestone 2:

In this phase, we will also consider QueryCache(caches sub filter queries) and extend it to support tiered support.
Consider queries with size > 0 to be cached in request cache. As of now we will only cache queries in requestCache with size=0 by default.
Consider pulling disk based cache entries into memory based on frequency or other factor.

Milestone 3:

We will consider caffeine to replace in house in-memory OpenSearch cache. Caffeine uses window tinyLRU based eviction policy which has been proven to provide a better cache hit ratio.
Explore offheap store based cache.

sgup432 commented 1 year ago

We performed some POCs with disk tier based cache. Will add numbers here.

sgup432 commented 1 year ago

POC

We did POC with OpenSearch request cache.
We added a disk tier support so that any evicted items from in-memory cache can be spilled over to disk based cache.
We evaluated many embedded java libraries for our disk tier based cache support.
- We checked EhCache, caffeine, JCS, mapDb etc but eventually chose EhCache as it is widely used, proven to be performant, has rich features and provides disk tier support.

Performance testing

As part of performance testing, we divided this into two phases. As part of phase-1 we wanted to benchmark pure disk cache based latency vs existing in-memory cache latency. As part of phase-2, we tried to simulate a real scenario by adding a disk tier support to existing cache and running nyc_taxis like workload.

Phase-1

Goal: Benchmarking pure disk based cache latencies against OpenSearch in-memory cache.

Workload: We indexed nyc_taxis dataset and used nyc_taxis related queries. We used a variation of nyc_taxis query and ran it multiple times(resulting in cache hits except for first one) to benchmark latencies.

Setup:

We used a single node domain with c5.large instance with EBS attached.
Amazon managed opensearch version 1.3
Single index shard - nyc_taxis

Test-1

In test-1, we used a query and changed(increased) its range to make it more computationally expensive but keep the response size fixed to compare disk tier cache latencies vs in-memory cache.

Query was range filter query with aggregating avg and sum on 2 fields. Aggregation buckets not increasing as we increased the range below. _pickupdatetime was changed in below query {"size":0,"query":{"bool":{"filter":[{"range":{"pickup_datetime":{"gte":"2015-01-01 00:00:00","lte":"2015-01-01 11:59:59"}}}],"must_not":[{"term":{"vendor_id":"Vendor XYZ"}}]}},"aggs":{"avg_surcharge":{"avg":{"field":"surcharge"}},"sum_total_amount":{"sum":{"field":"total_amount"}}}}
Jan 1 to Jan 1, Jan 1 to Feb 1 etc represents range values we used for the query to make it more computationally expensive as can be seen from cache miss latency.
POC disk only was pure disk tier cache by plugging in Ehcache.

Conclusion:

As we can see, disk only latencies were ~1.5ms more than existing OpenSearch in-memory cache. Disk cache was also coming into picture which could have made it this fast.

Test-2

As part of this, we also wanted to see impact of disk tier cache latencies when we increase the response size as we increase/expand the range of query.

Query was range filter with aggregation buckets increasing as we increase the range of pickup_datetime below {"size":0,"query":{"bool":{"filter":[{"range":{"pickup_datetime":{"gte":"2015-01-01 00:00:00","lte":"2015-02-01 11:59:59"}}}],"must_not":[{"term":{"vendor_id":"Vendor XYZ"}}]}},"aggs":{"avg_surcharge":{"avg":{"field":"surcharge"}},"sum_total_amount":{"sum":{"field":"total_amount"}},"vendor_id_terms":{"terms":{"field":"vendor_id","size":100},"aggs":{"avg_tip_per_vendor":{"avg":{"field":"tip_amount"}}}},"pickup_location_grid":{"geohash_grid":{"field":"pickup_location","precision":5},"aggs":{"avg_tip_per_location":{"avg":{"field":"tip_amount"}}}}} }

phase-1-test-2_final

Summary:

As we go right, query is getting expensive(latency wise) and also response size is increasing.
POC disk only is disk tier cache and no in-memory support.
We also purged caches continuously via a background cron to compare latencies without disk cache involved. This gave us an upper bound of latencies.

Phase-2

Goal: Try to simulate a real workload and benchmark OpenSearch with and without tiered cache support.

Workload: We indexed nyc_taxis dataset and used nyc_taxis related queries. We picked up 6 nyc_taxis query, randomized its value to create cache hits/misses/evictions. We created a custom script(), which takes these 6 queries, runs them in a loop(for 2000 times with randomized values) in a multithreaded fashion.

Setup:

Single node domain with r5.2xlarge instance type.
In-memory cache limited to 1mb to keep it simple.
Disk tier cache of 100mb was added. (Whole workload generated less than 100mb of cache in an hour or so)
Read only workload
1 nyc_taxis shard

tiered_Caching_phase2_graphs

- p99 and p100 were closer consider those represented cache misses latencies.

Summary:

Total cache size ~12mb with tiered support. No evictions seen.
Duration of workload run was ~1hour including warmup iterations.

msfroh commented 1 year ago

For the Phase 2 tests, given that it's just running 5 queries, wouldn't any setup that allocates enough cache to accommodate those 5 queries (with no updates being processed) yield a significant improvement?

I'm a little concerned that increasing the cache to hold all the queries in the benchmark makes the benchmark less useful as a measure of code performance, since the code stops being used -- results are just served from the cache.

sgup432 commented 1 year ago

@msfroh

For the Phase 2 tests, given that it's just running 5 queries...

It is not just 5(actually 6) queries per se. We used 6 queries "template", randomized its field values and ran those in a loop(~4000 times) using 6 parallel clients(total ~72k queries) . So number of unique queries were much larger.

I'm a little concerned that increasing the cache to hold all the queries in the benchmark makes the benchmark less useful.

Tiered caching usefulness is to provide a much larger cache and for use cases where we are seeing some repeatable queries coming in(or a decent cache hit ratio). Currently in-memory cache size is limited to amount of memory available on node and it needs to evict items. So if we had unlimited memory available, we could have just used that. Plus due to our randomization logic, it might happen that query 1 which is sitting in cache is never called again.

This benchmark tries to mimic((with a smaller in-memory cache size for simplicity) a real workload where a domain has maybe for example 50% cache hit ratio(with existing in-memory cache as seen from requestCache node stats) and see performance by adding a disk tier support with the same workload. As with tiered support, we can store more queries, effectively increasing hit ratio and seeing gains.

Tiered caching might not make sense for cases where there is hardly any cache hit ratio or no evictions seen from existing in-memory cache.(As called out above).

Also regarding testing with updates, invalidation happens effectively during refreshes. RequestCache uses lucene IndexReader.CacheKey as one of the keys, so if you hit same query again after refresh, different key is generated/stored and making older key/value stale. So we can also replicate this scenario by randomizing query values to generate different keys.

Sample request stats from our experiments. Running same workload against below 2 use cases: With in-memory cache only (1mb size):

"request_cache": {
        "memory_size_in_bytes": 1046380,
        "evictions": 21231,
        "hit_count": 51137, // 70% cache hit ratio
        "miss_count": 22063
},

With additional disk tier support

"request_cache": {
          "total_memory_size_in_bytes": 11899573,
          "total_hit_count": 53559,
          "os_cache_memory_size_in_bytes": 1048555,
          "disk_cache_memory_size_in_bytes": 10851018,
          "total_miss_count": 10652, ~50% less
          "total_hit_count":62548 // 85% hit ratio
},

msfroh commented 1 year ago

Thanks @sgup432 for the added details!

I do still worry that any benchmarks will have difficulty measuring the added benefit of increased caching, since it heavily depends on the workload executed. Regardless of how many queries are in the benchmark, there will be some cache size that will fit them all, at which point the cache hit ratio is just a function of how many times the queries are executed (since they'll all get cached after the first run).

I'm definitely not trying to argue against tiered caching -- I think it's a neat idea to increase the size of the query cache. (I'm skeptical of adding a remote cache tier unless the index is readonly. That said, if the index is readonly, I think there could be some neat opportunities to precompute caches for frequently co-occurring clauses based on historical queries.) I am trying to say that the specific numbers from benchmarks probably can't be trusted (except to give a theoretical upper bound on potential speedup), though.

sgup432 commented 1 year ago

@msfroh

I do still worry that any benchmarks will have difficulty measuring the added benefit of increased caching, since it heavily depends on the workload executed.

Agree that it is dependent on the kind of workload being executed. We just used whatever we had i.e. picking up standard nyc_taxis queries, randomizing its values, turning on request cache(it is disabled by default) and run those queries in a loop to create a workload. It gives a good idea around the benefits we might see.

In a real scenario, a domain might see queries getting cached in the first run/miss but never getting a cache hit after that. It can happen either due to invalidation happening(on refresh) or user never calls the same query again. And that controls the cache hit ratio. In our benchmark, we tried to replicate this scenario through extent of randomization(increase/decrease the range) of query field values. Where increasing randomization takes us towards the worst case scenario where we generate a lot of unique queries(which doesn't get a hit after first run), and decreasing randomization takes us towards best scenario where we see a very high cache ratio. I was trying to replicate different cache hit ratios(by tweaking randomization of query field values) and seeing benefits with tiered caching. Above benchmark was one of them. But I am open to suggestions if you have better ideas.

The benefits of tiered caching are also dependent on the query took time. For workloads with very low query latencies(like simple term queries for example), it might not make sense to cache these on disk or use tiered caching in general as it might make performance worse. To handle this, we are taking QueryCost into consideration(as discussed in above design) so that we don't cache these kind of fast queries on disk. Also phase-1 results gives a good overview of the cost of using disk cache based latencies.

I'm skeptical of adding a remote cache tier unless the index is readonly.

Yeah, remote cache tier use case needs more thinking but in general it might be useful for readonly as you mentioned or cases where users have a much larger cache dataset which can be stored in some cheap remote store instead on local disk taking into consideration the added network latency.

For reference: This is the script we used to generate the workload.

sgup432 commented 1 year ago

@sohami @jainankitk @backslasht @ketanv3 @Bukhtawar

sandervandegeijn commented 8 months ago

Just read through this, impressive results. Might not be an issue, but I have run into issues with the searchable snapshots cache mechanism: https://github.com/opensearch-project/OpenSearch/issues/11676

Is the implementation for this done is such a manner that a disk that's filling up can never take down the cluster? :)

ansjcy commented 8 months ago

Interesting proposal and POC results! I'm really looking forward to seeing the improvements this feature can bring to OpenSearch :) One small question regarding the query cost estimation:

Query cost: This can be for example time taken by the query when it was loaded from disk during cache miss. X(configurable) is used as a threshold to decide whether it makes sense to store the query result in lower tier based cache or not

Do we have any high level ideas on how those estimation metrics/data will be collected? Do you think this can be done as part of Query Insights plugin? In which case we can reuse those data to present on the dashboard as well. Also does it make sense to automatically adjust this X value based on certain rules from our domain knowledge, and send this as "recommendations" to the customer (maybe a future improvement)?

kiranprakash154 commented 8 months ago

Hey @sgup432, are we on track for this to be released in 2.12 ?

sgup432 commented 8 months ago

@sandervandegeijn

Is the implementation for this done is such a manner that a disk that's filling up can never take down the cluster? :)

Yes, we are going to have a cap on disk cache size which would be kept configurable so that we don't run into such issues.

sandervandegeijn commented 8 months ago

Great! This will work together with the high watermark feature that's monitoring disk usage of the nodes?

Will the cache be implemented on the coordinating nodes or on all nodes?

sgup432 commented 8 months ago

Great! This will work together with the high watermark feature that's monitoring disk usage of the nodes?

We are going to take that into consideration ie not cache items to disk when disk utilization on that node is running low.

Will the cache be implemented on the coordinating nodes or on all nodes?

This proposal is meant to extend existing caches we have by adding a disk tier. Like Request cache, Query cache. All these caches doesn't work on coordinator node level but instead on a shard level lying on a date node.

sandervandegeijn commented 8 months ago

Great thanks!

opensearch-project / OpenSearch