Open sgup432 opened 1 year ago
We performed some POCs with disk tier based cache. Will add numbers here.
As part of performance testing, we divided this into two phases. As part of phase-1 we wanted to benchmark pure disk cache based latency vs existing in-memory cache latency. As part of phase-2, we tried to simulate a real scenario by adding a disk tier support to existing cache and running nyc_taxis like workload.
Goal: Benchmarking pure disk based cache latencies against OpenSearch in-memory cache.
Workload: We indexed nyc_taxis dataset and used nyc_taxis related queries. We used a variation of nyc_taxis query and ran it multiple times(resulting in cache hits except for first one) to benchmark latencies.
Setup:
In test-1, we used a query and changed(increased) its range to make it more computationally expensive but keep the response size fixed to compare disk tier cache latencies vs in-memory cache.
Query was range filter query with aggregating avg and sum on 2 fields. Aggregation buckets not increasing as we increased the range below.
_pickupdatetime was changed in below query
{"size":0,"query":{"bool":{"filter":[{"range":{"pickup_datetime":{"gte":"2015-01-01 00:00:00","lte":"2015-01-01 11:59:59"}}}],"must_not":[{"term":{"vendor_id":"Vendor XYZ"}}]}},"aggs":{"avg_surcharge":{"avg":{"field":"surcharge"}},"sum_total_amount":{"sum":{"field":"total_amount"}}}}
Jan 1 to Jan 1, Jan 1 to Feb 1 etc represents range values we used for the query to make it more computationally expensive as can be seen from cache miss latency.
POC disk only was pure disk tier cache by plugging in Ehcache.
Conclusion:
As part of this, we also wanted to see impact of disk tier cache latencies when we increase the response size as we increase/expand the range of query.
{"size":0,"query":{"bool":{"filter":[{"range":{"pickup_datetime":{"gte":"2015-01-01 00:00:00","lte":"2015-02-01 11:59:59"}}}],"must_not":[{"term":{"vendor_id":"Vendor XYZ"}}]}},"aggs":{"avg_surcharge":{"avg":{"field":"surcharge"}},"sum_total_amount":{"sum":{"field":"total_amount"}},"vendor_id_terms":{"terms":{"field":"vendor_id","size":100},"aggs":{"avg_tip_per_vendor":{"avg":{"field":"tip_amount"}}}},"pickup_location_grid":{"geohash_grid":{"field":"pickup_location","precision":5},"aggs":{"avg_tip_per_location":{"avg":{"field":"tip_amount"}}}}} }
Summary:
Goal: Try to simulate a real workload and benchmark OpenSearch with and without tiered cache support.
Workload: We indexed nyc_taxis dataset and used nyc_taxis related queries. We picked up 6 nyc_taxis query, randomized its value to create cache hits/misses/evictions. We created a custom script(
Setup:
- p99 and p100 were closer consider those represented cache misses latencies.
Summary:
For the Phase 2 tests, given that it's just running 5 queries, wouldn't any setup that allocates enough cache to accommodate those 5 queries (with no updates being processed) yield a significant improvement?
I'm a little concerned that increasing the cache to hold all the queries in the benchmark makes the benchmark less useful as a measure of code performance, since the code stops being used -- results are just served from the cache.
@msfroh
For the Phase 2 tests, given that it's just running 5 queries...
It is not just 5(actually 6) queries per se. We used 6 queries "template", randomized its field values and ran those in a loop(~4000 times) using 6 parallel clients(total ~72k queries) . So number of unique queries were much larger.
I'm a little concerned that increasing the cache to hold all the queries in the benchmark makes the benchmark less useful.
Tiered caching usefulness is to provide a much larger cache and for use cases where we are seeing some repeatable queries coming in(or a decent cache hit ratio). Currently in-memory cache size is limited to amount of memory available on node and it needs to evict items. So if we had unlimited memory available, we could have just used that. Plus due to our randomization logic, it might happen that query 1 which is sitting in cache is never called again.
This benchmark tries to mimic((with a smaller in-memory cache size for simplicity) a real workload where a domain has maybe for example 50% cache hit ratio(with existing in-memory cache as seen from requestCache node stats) and see performance by adding a disk tier support with the same workload. As with tiered support, we can store more queries, effectively increasing hit ratio and seeing gains.
Tiered caching might not make sense for cases where there is hardly any cache hit ratio or no evictions seen from existing in-memory cache.(As called out above).
Also regarding testing with updates, invalidation happens effectively during refreshes. RequestCache uses lucene IndexReader.CacheKey as one of the keys, so if you hit same query again after refresh, different key is generated/stored and making older key/value stale. So we can also replicate this scenario by randomizing query values to generate different keys.
Sample request stats from our experiments. Running same workload against below 2 use cases: With in-memory cache only (1mb size):
"request_cache": {
"memory_size_in_bytes": 1046380,
"evictions": 21231,
"hit_count": 51137, // 70% cache hit ratio
"miss_count": 22063
},
With additional disk tier support
"request_cache": {
"total_memory_size_in_bytes": 11899573,
"total_hit_count": 53559,
"os_cache_memory_size_in_bytes": 1048555,
"disk_cache_memory_size_in_bytes": 10851018,
"total_miss_count": 10652, ~50% less
"total_hit_count":62548 // 85% hit ratio
},
Thanks @sgup432 for the added details!
I do still worry that any benchmarks will have difficulty measuring the added benefit of increased caching, since it heavily depends on the workload executed. Regardless of how many queries are in the benchmark, there will be some cache size that will fit them all, at which point the cache hit ratio is just a function of how many times the queries are executed (since they'll all get cached after the first run).
I'm definitely not trying to argue against tiered caching -- I think it's a neat idea to increase the size of the query cache. (I'm skeptical of adding a remote cache tier unless the index is readonly. That said, if the index is readonly, I think there could be some neat opportunities to precompute caches for frequently co-occurring clauses based on historical queries.) I am trying to say that the specific numbers from benchmarks probably can't be trusted (except to give a theoretical upper bound on potential speedup), though.
@msfroh
I do still worry that any benchmarks will have difficulty measuring the added benefit of increased caching, since it heavily depends on the workload executed.
Agree that it is dependent on the kind of workload being executed. We just used whatever we had i.e. picking up standard nyc_taxis queries, randomizing its values, turning on request cache(it is disabled by default) and run those queries in a loop to create a workload. It gives a good idea around the benefits we might see.
In a real scenario, a domain might see queries getting cached in the first run/miss but never getting a cache hit after that. It can happen either due to invalidation happening(on refresh) or user never calls the same query again. And that controls the cache hit ratio. In our benchmark, we tried to replicate this scenario through extent of randomization(increase/decrease the range) of query field values. Where increasing randomization takes us towards the worst case scenario where we generate a lot of unique queries(which doesn't get a hit after first run), and decreasing randomization takes us towards best scenario where we see a very high cache ratio. I was trying to replicate different cache hit ratios(by tweaking randomization of query field values) and seeing benefits with tiered caching. Above benchmark was one of them. But I am open to suggestions if you have better ideas.
The benefits of tiered caching are also dependent on the query took time. For workloads with very low query latencies(like simple term queries for example), it might not make sense to cache these on disk or use tiered caching in general as it might make performance worse. To handle this, we are taking QueryCost into consideration(as discussed in above design) so that we don't cache these kind of fast queries on disk. Also phase-1 results gives a good overview of the cost of using disk cache based latencies.
I'm skeptical of adding a remote cache tier unless the index is readonly.
Yeah, remote cache tier use case needs more thinking but in general it might be useful for readonly as you mentioned or cases where users have a much larger cache dataset which can be stored in some cheap remote store instead on local disk taking into consideration the added network latency.
For reference: This is the script we used to generate the workload.
@sohami @jainankitk @backslasht @ketanv3 @Bukhtawar
Just read through this, impressive results. Might not be an issue, but I have run into issues with the searchable snapshots cache mechanism: https://github.com/opensearch-project/OpenSearch/issues/11676
Is the implementation for this done is such a manner that a disk that's filling up can never take down the cluster? :)
Interesting proposal and POC results! I'm really looking forward to seeing the improvements this feature can bring to OpenSearch :) One small question regarding the query cost estimation:
Query cost: This can be for example time taken by the query when it was loaded from disk during cache miss. X(configurable) is used as a threshold to decide whether it makes sense to store the query result in lower tier based cache or not
Do we have any high level ideas on how those estimation metrics/data will be collected? Do you think this can be done as part of Query Insights plugin? In which case we can reuse those data to present on the dashboard as well. Also does it make sense to automatically adjust this X
value based on certain rules from our domain knowledge, and send this as "recommendations" to the customer (maybe a future improvement)?
Hey @sgup432, are we on track for this to be released in 2.12 ?
@sandervandegeijn
Is the implementation for this done is such a manner that a disk that's filling up can never take down the cluster? :)
Yes, we are going to have a cap on disk cache size which would be kept configurable so that we don't run into such issues.
Great! This will work together with the high watermark feature that's monitoring disk usage of the nodes?
Will the cache be implemented on the coordinating nodes or on all nodes?
Great! This will work together with the high watermark feature that's monitoring disk usage of the nodes?
We are going to take that into consideration ie not cache items to disk when disk utilization on that node is running low.
Will the cache be implemented on the coordinating nodes or on all nodes?
This proposal is meant to extend existing caches we have by adding a disk tier. Like Request cache, Query cache. All these caches doesn't work on coordinator node level but instead on a shard level lying on a date node.
Great thanks!
Related RFC - https://github.com/opensearch-project/OpenSearch/issues/9001
Co-Author: @jainankitk
Problem Statement
OpenSearch relies heavily on various caches to speed up the data retrieval process and thereby providing significant improvement in search latencies. This is usually done by caching query specific data in memory, avoiding multiple disk seeks and thereby saving resources(CPU/JVM) by not processing the query again. But cache size is limited by the amount of memory available on a node. In cases where we are dealing with larger datasets which can potentially be cached, this causes a lot of cache evictions/misses and potentially impacting performance as we need to process the query again causing high resource consumption.
Solution
Considering the limitations of in memory caches(as described in above section), there could be a significant performance gain if we consider a tiered cache solution. Tiered caching is basically a multi level cache with each tier having it’s own characteristics and performance levels. This will help us in adding more caching tiers so that we have the capability to cache a much larger dataset which is not limited by available memory.
Tiered caching options:
Use cases
Tiered caching option is not meant for every use case. Like for example, where customer is using non cacheable queries or have a search use case(does frequent updates). In such cases this option may not be feasible. We try to list down the use cases/scenarios in which this might be feasible.
Details around Existing caches
IndicesRequestCache: Caches the whole request on a shard level. Not all queries are cacheable and it is decided using this logic. Overall DFS type search queries, range queries with now() date, request with size >0 are not cacheable.
IndicesQueryCache: This caches DocIDSet on a segment level for a shard. Only works for non scoring sub queries like filter queries. As it can cache sub filter queries, this can be used across queries. It utilizes some caching policy to decide whether it needs to cache the query or not.
High level design
Approach: Spillover evicted items to a lower tier cache
Below diagram takes disk based cache as an example. But can be extended for other tiers as well. Here OS in-memory cache refers to the existing OpenSearch caches we currently have like RequestCache/QueryCache.
Key components/features:
This component in below diagram describes the point where the key was not present in both in-memory or disk based cache(as an example) and the value needs to be loaded using default mechanism. In this case loading value means running the query and fetching results. Total cache miss: In case there is a total cache miss for a request i.e. result is not present in either in-memory or disk based cache, we will cache this request inside in-memory cache like the way it happens now. Evicted items will be spilled over to a lower tier based cache.
Cache eviction strategies
OpenSearch uses LRU based eviction policies for in-memory caches(Request/Query cache). LRU is a popular eviction policy due to its simplicity and provides decent cache hit rate in common scenarios. But it is known to be not optimal as it can evict items which might have been required in the near future due to occasional burst of other items.
There are other eviction policies like ARC(Adaptive Replacement Cache), Window TinyLfu etc which are proven to perform better than LRU. But for initial phases, we will stick to existing LRU based policy and eventually consider different policies in later phases.
Cache cleanup
Whenever a invalidation happens, stale keys are left hanging in the cache. Currently OpenSearch has a background job which runs every 1 minute to clean the stale keys lying in the cache. It keeps collecting stale keys by listening to IndexReader close events. We can use similar mechanism for lower tier cache as well.
Milestones
Milestone 1:
Milestone 2:
Milestone 3: