Benchmark results for checkPendingFlushUpdate=false (i.e. with the param disabled by default)
Dataset: nyc_taxis
Server configuration:
Server: 1 node r6g.2xlarge (32G heap, 64G RAM) with 3 master nodes r6g.2xlarge (32G heap)
Client: c6g.8xlarge, 16 client instances
Command:
opensearch-benchmark execute_test --pipeline=benchmark-only --target-hosts <HOST> --workload nyc_taxis --offline --results-format csv --results-file ~/nyc_default_false --client-options timeout:60,use_ssl:false,verify_certs:false
Glossary
Candidate -> checkPendingFlushUpdate=false
Baseline -> checkPendingFlushUpdate=true
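For context, the flag under test corresponds to Lucene's IndexWriterConfig#setCheckPendingFlushUpdate. A minimal sketch of what the candidate toggles at the Lucene level (the directory path, analyzer, and RAM buffer value here are illustrative, not taken from the benchmark setup):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CheckPendingFlushExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Baseline behaviour (true): indexing threads check for pending DWPT flushes
        // on every update and help write those segments to disk.
        // Candidate behaviour (false): only the flushing/refresh thread performs the
        // flush work, and indexing threads keep indexing.
        config.setCheckPendingFlushUpdate(false);
        // Flush thresholds (RAM buffer size / buffered docs) are unchanged; only
        // *who* performs the flush differs between baseline and candidate.
        config.setRAMBufferSizeMB(512);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/example-index")), config)) {
            // index documents as usual; this thread no longer helps with flushes
        }
    }
}
```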
Baseline CPU Profile:
Candidate CPU Profile:
Baseline JVM:
Candidate JVM:
Baseline CPU:
Candidate CPU:
Although tail latency (p99) has decreased and there is a slight improvement in indexing throughput, other metrics such as p90 and merge throttling time suffered.
What are your thoughts on the numbers and the overall idea of disabling the flush param, @mgodwan / @backslasht / @shwetathareja?
Thanks @CaptainDredge for the benchmark and the suggested tuning. I'm still going through the overall optimization and measurements. In the meantime, I have a few questions if you can help answer them:
write threads being blocked? I guess async-profiler/JFR should help in collecting this, if not already present.
Sure, I think I need to track a few more metrics during the benchmark run to address the above questions. Will try to add those and do 1-2 runs.
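One way to collect this (a sketch, assuming JFR is available on the server JVM; the event threshold, sleep duration, and output path are illustrative) is to record monitor-contention events around the run and then summarize how long threads were blocked:

```java
import java.nio.file.Path;
import java.time.Duration;

import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class LockProfileCapture {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            // Record lock contention on monitors; the threshold filters out short waits.
            recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(1));
            recording.start();

            // ... run or wait for the indexing workload here ...
            Thread.sleep(Duration.ofMinutes(5).toMillis());

            recording.stop();
            Path out = Path.of("/tmp/indexing-lock-profile.jfr");
            recording.dump(out);

            // Quick summary: which threads were blocked and for how long.
            for (RecordedEvent event : RecordingFile.readAllEvents(out)) {
                if ("jdk.JavaMonitorEnter".equals(event.getEventType().getName())) {
                    System.out.println(event.getThread().getJavaName()
                            + " blocked for " + event.getDuration().toMillis() + " ms");
                }
            }
        }
    }
}
```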
A few questions on top of what @mgodwan asked:
Redid the benchmarks with additional metric tracking and got much more consistent results.
Benchmark results for checkPendingFlushUpdate=false (i.e. with the param disabled by default)
Dataset: nyc_taxis
Server configuration:
Server: 1 node r6g.2xlarge (32G heap, 64G RAM) with 3 master nodes r6g.2xlarge (32G heap)
Client: c6g.8xlarge, 5 client instances
Command:
opensearch-benchmark execute_test --pipeline=benchmark-only --target-hosts <HOST> --workload nyc_taxis --offline --results-format csv --results-file ~/nyc_default_false --client-options timeout:60,use_ssl:false,verify_certs:false
Glossary
Candidate -> checkPendingFlushUpdate=false
Baseline -> checkPendingFlushUpdate=true
CPU Profile comparison [A -> Candidate, B -> Baseline]
The doFlush and preUpdate functions, where threads check for pending flushes, are called more frequently in the candidate.
Refreshes are higher in the baseline due to the higher number of segments in the baseline.
Macro CPU:
JVM Comparison:
Lock profiles:
Candidate:
Baseline:
Flush metrics: The number of flushes is higher in the baseline because writer threads check for flush on each update, so flushes happen more frequently; however, total flush time is higher in the candidate.
The number of segments is lower in the candidate, which is expected since flushes occur less frequently.
The index writer holds segments in memory a little longer, but the memory occupied is not too concerning (<= 1 GB).
There is a 20% improvement in throughput, a 10% improvement in average latency, and an 80% improvement in tail latency (p99).
What are your thoughts on the numbers and the overall idea of disabling the flush param now, @mgodwan / @backslasht / @shwetathareja?
Next steps: I'm currently benchmarking with a custom workload that tests parallel indexing and search performance. I expect search performance to be much better in this case while indexing is going on.
Thanks @CaptainDredge for the numbers. Looking forward to seeing how search is impacted as well.
Can you highlight how checkPendingFlushUpdate comes into the picture in creating segments, i.e. the thresholds and the frequency at which it operates?
Thanks for the benchmarks @CaptainDredge, the numbers definitely look encouraging. It's a good surprise to see an overall throughput improvement as well. We should understand better whether the overall throughput improvement is coming from fewer segments, and in turn fewer merges, combined with lower latencies.
A few more questions to understand the results better:
Also, on the trade-off: in instances where the disk is slow and flushes are impacted (and can get piled up), what will happen? Currently, write threads help to catch up. At some point will we need to throttle the indexing?
Also, on the trade-off: in instances where the disk is slow and flushes are impacted (and can get piled up), what will happen? Currently, write threads help to catch up. At some point will we need to throttle the indexing?
I believe IndexingMemoryController should kick in to throttle the writes in such a case. https://github.com/opensearch-project/OpenSearch/blob/b4306b624e5546dd102a4ed45470564b31d5f3a0/server/src/main/java/org/opensearch/indices/IndexingMemoryController.java#L434-L436
That said, it will be helpful to see the benchmark behavior this change brings in when the resource under contention is disk IOPS/throughput.
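For readers following along, the behaviour of the linked code can be summarized roughly as below. This is a simplified sketch, not the actual OpenSearch implementation; ShardState is a hypothetical stand-in for the per-shard accounting OpenSearch keeps internally, and the 1.5x factor follows the description given later in this issue.

```java
import java.util.Comparator;
import java.util.List;

class ThrottlingSketch {
    static class ShardState {
        long bytesUsed;     // current indexing-buffer bytes held by this shard
        boolean throttled;  // when true, the shard is limited to one indexing thread
    }

    void maybeThrottle(List<ShardState> shards, long indexingBudgetBytes) {
        long totalBytes = shards.stream().mapToLong(s -> s.bytesUsed).sum();

        // Roughly: once usage exceeds ~1.5x the indexing budget, throttle the
        // largest shards first until usage drops back under the budget.
        if (totalBytes > indexingBudgetBytes * 3 / 2) {
            shards.sort(Comparator.comparingLong((ShardState s) -> s.bytesUsed).reversed());
            long remaining = totalBytes;
            for (ShardState shard : shards) {
                if (remaining <= indexingBudgetBytes) {
                    break;
                }
                shard.throttled = true;
                remaining -= shard.bytesUsed;
            }
        }
    }
}
```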
I benchmarked with a custom workload with a total uncompressed size of around 600 GB distributed equally among 3 indices. It indexes 50% of the document corpus initially. Document IDs are unique, so all index operations are append-only. Afterwards, ingestion occurs again with the remaining 50% of the document corpus, along with a couple of filter queries and aggregations which are run in parallel. I used a generous warmup period for ingestion (~30 mins) and a good number of warmup iterations (~100) for the search queries. No explicit refresh interval was set.
Baseline: checkPendingFlushOnUpdate = true
Candidate: checkPendingFlushOnUpdate = false
This includes one optimization which I identified through lock profiling, where we short-circuited a condition to prevent write threads from getting blocked just to check whether the size of the pending flushes queue is > 0. I opened an issue in Lucene for the same. This overall brought a 7% increase in throughput and a 6-7% latency improvement. Hereafter the results contain this optimization as well.
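Conceptually the short-circuit is small: check a cheap counter before contending on the monitor that guards the pending-flush queue. A sketch of the idea (not the actual Lucene patch; class and field names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

// Keep a lock-free count of pending flushes so indexing threads can bail out
// without contending on the queue's monitor when there is nothing to do.
class PendingFlushQueueSketch<T> {
    private final Deque<T> pending = new ArrayDeque<>();
    private final AtomicInteger pendingCount = new AtomicInteger();

    void add(T flushTicket) {
        synchronized (pending) {
            pending.addLast(flushTicket);
        }
        pendingCount.incrementAndGet();
    }

    T pollOrNull() {
        // Short-circuit: most of the time there is nothing pending, so indexing
        // threads return immediately instead of blocking on the monitor below.
        if (pendingCount.get() == 0) {
            return null;
        }
        synchronized (pending) {
            T ticket = pending.pollFirst();
            if (ticket != null) {
                pendingCount.decrementAndGet();
            }
            return ticket;
        }
    }
}
```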
Test Procedure: Initial index append, similar to nyc_taxis
Test Procedure: Concurrent index append, i.e. ingesting 5% of the doc corpus in parallel with search
Test Procedure: Filter queries running in parallel with ingestion
Test Procedure: Aggregation queries running in parallel with ingestion
Overall Flush metrics
Average Latency: left of separator -> Candidate, right of separator -> Baseline
Flush Sizes (MB): Flush sizes are higher in the candidate, which means segment sizes are larger
Baseline:
Candidate:
Flush Time (ms): Flush times are higher in the candidate since each flush now writes a larger segment to disk
Baseline:
Candidate:
The number of flushes completed over time is still the same in both baseline and candidate
Notable conclusions from the experiment
Answering a few important questions
Why are we seeing a significant improvement in throughput? Write threads are now free from doing flushes, so more CPU bandwidth is available for indexing, which explains the increase in indexing throughput. The search throughput increase is explained by a decrease in search latency due to larger segments, i.e. a single segment now contains more docs.
Why are the p50 latencies degraded in the candidate for indexing? The flush code is only triggered if flushing is lagging behind. I was not able to explain the p50 degradation through any of the tracked metrics and async profiles. I tried a couple of things, like reducing locking at certain places, but the latency improvements were not enough to beat the baseline. Currently, my hypothesis is that since larger segments are being built due to the higher indexing throughput, the buffer allocations for them are taking more time. I'm trying to validate this hypothesis.
Why should total flush time increase with the candidate? Larger segments are getting flushed to disk now, so it takes more time to write them. It's a classic latency vs. throughput trade-off, where we write larger chunks to disk for better throughput at the expense of increased latency.
Will this cause a flush backlog? This can definitely cause a pile-up of flushes because write threads are not helping with flushing now, but when the flush backlog exceeds certain memory thresholds, IndexingMemoryController or DWPT flush-control throttling will kick in (depending on whether the backlog is at the node or shard level), which will block writes so flushing can catch up.
Will this be detrimental to search freshness? Yes, since segments become available for searching after a comparatively longer time because flushing takes longer. Disabling write-thread participation entirely is a naive optimisation that we have to improve on. We should ideally start helping with flushing somewhere between when flushing starts and when throttling starts, to restore the old behaviour when flushing is slow.
Thanks @CaptainDredge for the detailed analysis.
I see that we're seeing consistent improvement in indexing throughput and tail latencies (as expected, since write threads can now work on actual indexing instead of helping the refresh operation with flushing segments), but at the same time, p50 latency seems to have degraded for indexing (as you've pointed out). Do we understand the cause behind it now? It may help to add some trace logging in the active indexing path (i.e. queuing + processing within Lucene up to IndexingChain#processDocuments) to understand this further. Please update when you're able to validate the hypothesis you've suggested as well.
The search throughput increase is explained by a decrease in search latency due to larger segments, i.e. a single segment now contains more docs
Is there any metric confirming this hypothesis? Did you track the segment sizes over time for your concurrent indexing-search workload?
Larger segments are getting flushed to disk now, so it takes more time to write them.
Doesn't it also reduce the number of segments written? Does the cumulative flush time reported by OSB include the latency for the flush operation to finish, which is now not helped by write threads, thereby reducing concurrency and leading to higher perceived latency?
This can definitely cause a pile-up of flushes because write threads are not helping with flushing
Will this cause an increase in the overall time it takes for documents to be visible for search as well? Is there a hybrid approach we can try where we let the refresh thread take care of the largest segments while write threads still continue to help with smaller segments, so that the additional work write threads are doing is reduced through some setting (compared to 0 in the current experiments)?
Also, since refreshes for shards of a single index on a node are serialized, if we plan to incorporate this change, we should look into better scheduling of refreshes across shards as well (while ensuring fairness). In case of multiple shards of the same index on a node, we may be able to reduce index-level lag by moving the concurrency to a higher level.
Do we understand the cause behind it now? It may help to add some trace logging in the active indexing path (i.e. queuing + processing within Lucene up to IndexingChain#processDocuments) to understand this further. Please update when you're able to validate the hypothesis you've suggested as well.
I did a per-request trace analysis to find the source of the p50 time differences we were seeing between candidate and baseline. I divided the indexing code path into 3 mutually exclusive parts: 1) IndexIntoLucene (the function which delegates indexing to the Lucene index writer), 2) the translog#Add operation, and 3) MarkSeqNoAsProcessed. Initially the suspects were the translog add and mark-seqno-as-processed operations, because higher locking time was being observed in the collected JFR profiles, but after adding log tracing on a per-request basis their contribution to the added p50 latency was minimal, although we still see them consuming a little more time in the candidate, i.e. the pendingFlushCheck-disabled case. Here are the per-request comparison graphs for these 3 parts:
ShardBulkAction time distribution:
ShardBulkAction p50 latency: baseline=243440810.0, candidate=259049053.0, % diff=6.41
IndexIntoLucene time distribution:
IndexIntoLucene p50 latency: baseline=243440810.0, candidate=259049053.0, % diff=10.58
Translog#Add time distribution:
Translog#Add p50 latency: baseline=9797628.0, candidate=10864336.5, % diff=10.89
MarkSeqNoAsProcessed time distribution:
MarkSeqNoAsProcessed p50 latency: baseline=2603495.0, candidate=2651995.0, % diff=1.86
From the above charts, it is mainly the Lucene writer that is taking more time (~10%) to complete a bulk indexing request, which explains the p50 latency differences. It's not currently feasible to add request-level tracing inside Lucene to further explore which operations at the index-writer level are taking more time.
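For reference, the per-request breakdown compared above can be collected with timing instrumentation roughly along these lines (a minimal sketch; the method names and logging are illustrative rather than OpenSearch's actual trace framework):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing wrapper around the three phases of the shard bulk path that
// were compared above: Lucene indexing, translog add, and marking the seqNo processed.
class IndexingPhaseTimer {
    void onShardBulkRequest(Runnable indexIntoLucene, Runnable translogAdd, Runnable markSeqNoProcessed) {
        long t0 = System.nanoTime();
        indexIntoLucene.run();
        long t1 = System.nanoTime();
        translogAdd.run();
        long t2 = System.nanoTime();
        markSeqNoProcessed.run();
        long t3 = System.nanoTime();

        // Emit one trace line per request; p50/p90 can then be computed offline by
        // aggregating these logs per run (baseline vs candidate).
        System.out.printf("indexIntoLucene=%dus translogAdd=%dus markSeqNo=%dus total=%dus%n",
                TimeUnit.NANOSECONDS.toMicros(t1 - t0),
                TimeUnit.NANOSECONDS.toMicros(t2 - t1),
                TimeUnit.NANOSECONDS.toMicros(t3 - t2),
                TimeUnit.NANOSECONDS.toMicros(t3 - t0));
    }
}
```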
A few other interesting points to note:
The above findings are for the custom workload we had, which does concurrent indexing and search. I experimented further with a lower refresh interval (1s) and found that this gives us the expected throughput improvements with improved tail latency and similar p50 latencies.
This should be helpful in many cases, so I've raised a PR to expose this setting: https://github.com/opensearch-project/OpenSearch/pull/12710
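A dynamic index setting along these lines could back such a toggle (a sketch only; the real setting key, default, and wiring live in the linked PR, and the key used below is hypothetical):

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Hypothetical dynamic index setting for toggling the Lucene flag; the actual name
// and default are defined in the linked PR, not here.
public final class CheckPendingFlushSetting {
    public static final Setting<Boolean> INDEX_CHECK_PENDING_FLUSH_ENABLED =
            Setting.boolSetting(
                    "index.check_pending_flush.enabled",   // hypothetical key
                    true,                                   // keep Lucene's default behaviour
                    Property.IndexScope,
                    Property.Dynamic);                      // dynamic, so it can be flipped on a live index
}
```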
Addressing concerns around refresh lags:
Will this cause an increase in the overall time it takes for documents to be visible for search as well? Is there a hybrid approach we can try where we let the refresh thread take care of the largest segments while write threads still continue to help with smaller segments, so that the additional work write threads are doing is reduced through some setting (compared to 0 in the current experiments)?
I traced the async refresh action to find out if we see added lag in refreshes in the candidate case, and I found it to be similar in both cases.
Refresh p50 lag: baseline=31196.0, candidate=28931.0, % diff=-7.26
cc: @mgodwan
Is your feature request related to a problem? Please describe
OpenSearch triggers a full flush on Lucene based on the translog size, or sometimes when a refresh is triggered, or when the flush API is called.
Sometimes indexing threads end up flushing segments to disk when a full flush has been triggered. This incurs latency overhead and reduces throughput for the same number of clients.
Looking deeper, it appears that indexing threads check for pending flushes on update in order to help flush indexing buffers to disk. We can turn this off by disabling the checkPendingFlushUpdate parameter in the index writer config.
Also, IndexingMemoryController reduces the total number of indexing threads per shard to 1 in order to throttle indexing when memory usage breaches 1.5 times the indexing buffer. It starts limiting the indexing rate with the shard that has the largest memory accounting.
Describe the solution you'd like
We should ideally keep checkPendingFlushUpdate disabled, but we do want the indexing threads to start helping with flushing somewhere between when flushing starts and when throttling starts, to restore the old behaviour when flushing is slow. checkForPendingFlush is part of LiveIndexWriterConfig, which means we can modify it on a live index writer instance. We can follow the existing procedure used for throttling in IndexingMemoryController to enable this flag: start with the shard that has the largest memory accounting and disable it again once memory is below the threshold.
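A rough sketch of that idea, relying on checkPendingFlushUpdate being a live config property (the surrounding hook and threshold are hypothetical; only the Lucene accessors are taken from the library):

```java
import org.apache.lucene.index.IndexWriter;

// Sketch of the proposed hybrid behaviour: keep checkPendingFlushUpdate off by
// default, and flip it back on for the most memory-heavy shards once flushing
// falls behind, before IndexingMemoryController throttling would kick in.
class PendingFlushToggleSketch {

    // Hypothetical hook, invoked periodically with the shard's writer and the
    // memory accounting that IndexingMemoryController already maintains.
    void adjust(IndexWriter writer, long shardIndexingBytes, long helpThresholdBytes) {
        boolean helpWithFlush = shardIndexingBytes > helpThresholdBytes;

        // checkPendingFlushUpdate is part of LiveIndexWriterConfig, so it can be
        // changed on a live writer without reopening the index.
        if (writer.getConfig().isCheckPendingFlushOnUpdate() != helpWithFlush) {
            writer.getConfig().setCheckPendingFlushUpdate(helpWithFlush);
        }
    }
}
```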
Related component
Indexing:Performance
Describe alternatives you've considered
No response
Additional context
No response