sohami opened this issue 1 year ago
@reta @anasalkouz @andrross Would like to get your feedback on this.
> Don’t force merge the segments to 1 as done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs and cluster setups. For example, for `nyc_taxis` we can merge to 20 segments across different setups.
@sohami Is this correct? The `nyc_taxis` workload does perform a force merge, but the max number of segments is unset by default, which I believe means that it will merge based on whatever the merge policy is. Is that right? Regardless, the point still stands that we should control the number of segments during testing, as it will likely have a big impact on the results.
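For reference, pinning the segment count can be done with the force merge API; a sketch (the host, index name, and the count of 20 are illustrative, taken from the suggestion above):

```shell
# Force merge the benchmark index to a fixed segment count before the
# search phase, so runs are comparable across cluster setups.
curl -XPOST "http://localhost:9200/nyc_taxis/_forcemerge?max_num_segments=20"

# Verify the resulting per-shard segment counts.
curl -XGET "http://localhost:9200/_cat/segments/nyc_taxis?v"
```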
> Shard size: ranging between 20-50GB max, since that's what we recommend.
@sohami Digging a bit into this, AWS recommends 10-30GB for search workloads and 30-50GB for a logs workload. Similarly, Opster gives general guidance of 30-50GB. Given this, I would suggest using the `http_logs` workload to approximate a logs use case and running with shard sizes approaching 50GB. Then maybe use the `so` workload as an approximation of a search workload with a somewhat smaller shard size (~20-30GB). What do you think?
> Don’t force merge the segments to 1 as done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs and cluster setups. For example, for `nyc_taxis` we can merge to 20 segments across different setups.
>
> @sohami Is this correct? The `nyc_taxis` workload does perform a force merge, but the max number of segments is unset by default, which I believe means that it will merge based on whatever the merge policy is. Is that right? Regardless, the point still stands that we should control the number of segments during testing, as it will likely have a big impact on the results.
Thanks, reworded; I didn't mean to say that the benchmark performs a force merge to 1 by default.
> Digging a bit into this, AWS recommends 10-30GB for search workloads and 30-50GB for a logs workload. Similarly, Opster gives general guidance of 30-50GB. Given this, I would suggest using the `http_logs` workload to approximate a logs use case and running with shard sizes approaching 50GB. Then maybe use the `so` workload as an approximation of a search workload with a somewhat smaller shard size (~20-30GB). What do you think?
Yes, that is the idea. I was planning to use `nyc_taxis` and `http_logs` instead of `so`, given those are part of the nightly benchmarks as well, and `nyc_taxis` will cover the search use case. `so` I think is used for the indexing workload. `http_logs` has a custom data generator, so I was planning to use that workload first with a couple of different shard sizes. It also has similar query types as `nyc_taxis`, like range/aggregations, and a few others like term match, etc. After that I will use `nyc_taxis`, which will represent the search workload; its data set size is ~23GB, which is fine for a search workload.
> `so` I think is used for the indexing workload.
@sohami Yes, of course you're right. The `so` workload doesn't even define any query-related operations :)

My main concern with `nyc_taxis` is that each document is basically a collection of dates, numbers, and lat/lon pairs (example doc), so I don't think it is super representative of a search workload. Do you think that is a problem?
@sohami Very curious whether concurrent search will help the `http_logs` workload, especially `search_after` and `sort` queries, and even `range` queries. Those queries use BKD-based document-skipping logic and can skip entire segments based on the results collected from earlier segments. So with concurrent search enabled, let's see whether it actually ends up consuming more CPU as well as adding latency.
Thanks @sohami
> Note: The improvement at the per-shard level may not be the actual latency improvement perceived by the user. In the real world, each top-level search request usually touches multiple shards.
I 100% agree with you here: the single-shard scenario is highly unrealistic, and I would suggest excluding it from the benchmarking (it could be useful for troubleshooting, for example, but this workload over such a configuration is not representative).
> Change the thread count of the search thread pool to a) numProcessors, b) numProcessors/2, instead of the default 1.5x processor count. This will be done on instances such as 2xl and 8xl.
Do you mean the `index_searcher` pool?
> @sohami Very curious whether concurrent search will help the `http_logs` workload, especially `search_after` & `sort` queries, and even `range` queries.
I think `http_logs` is mentioned in Test Dimensions. The `pmc` workload is also very useful, since it runs a number of queries and aggregations; we may consider it.
> `so` I think is used for the indexing workload.
>
> @sohami Yes, of course you're right. The `so` workload doesn't even define any query-related operations :)
>
> My main concern with `nyc_taxis` is that each document is basically a collection of dates, numbers, and lat/lon pairs (example doc), so I don't think it is super representative of a search workload. Do you think that is a problem?
The main goal is to exercise different query types (like range/term/aggregations) and use the workloads available in OSB. These operations should be common across search and log-analytics use cases. Since `nyc_taxis` and `http_logs` are the workloads we are using for baseline performance for each release, I think we should be fine to use these.
> Note: The improvement at the per-shard level may not be the actual latency improvement perceived by the user. In the real world, each top-level search request usually touches multiple shards.
>
> I 100% agree with you here: the single-shard scenario is highly unrealistic, and I would suggest excluding it from the benchmarking (it could be useful for troubleshooting, for example, but this workload over such a configuration is not representative).
@reta What I mean by this is that with a single-shard setup we will get the best possible improvement from concurrent execution, so it is important to understand that and observe the behavior in the benchmark. Also, it is not entirely unrealistic, as for some search use cases users assign 1 shard per node in their setup.
To understand the behavior with multiple shards being searched on a node, we will have multiple clients sending search requests to a single shard, say with CPU utilization at 50%. We can then run the same workload with concurrent search enabled and observe the behavior. Having said that, we will also run single-client, multiple-shards-per-node scenarios, searching across all the shards. The expectation is that with more than 5 shards per node the latency improvement should multiply, since in that case each search request makes multiple round trips.
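A sketch of how such a run might be driven with OSB against an existing cluster; the CLI flags are standard opensearch-benchmark options, but the `search_clients` workload parameter name is an assumption and varies per workload definition:

```shell
# Vary the number of search clients against a fixed single-shard index.
# --workload-params names depend on the workload (assumption here).
opensearch-benchmark execute-test \
  --workload=http_logs \
  --pipeline=benchmark-only \
  --target-hosts=localhost:9200 \
  --workload-params="search_clients:4"
```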
> Change the thread count of the search thread pool to a) numProcessors, b) numProcessors/2, instead of the default 1.5x processor count. This will be done on instances such as 2xl and 8xl.
>
> Do you mean the `index_searcher` pool?
No, I meant the search pool. Currently the search thread pool is set to 1.5x the processor count, and if all the threads are busy it ends up consuming all the available cores, reaching ~100% CPU utilization. I want to see how the system behaves with concurrent search enabled if we vary the search pool size relative to the default setup.
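For concreteness, the search pool size is a static node setting; a sketch for a 16-vCPU (2xl-class) node showing the two experimental variants described above (these are experiment knobs, not recommendations):

```yaml
# opensearch.yml — static setting, requires a node restart.
# Default is roughly 1.5x the allocated processor count.
thread_pool.search.size: 16   # variant (a): numProcessors
# thread_pool.search.size: 8  # variant (b): numProcessors / 2
```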
> I think `http_logs` is mentioned in Test Dimensions. The `pmc` workload is also very useful, since it runs a number of queries and aggregations; we may consider it.
Will take a look at it
Thanks @sohami
> No, I meant the search pool.
:+1: I think we should definitely include `index_searcher` pool sizing as well, since that is the pool the index searcher will use.
We enabled this on our SIEM cluster while trying to improve performance, see also https://github.com/opensearch-project/OpenSearch/issues/10859.

We saw throughput rise by ±45% (400 -> 580 MB/s) and haven't encountered any adverse effects as far as I can tell. Response times in Dashboards are notably faster (no formal numbers, just a user's experience, and I'm sensitive to lag :) ).

I don't have end-to-end performance traces, but from experience I suspect I could extract more performance from a single search node if I could increase the amount of parallelism against our S3 solution.
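For anyone wanting to reproduce this, concurrent segment search is toggled via a dynamic cluster setting in recent OpenSearch 2.x versions; a sketch (the host is a placeholder, and the exact setting name should be checked against your version's documentation):

```shell
curl -XPUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.concurrent_segment_search.enabled": true
  }
}'
```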
This issue captures the different performance benchmarks we plan to run as part of evaluating concurrent segment search, which is tracked on the project board here. We will gather feedback from the community to see if we are missing anything that needs to be considered.
Objective:
Benchmark Plan:
Overview
For concurrent segment search, concurrency is introduced at the shard-level request, within the query phase. To get the baseline improvement we can use a setup with a single-shard index (of varying shard sizes). By varying the number of search clients sending requests to this single shard we can achieve both of the below, as each request is independent of the others.
Note: The improvement at the per-shard level may not be the actual latency improvement perceived by the user. In the real world, each top-level search request usually touches multiple shards. The wider the query, the better the overall improvement will be, as there can be multiple round trips (based on the 5-shards-per-node limit) at the request level. So E2E latency will show better results compared to per-shard results. This can be measured by benchmarking with multiple single-shard indices on a node (like 5/10/15 shards), or by extrapolating the baseline number obtained from the single-shard case.
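The round-trip arithmetic in the note above can be sketched as follows; the batch size of 5 mirrors the per-node shard request limit mentioned there, and the helper name is ours:

```python
import math

def search_round_trips(shards_per_node: int, batch_size: int = 5) -> int:
    """Sequential batches of shard-level requests a node serves for one
    top-level search, assuming a fixed per-node batch limit."""
    return math.ceil(shards_per_node / batch_size)

# 5/10/15 single-shard indices per node -> 1, 2, and 3 batches respectively,
# so a per-shard latency win compounds in the E2E latency.
print([search_round_trips(n) for n in (5, 10, 15)])  # [1, 2, 3]
```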
Things to consider during benchmark setup:
Don’t force merge the segments to 1 as done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs and cluster setups. For example, for `nyc_taxis` we can merge to 20 segments across different setups.

Test Dimensions
Performance Test Scenarios:
`search_after` and `sort` queries with concurrent search.

Queries with the `terminate_after` clause, compared against the concurrent-search-disabled scenario. We expect `terminate_after` to perform more work with concurrent search, so we should see if there is any regression for such cases.
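An illustrative `terminate_after` request for such a comparison (the index and field come from the `nyc_taxis` workload; the host and values are arbitrary):

```shell
# Each shard stops collecting after 1000 matching docs; with concurrent
# search, per-segment slices may each do work up to the limit before the
# results are reduced, which is the suspected source of extra work.
curl -XGET "http://localhost:9200/nyc_taxis/_search" \
  -H 'Content-Type: application/json' -d'
{
  "terminate_after": 1000,
  "query": { "range": { "total_amount": { "gte": 5 } } }
}'
```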