opensearch-project / OpenSearch


[Concurrent Segment Search][META] Performance benchmark plan #9049

Open sohami opened 1 year ago

sohami commented 1 year ago

This issue captures the different performance benchmarks we plan to run as part of evaluating concurrent segment search, which is tracked on the project board here. We will gather feedback from the community to see whether we are missing anything that needs to be considered.

Objective:

Benchmark Plan:

Overview

For concurrent segment search, concurrency is introduced at the shard-level request, within the query phase. To get the baseline improvement we can use a setup with a single-shard index (of varying shard sizes). By varying the number of search clients sending requests to this single shard we can achieve both of the below, as each request is independent of the others.

Note: The improvement at the per-shard level may not be the latency improvement actually perceived by the user. In the real world, each top-level search request usually touches multiple shards. The wider the query, the better the overall improvement, as there can be multiple round trips at the request level (based on the 5-shards-per-node limit). So E2E latency will show better results compared to the per-shard result. This can be measured by benchmarking multiple single-shard indices on a node (e.g., 5/10/15 shards) or by extrapolating the baseline numbers obtained from the single-shard case.
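As a rough sketch, the single-shard baseline run described above could be driven through OSB along the lines below; the workload parameter names (number_of_shards, number_of_replicas, search_clients) are assumptions and depend on what the chosen workload actually exposes.

```bash
# Single-shard baseline against an existing cluster; sweep search_clients
# (e.g. 1/2/4/8) across runs to vary the number of independent requests.
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=nyc_taxis \
  --target-hosts=localhost:9200 \
  --workload-params="number_of_shards:1,number_of_replicas:0,search_clients:8"
```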

Things to consider during benchmark setup:

Test Dimensions

Performance Test Scenarios:

sohami commented 1 year ago

@reta @anasalkouz @andrross Would like to get your feedback on this.

andrross commented 1 year ago

Don't force merge the segments to 1 as is done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs or different cluster setups. For example: for nyc_taxis we can merge to 20 segments across different setups.

@sohami Is this correct? The nyc_taxis workload does perform a force merge, but the max number of segments is unset by default, which I believe means that it will merge based on whatever the merge policy is. Is that right? Regardless, the point still stands that we should control the number of segments during testing as it will likely have a big impact on the results.
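For reference, pinning the segment count after ingestion is a plain force merge to a fixed value; a minimal sketch using the 20-segment example above:

```bash
# Force merge to a fixed segment count so runs are comparable across setups
# (20 is just the example value from the comment above).
curl -X POST "localhost:9200/nyc_taxis/_forcemerge?max_num_segments=20"

# Confirm the resulting per-shard segment counts.
curl -X GET "localhost:9200/_cat/segments/nyc_taxis?v"
```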

andrross commented 1 year ago

Shard size: ranging between 20-50GB max, since that's what we recommend

@sohami Digging a bit into this, AWS recommends 10-30GB for search workloads and 30-50GB for a logs workload. Similarly, Opster gives a general guidance of 30-50GB. Given this, I would suggest using the http_logs workload to approximate a logs use case, and run with shard sizes approaching 50GB. And then maybe use the so workload as an approximation of a search workload with a bit smaller shard size (~20-30GB). What do you think?
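As an aside, the resulting shard sizes can be double-checked with the _cat/shards API once ingestion finishes:

```bash
# List per-shard store sizes to confirm they land in the targeted range
# (~50GB for the logs-style runs, ~20-30GB for the search-style runs).
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,store"
```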

sohami commented 1 year ago

Don't force merge the segments to 1 as is done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs or different cluster setups. For example: for nyc_taxis we can merge to 20 segments across different setups.

@sohami Is this correct? The nyc_taxis workload does perform a force merge, but the max number of segments is unset by default, which I believe means that it will merge based on whatever the merge policy is. Is that right? Regardless, the point still stands that we should control the number of segments during testing as it will likely have a big impact on the results.

Thanks, reworded. I didn't mean to say that the benchmark performs a force merge to 1 segment by default.

Digging a bit into this, AWS recommends 10-30GB for search workloads and 30-50GB for a logs workload. Similarly, Opster gives a general guidance of 30-50GB. Given this, I would suggest using the http_logs workload to approximate a logs use case, and run with shard sizes approaching 50GB. And then maybe use the so workload as an approximation of a search workload with a bit smaller shard size (~20-30GB). What do you think?

Yes, that is the idea. I was planning to use nyc_taxis and http_logs instead of so, given that those are part of the nightly benchmarks as well and nyc_taxis will cover the search use case; so, I think, is used as an indexing workload. http_logs has a custom data generator, so I was planning to use that workload first with a couple of different shard sizes. It also has query types similar to nyc_taxis, like range/aggregations, plus a few others like term match, etc. After that I will use nyc_taxis, which will represent the search workload; its data set size is ~23GB, which is fine for a search workload.

andrross commented 1 year ago

so, I think, is used as an indexing workload.

@sohami Yes, of course you're right. The so workload doesn't even define any query-related operations :)

My main concern with nyc_taxis is that each document is basically a collection of dates, numbers, and lat/lon pairs (example doc) so I don't think it is super representative of a search workload. Do you think that is a problem?

gashutos commented 1 year ago

@sohami Very curious whether concurrent search will help the http_logs workload, especially search_after & sort queries, and even range queries. Those queries use BKD-based document-skipping logic and can skip entire segments based on the search results gathered from earlier segments. So in the case of concurrent search, let's see whether concurrent search actually ends up consuming more CPU as well as adding latency.
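For context, the class of query being discussed is roughly the sketch below (a sorted query with search_after); the field name and sort value are illustrative placeholders, not the actual http_logs mapping:

```bash
# Sorted query using search_after. With sequential execution, sort/BKD
# optimizations can skip later segments based on results already collected
# from earlier segments; the question above is how that interacts with
# searching segments concurrently.
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 100,
  "sort": [ { "timestamp": "desc" } ],
  "search_after": [ 1483228800000 ]
}'
```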

reta commented 1 year ago

Thanks @sohami

Note: The improvement at the per-shard level may not be the latency improvement actually perceived by the user. In the real world, each top-level search request usually touches multiple shards.

I 100% agree with you here - the single shard scenario is highly unrealistic; I would suggest excluding it from the benchmarking (it could be useful for troubleshooting, for example, but this workload over such a configuration is not representative).

Change the thread count of the search thread pool to a) numProcessors b) numProcessors/2, instead of the default 1.5x processor count. This will be done on instances such as 2xl and 8xl.

Do you mean the index searcher pool?

@sohami Very curious whether concurrent search will help the http_logs workload, especially search_after & sort queries, and even range queries.

I think http_logs is mentioned in Test Dimensions. The pmc workload is also very useful since it runs a number of queries and aggregations; we may consider it.

sohami commented 1 year ago

so, I think, is used as an indexing workload.

@sohami Yes, of course you're right. The so workload doesn't even define any query-related operations :)

My main concern with nyc_taxis is that each document is basically a collection of dates, numbers, and lat/lon pairs (example doc) so I don't think it is super representative of a search workload. Do you think that is a problem?

The main goal is to exercise different query types (like range/term/aggregations) and use the workloads available in OSB. These operations should be common across search/log analytics use cases. Since nyc_taxis and http_logs are the workloads we are using for baseline performance in each release, I think we should be fine using these.

sohami commented 1 year ago

Note: The improvement at the per-shard level may not be the latency improvement actually perceived by the user. In the real world, each top-level search request usually touches multiple shards.

I 100% agree with you here - the single shard scenario is highly unrealistic; I would suggest excluding it from the benchmarking (it could be useful for troubleshooting, for example, but this workload over such a configuration is not representative).

@reta What I mean is that with a single-shard setup we will get the best possible improvement from concurrent execution, so it is important to understand that and benchmark the behavior. Also, it is not entirely unrealistic: for some search use cases users assign 1 shard per node in their setup.

To understand the behavior with multiple shards being searched on a node, we will have multiple clients sending search requests to a single shard, with CPU utilization at, say, 50%. We can then run the same workload with concurrent search enabled and compare the behavior. Having said that, we will also run scenarios with a single client and multiple shards on a node, searching across all the shards. The expectation is that with more than 5 shards per node we should ideally see the latency improvement multiply, as in that case each search request makes multiple round trips.
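For anyone reproducing these comparisons, concurrent segment search is toggled per run via a dynamic cluster setting; the setting name below reflects the 2.x releases (earlier experimental builds also gate it behind a feature flag), so treat the exact name/version as an assumption:

```bash
# Enable concurrent segment search for the "concurrent" side of the comparison;
# set it back to false for the baseline run.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.concurrent_segment_search.enabled": true
  }
}'
```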

Change the thread count of the search thread pool to a) numProcessors b) numProcessors/2, instead of the default 1.5x processor count. This will be done on instances such as 2xl and 8xl.

Do you mean the index searcher pool?

No, I meant the search pool. Currently the search threadpool is sized at 1.5x the processor count, and if all the threads are busy it ends up consuming all the available cores, reaching ~100% CPU utilization. I want to see how the system behaves relative to the default setup if we vary the search pool size with concurrent search enabled.
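Concretely, that would be the static thread_pool.search.size setting in opensearch.yml; a minimal sketch for an 8-vCPU node (the config path is an assumption for a package install, and the change needs a node restart):

```bash
# a) pin the search pool to numProcessors (default is ~1.5x the processor count)
echo "thread_pool.search.size: 8" >> /etc/opensearch/opensearch.yml

# b) alternatively, numProcessors / 2
# echo "thread_pool.search.size: 4" >> /etc/opensearch/opensearch.yml
```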

I think http_logs is mentioned in Test Dimensions. The pmc workload is also very useful since it runs a number of queries and aggregations; we may consider it.

Will take a look at it

reta commented 1 year ago

Thanks @sohami

No, I meant the search pool.

:+1: We should definitely include the index searcher pool sizing as well, I think, since this is the pool the index searcher will use.
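One way to keep an eye on both pools during a run is the _cat thread pool API; the index_searcher pool name assumes the pool used by concurrent search, so verify it against the version under test:

```bash
# Show configured size, active threads, queue depth and rejections for the
# search and index_searcher pools on every node.
curl -X GET "localhost:9200/_cat/thread_pool/search,index_searcher?v&h=node_name,name,size,active,queue,rejected"
```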

sandervandegeijn commented 11 months ago

We enabled this on our SIEM cluster while trying to improve performance; see also

https://github.com/opensearch-project/OpenSearch/issues/10859

We saw throughput rise by roughly 45% (400 -> 580 MB/s) and haven't encountered any adverse effects as far as I can tell. Response times in Dashboards are notably faster (no formal numbers, just a user's experience, and I'm sensitive to lag :) ).

I don't have end-to-end performance traces, but from experience I suspect I could extract more performance from a single search node if I could increase the amount of parallelism against our S3 solution.