opensearch-project / OpenSearch


[Concurrent Segment Search][META] Performance benchmark plan #9049

Open sohami opened 1 year ago

sohami commented 1 year ago

This issue captures the different performance benchmarks we plan to run as part of evaluating concurrent segment search, which is tracked on the project board here. We will gather feedback from the community to see whether we are missing anything that needs to be considered.

Objective:

Benchmark Plan:

Overview

For concurrent segment search, concurrency is introduced at the shard-level request, within the query phase. To get the baseline improvement we can use a setup with a single-shard index (of varying shard sizes). By varying the number of search clients sending requests to this single shard we can achieve both of the below, as each request is independent of the others.

Note: The improvement at the per-shard level may not be the latency improvement actually perceived by the user. In the real world, each top-level search request usually touches multiple shards. The wider the query, the better the overall improvement, as there can be multiple round trips at the request level (based on the 5-shards-per-node limit). So E2E latency will show better results compared to the per-shard result. This can be measured by benchmarking multiple single-shard indices on a node (e.g., 5/10/15 shards) or by extrapolating the baseline numbers obtained from the single-shard case.
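As a rough sketch, the single-shard baseline run described above could be driven through OSB along the lines below; the workload parameter names (number_of_shards, number_of_replicas, search_clients) are assumptions and depend on what the chosen workload actually exposes.

```bash
# Single-shard baseline against an existing cluster; sweep search_clients
# (e.g. 1/2/4/8) across runs to vary the number of independent requests.
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=nyc_taxis \
  --target-hosts=localhost:9200 \
  --workload-params="number_of_shards:1,number_of_replicas:0,search_clients:8"
```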

Things to consider during benchmark setup:

Test Dimensions

Performance Test Scenarios:

sohami commented 1 year ago

@reta @anasalkouz @andrross Would like to get your feedback on this.

andrross commented 1 year ago

Don't force merge the segments to 1 as is done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs or different cluster setups. For example: for nyc_taxis we can merge to 20 segments across different setups.

@sohami Is this correct? The nyc_taxis workload does perform a force merge, but the max number of segments is unset by default, which I believe means that it will merge based on whatever the merge policy is. Is that right? Regardless, the point still stands that we should control the number of segments during testing as it will likely have a big impact on the results.
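For reference, pinning the segment count after ingestion is a plain force merge to a fixed value; a minimal sketch using the 20-segment example above:

```bash
# Force merge to a fixed segment count so runs are comparable across setups
# (20 is just the example value from the comment above).
curl -X POST "localhost:9200/nyc_taxis/_forcemerge?max_num_segments=20"

# Confirm the resulting per-shard segment counts.
curl -X GET "localhost:9200/_cat/segments/nyc_taxis?v"
```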

andrross commented 1 year ago

Shard size: ranging between 20-50GB max, since that's what we recommend

@sohami Digging a bit into this, AWS recommends 10-30GB for search workloads and 30-50GB for a logs workload. Similarly, Opster gives a general guidance of 30-50GB. Given this, I would suggest using the http_logs workload to approximate a logs use case, and run with shard sizes approaching 50GB. And then maybe use the so workload as an approximation of a search workload with a bit smaller shard size (~20-30GB). What do you think?
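As an aside, the resulting shard sizes can be double-checked with the _cat/shards API once ingestion finishes:

```bash
# List per-shard store sizes to confirm they land in the targeted range
# (~50GB for the logs-style runs, ~20-30GB for the search-style runs).
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,store"
```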

sohami commented 1 year ago

Don't force merge the segments to 1 as is done in OSB benchmarks. Maybe always force merge segments to a common value and keep it the same across runs or different cluster setups. For example: for nyc_taxis we can merge to 20 segments across different setups.

@sohami Is this correct? The nyc_taxis workload does perform a force merge, but the max number of segments is unset by default, which I believe means that it will merge based on whatever the merge policy is. Is that right? Regardless, the point still stands that we should control the number of segments during testing as it will likely have a big impact on the results.

Thanks, reworded. I didn't mean to say that the benchmark performs a force merge to 1 segment by default.

Digging a bit into this, AWS recommends 10-30GB for search workloads and 30-50GB for a logs workload. Similarly, Opster gives a general guidance of 30-50GB. Given this, I would suggest using the http_logs workload to approximate a logs use case, and run with shard sizes approaching 50GB. And then maybe use the so workload as an approximation of a search workload with a bit smaller shard size (~20-30GB). What do you think?

Yes, that is the idea. I was planning to use nyc_taxis and http_logs instead of so, given that those are part of the nightly benchmarks as well and nyc_taxis will cover the search use case; so, I think, is used as an indexing workload. http_logs has a custom data generator, so I was planning to use that workload first with a couple of different shard sizes. It also has query types similar to nyc_taxis, like range/aggregations, plus a few others like term match, etc. After that I will use nyc_taxis, which will represent the search workload; its data set size is ~23GB, which is fine for a search workload.

andrross commented 1 year ago

so, I think, is used as an indexing workload.

@sohami Yes, of course you're right. The so workload doesn't even define any query-related operations :)

My main concern with nyc_taxis is that each document is basically a collection of dates, numbers, and lat/lon pairs (example doc) so I don't think it is super representative of a search workload. Do you think that is a problem?

gashutos commented 1 year ago

@sohami Very curious whether concurrent search will help the http_logs workload, especially search_after & sort queries, and even range queries. Those queries use BKD-based document-skipping logic and can skip entire segments based on the search results gathered from earlier segments. So in the case of concurrent search, let's see whether concurrent search actually ends up consuming more CPU as well as adding latency.
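For context, the class of query being discussed is roughly the sketch below (a sorted query with search_after); the field name and sort value are illustrative placeholders, not the actual http_logs mapping:

```bash
# Sorted query using search_after. With sequential execution, sort/BKD
# optimizations can skip later segments based on results already collected
# from earlier segments; the question above is how that interacts with
# searching segments concurrently.
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 100,
  "sort": [ { "timestamp": "desc" } ],
  "search_after": [ 1483228800000 ]
}'
```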

reta commented 1 year ago

Thanks @sohami

Note: The improvement at the per-shard level may not be the latency improvement actually perceived by the user. In the real world, each top-level search request usually touches multiple shards.

I 100% agree with you here - the single shard scenario is highly unrealistic; I would suggest excluding it from the benchmarking (it could be useful for troubleshooting, for example, but this workload over such a configuration is not representative).

Change the thread count of the search thread pool to a) numProcessors b) numProcessors/2, instead of the default 1.5x processor count. This will be done on instances such as 2xl and 8xl.

Do you mean the index searcher pool?

@sohami Very curious whether concurrent search will help the http_logs workload, especially search_after & sort queries, and even range queries.

I think http_logs is mentioned in Test Dimensions. The pmc workload is also very useful since it runs a number of queries and aggregations; we may consider it.

sohami commented 1 year ago

so, I think, is used as an indexing workload.

@sohami Yes, of course you're right. The so workload doesn't even define any query-related operations :)

My main concern with nyc_taxis is that each document is basically a collection of dates, numbers, and lat/lon pairs (example doc) so I don't think it is super representative of a search workload. Do you think that is a problem?

The main goal is to exercise different query types (like range/term/aggregations) and use the workloads available in OSB. These operations should be common across search/log analytics use cases. Since nyc_taxis and http_logs are the workloads we are using for baseline performance in each release, I think we should be fine using these.

sohami commented 1 year ago

Note: The improvement at the per-shard level may not be the latency improvement actually perceived by the user. In the real world, each top-level search request usually touches multiple shards.

I 100% agree with you here - the single shard scenario is highly unrealistic; I would suggest excluding it from the benchmarking (it could be useful for troubleshooting, for example, but this workload over such a configuration is not representative).

@reta What I mean is that with a single-shard setup we will get the best possible improvement from concurrent execution, so it is important to understand that and benchmark the behavior. Also, it is not entirely unrealistic: for some search use cases users assign 1 shard per node in their setup.

To understand the behavior with multiple shards being searched on a node, we will have multiple clients sending search requests to a single shard, with CPU utilization at, say, 50%. We can then run the same workload with concurrent search enabled and compare the behavior. Having said that, we will also run scenarios with a single client and multiple shards on a node, searching across all the shards. The expectation is that with more than 5 shards per node we should ideally see the latency improvement multiply, as in that case each search request makes multiple round trips.
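For anyone reproducing these comparisons, concurrent segment search is toggled per run via a dynamic cluster setting; the setting name below reflects the 2.x releases (earlier experimental builds also gate it behind a feature flag), so treat the exact name/version as an assumption:

```bash
# Enable concurrent segment search for the "concurrent" side of the comparison;
# set it back to false for the baseline run.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "search.concurrent_segment_search.enabled": true
  }
}'
```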

Change the thread count of the search thread pool to a) numProcessors b) numProcessors/2, instead of the default 1.5x processor count. This will be done on instances such as 2xl and 8xl.

Do you mean the index searcher pool?

No, I meant the search pool. Currently the search threadpool is sized at 1.5x the processor count, and if all the threads are busy it ends up consuming all the available cores, reaching ~100% CPU utilization. I want to see how the system behaves relative to the default setup if we vary the search pool size with concurrent search enabled.
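Concretely, that would be the static thread_pool.search.size setting in opensearch.yml; a minimal sketch for an 8-vCPU node (the config path is an assumption for a package install, and the change needs a node restart):

```bash
# a) pin the search pool to numProcessors (default is ~1.5x the processor count)
echo "thread_pool.search.size: 8" >> /etc/opensearch/opensearch.yml

# b) alternatively, numProcessors / 2
# echo "thread_pool.search.size: 4" >> /etc/opensearch/opensearch.yml
```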

I think http_logs is mentioned in Test Dimensions. The pmc workload is also very useful since it runs a number of queries and aggregations; we may consider it.

Will take a look at it

reta commented 1 year ago

Thanks @sohami

No, I meant the search pool.

:+1: We should definitely include the index searcher pool sizing as well, I think, since this is the pool the index searcher will use.
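One way to keep an eye on both pools during a run is the _cat thread pool API; the index_searcher pool name assumes the pool used by concurrent search, so verify it against the version under test:

```bash
# Show configured size, active threads, queue depth and rejections for the
# search and index_searcher pools on every node.
curl -X GET "localhost:9200/_cat/thread_pool/search,index_searcher?v&h=node_name,name,size,active,queue,rejected"
```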

sandervandegeijn commented 11 months ago

We enabled this on our SIEM cluster while trying to improve performance; see also

https://github.com/opensearch-project/OpenSearch/issues/10859

We saw throughput rise by roughly 45% (400 -> 580 MB/s) and haven't encountered any adverse effects as far as I can tell. Response times in Dashboards are notably faster (no formal numbers, just a user's experience, and I'm sensitive to lag :) ).

I don't have end-to-end performance traces, but from experience I suspect I could extract more performance from a single search node if I could increase the amount of parallelism against our S3 solution.