opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Performance Regression in 2.14 and 3.0 hourly_aggs in http_logs workload #13345

Closed mgodwan closed 5 months ago

mgodwan commented 6 months ago

Describe the bug

https://opensearch.org/benchmarks

[Screenshot 2024-04-23 at 2:10:27 PM]

The P90 latency observed for the hourly_aggs query has regressed over the last week.
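For readers unfamiliar with the metric, here is a minimal sketch of how a P90 latency could be computed and compared between two runs. It uses the nearest-rank method; the benchmark tooling may compute percentiles differently, and the function names and the 10% tolerance are hypothetical, not taken from this issue.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def has_regressed(baseline: Sequence[float], candidate: Sequence[float],
                  tolerance: float = 0.10) -> bool:
    """Hypothetical check: flag a regression when the candidate P90 exceeds
    the baseline P90 by more than the given relative tolerance."""
    base_p90 = percentile_nearest_rank(baseline, 90)
    cand_p90 = percentile_nearest_rank(candidate, 90)
    return cand_p90 > base_p90 * (1.0 + tolerance)
```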

Related component

Search:Aggregations

Expected behavior

Latency should not increase

Additional Details

No response

mgodwan commented 6 months ago

Tagging @getsaurabh02 @msfroh @bbarani to see if they are aware of any changes. In parallel, I am looking through the commit history to see if I can find a commit that could have caused this.

mgodwan commented 6 months ago

One of the commits from the day the regression started touches the aggregation path slightly: https://github.com/opensearch-project/OpenSearch/commit/8332859ff28c9cd03e468cbd0e7b97092fd795ee [It can be evaluated to see whether it could have had some impact]

getsaurabh02 commented 6 months ago

@mgodwan This looks related to https://github.com/opensearch-project/OpenSearch/pull/13179, where @bowenlan-amzn added a cluster setting to dynamically disable the filter rewrite optimization.

Based on the description, it reduces the deciding threshold for rewrite filters from 1024 to 24. This means that if the date histogram aggregation includes more than 24 buckets (e.g. hourly aggregation of 1 day), we won't use the optimization. After this change, we will probably see a regression for date_histogram_hourly_agg of the big5 workload. That will be handled once the long-term solution is merged next.
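The threshold behaviour described above can be sketched as follows. This is an illustration only, not the OpenSearch implementation: the function names are hypothetical, and whether the real comparison is inclusive at exactly 24 buckets is an assumption based on the "more than 24 buckets" wording.

```python
import math
from datetime import datetime, timedelta


def hourly_bucket_count(start: datetime, end: datetime) -> int:
    """Number of hourly buckets needed to cover the range [start, end)."""
    return math.ceil((end - start) / timedelta(hours=1))


def uses_filter_rewrite(bucket_count: int, max_filters: int) -> bool:
    """Hypothetical check: the optimization is skipped when the aggregation
    would be rewritten into more filters than the threshold allows."""
    return bucket_count <= max_filters


# Three days of hourly buckets: 72 filters.
buckets = hourly_bucket_count(datetime(2024, 4, 1), datetime(2024, 4, 4))
old_default = uses_filter_rewrite(buckets, max_filters=1024)  # optimization applies
new_default = uses_filter_rewrite(buckets, max_filters=24)    # optimization skipped
```

Under these assumptions, any hourly aggregation spanning more than one day loses the optimization at a threshold of 24, which matches the regression pattern reported here.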

bowenlan-amzn commented 6 months ago

The change causing this adds a dynamic cluster setting that decreases the threshold for applying our optimization on date histograms. The threshold is the number of filters rewritten from the date histogram. The previous value of 1024 was reported to cause a regression on the pmc workload.

Since it's a dynamic setting, it won't actually cause a regression for users; instead, it gives them the ability to tune it for their workload.
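As a rough sketch of what tuning this could look like, the snippet below builds a request for the cluster settings API (`PUT /_cluster/settings`). The setting name `search.max_aggregation_rewrite_filters` is an assumption, since this thread does not name the setting; check it against PR #13179 or the documentation for your version. The helper name is hypothetical, and the snippet only builds the request without sending it.

```python
import json
from typing import Tuple

# Assumption: the setting name is not stated in this thread.
SETTING_NAME = "search.max_aggregation_rewrite_filters"


def build_settings_request(max_filters: int,
                           persistent: bool = True) -> Tuple[str, str, str]:
    """Return (method, path, json_body) for updating the threshold.

    Hypothetical helper: it only constructs the request so that it can be
    sent with any HTTP client against the cluster endpoint.
    """
    scope = "persistent" if persistent else "transient"
    body = json.dumps({scope: {SETTING_NAME: max_filters}})
    return "PUT", "/_cluster/settings", body


# Restore the previous threshold of 1024 mentioned in this thread.
method, path, body = build_settings_request(1024)
```

Because the setting is dynamic, a change like this would take effect without a node restart.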

The PR for long term fix: https://github.com/opensearch-project/OpenSearch/pull/13317

mgodwan commented 6 months ago

Thanks @bowenlan-amzn

> Since it's a dynamic setting, it won't actually cause a regression for users; instead, it gives them the ability to tune it for their workload.

Is this setting enabled for the benchmark setup where we are seeing regression?

bowenlan-amzn commented 6 months ago

The setting is a threshold. This operation of the http_logs workload currently exceeds the threshold, so our previous optimization is disabled, hence the regression.

peternied commented 6 months ago

[Triage - attendees 1 2 3 4 5 6] @mgodwan Thanks for creating this issue. This looks like a potential release blocker for v2.14. Please let me know if you need any help getting eyes on this issue.

mgodwan commented 6 months ago

> The setting is a threshold. This operation of the http_logs workload currently exceeds the threshold, so our previous optimization is disabled, hence the regression.

@bowenlan-amzn Do we need to revisit the threshold defaults in that case, since the current ones have been shown to cause a regression?

bowenlan-amzn commented 5 months ago

Fix/Improvements merged in