[BUG] Regression in nyc_taxis desc_sort_tip_amount between 2.16 and 2.17

opensearch-project / OpenSearch

[BUG] Regression in nyc_taxis desc_sort_tip_amount between 2.16 and 2.17 #16220

Open peteralfonsi opened 1 week ago

peteralfonsi commented 1 week ago

Describe the bug

desc_sort_tip_amount in nyc_taxis is this query:

"body": {
  "query": {
    "match_all": {}
  },
  "sort": [
    {"tip_amount": "desc"}
  ]
}
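For ad hoc checks outside OSB, the same request can be sent straight to the cluster. This is only a minimal sketch, not part of the workload: it assumes the index is named nyc_taxis and that the cluster listens on http://localhost:9200 without authentication, and the helper names are made up for illustration. It returns the server-side "took" value, which is not the same as the client-side latency OSB reports.

```python
import json
import urllib.request


def build_sort_query(order="desc"):
    """Return the search body for desc_sort_tip_amount (or its asc variant)."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {"query": {"match_all": {}}, "sort": [{"tip_amount": order}]}


def run_search(body, host="http://localhost:9200", index="nyc_taxis"):
    """POST the body to _search and return the server-reported 'took' in ms.

    host and index are assumptions; adjust them for the cluster under test.
    """
    request = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["took"]
```

Calling run_search(build_sort_query("desc")) and run_search(build_sort_query("asc")) a few times each gives a rough side-by-side view of the two sort orders.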

Measured with OSB, its p50 latency was 9.6 ms in 2.16 but 36.8 ms in 2.17. Other percentiles are similarly affected. This happens consistently across different runs. It looks like the nightly benchmarks don't run this operation.

asc_sort_tip_amount is the same query with "asc" sort order. It's not affected. Its p50 went from 7.7 to 7.4 ms.
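To put the p50 numbers above side by side, here is a small sketch that expresses each change as a ratio. The function name is just for illustration; the inputs are the values quoted above.

```python
def slowdown(before_ms, after_ms):
    """Return after/before as a ratio; values above 1.0 mean slower."""
    return after_ms / before_ms


# p50 latencies in ms as (2.16, 2.17), taken from the numbers above.
p50 = {
    "desc_sort_tip_amount": (9.6, 36.8),
    "asc_sort_tip_amount": (7.7, 7.4),
}

for name, (v216, v217) in p50.items():
    print(f"{name}: {slowdown(v216, v217):.2f}x")
# desc_sort_tip_amount: 3.83x
# asc_sort_tip_amount: 0.96x
```

So the descending sort is roughly 3.8 times slower at p50, while the ascending sort is essentially unchanged.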

Related component

Search:Performance

To Reproduce

1. Create a tar install from the 2.16 or 2.17 branch using ./gradlew assemble
2. Run OpenSearch on a c5.xl instance with a 4 GB heap
3. Run nyc_taxis against the cluster using OSB and compare the latencies

Expected behavior

The latencies should be on par.

Additional Details

Plugins: No plugins

Host/Environment: AL2 on a c5.xl instance, using a tar install of OpenSearch built from the 2.16 or 2.17 branch

peteralfonsi commented 1 week ago

With more testing, I found that some runs of the workload don't show the regression.

When that happens and we do more runs without recreating the index, those subsequent runs also don't show the regression. To skip recreating the index, exclude those operations in OSB with a command like:

opensearch-benchmark execute-test --workload-path=/home/ec2-user/osb/opensearch-benchmark-workloads/nyc_taxis --workload-params='{"bulk_indexing_clients":16}' --target-host=http://localhost:9200/ --exclude-tasks=delete-index,create-index,check-cluster-health,index

But when we recreate the index, the regression reappears. So it seems to have something to do with indexing?
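For scripting repeated runs, the command above can be assembled programmatically. This is a rough sketch: the flags and values are copied from that command, while osb_command and its reuse_index parameter are hypothetical names introduced here. Launching the process is left to the caller.

```python
import json
import shlex

# Tasks skipped so that an existing index is reused (from the command above).
SKIPPED_TASKS = ["delete-index", "create-index", "check-cluster-health", "index"]


def osb_command(workload_path, target_host, bulk_indexing_clients=16, reuse_index=True):
    """Build the opensearch-benchmark argument list for one nyc_taxis run."""
    params = json.dumps(
        {"bulk_indexing_clients": bulk_indexing_clients}, separators=(",", ":")
    )
    args = [
        "opensearch-benchmark",
        "execute-test",
        f"--workload-path={workload_path}",
        f"--workload-params={params}",
        f"--target-host={target_host}",
    ]
    if reuse_index:
        # Without this flag the workload deletes and recreates the index.
        args.append("--exclude-tasks=" + ",".join(SKIPPED_TASKS))
    return args


cmd = osb_command(
    "/home/ec2-user/osb/opensearch-benchmark-workloads/nyc_taxis",
    "http://localhost:9200/",
)
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

Passing reuse_index=False drops the --exclude-tasks flag, which corresponds to a run that recreates the index.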

So far I've seen the regression about 4 out of 5 times that I've run OSB with a new index.

sandeshkr419 commented 1 week ago

@rishabh6788 Is this the same query shape that you have been investigating?

If yes, can you share your experimentation details here?

getsaurabh02 commented 1 week ago

@peteralfonsi can we try running with one client and see if this is consistently reproducible? The ordering of data with multiple clients can create another variable here, so eliminating that would be great.

peteralfonsi commented 6 days ago

I did 4 runs with "bulk_indexing_clients":1. 3 of them had the regression (2 in the ~35 ms range, 1 in the ~25 ms range) and 1 didn't (~9 ms).

Will post flamegraphs for 2.16 vs 2.17 tomorrow.

peteralfonsi commented 5 days ago

Ok, I've now seen the regression happen on 2.16 as well. This is probably unrelated to the version change and has something to do with how the data is indexed.
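One possible way to compare how the data ended up indexed between a fast and a slow run is to look at the segment layout through the _cat/segments API. This is only a sketch of that idea, not something that was done in this thread: that segment layout is the relevant difference is an assumption, as are the host, the index name nyc_taxis, and both helper names.

```python
import json
import urllib.request


def fetch_segments(host="http://localhost:9200", index="nyc_taxis"):
    """Return the rows of GET _cat/segments/<index>?format=json.

    host and index are assumptions; adjust them for the cluster under test.
    """
    url = f"{host}/_cat/segments/{index}?format=json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def summarize_segments(rows):
    """Map each shard to the doc counts of its searchable primary segments,
    largest segment first."""
    summary = {}
    for row in rows:
        is_primary = row.get("prirep") == "p"
        is_searchable = str(row.get("searchable")).lower() == "true"
        if is_primary and is_searchable:
            summary.setdefault(row["shard"], []).append(int(row["docs.count"]))
    return {shard: sorted(counts, reverse=True) for shard, counts in summary.items()}
```

Saving summarize_segments(fetch_segments()) after a run with the regression and after one without it would show whether the number or size of segments differs between the two.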