Replaced stream.findFirst by `for` loop for hybrid query

martin-gaievski commented 5 months ago

Description

Hybrid query is generally slower than other compound queries with similar child sub-queries/clauses. For instance if compared to Boolean it can be up to 12 times slower, depending on the dataset, query and index/cluster configuration. Check results of benchmark that I took for released 2.13 using noaa OSB workload, all time is in ms:

One sub-query that selects 11M documents

Bool: 98.1014
Hybrid: 973.683

One sub-query that selects 1.6K documents

Bool: 181.046
Hybrid: 90.1155

Three sub-query that select 15M documents

Bool: 117.505
Hybrid: 1458.8

Based on results of profiling most of the CPU time (35 to 40%) is taken by Stream.findFirst call in HybridQueryScorer.

That code is executed for each document returned by each of sub-query. That explains much longer execution time for queries that return larger sub-sets of a dataset.

That section of the code can be optimized to a plain for loop, plus the list of Integer is replaced by the plain array of ints. After optimization same code section takes 5 to 8% of overall execution time. Total time for clean hybrid query has been decreased 3-4 times for large sub-sets.

Below are detailed results for the same workload:

One sub-query that selects 11M documents

Bool: 84.7201
Hybrid: 256.799

One sub-query that selects 1.6K documents

Bool: 89.6258
Hybrid: 85.4563

Three sub-query that select 15M documents

Bool: 90.1481
Hybrid: 326.331

following were bool queries used in testing

Query 1
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "term": {
                          "station.country_code": "JA"
                      }
                  },
                  {
                      "range": {
                          "TRANGE": {
                              "gte": 0,
                              "lte": 30
                          }
                      }
                  },
                  {
                    "range": {
                        "date": {
                            "gte": "2016-06-04",
                            "format":"yyyy-MM-dd"
                        }
                    }
                  }
              ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                          "TRANGE": {
                              "gte": -100,
                              "lte": -50
                          }
                      }
                  }
              ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                        "TRANGE": {
                          "gte": 1,
                          "lte": 35
                        }
                      }
                  }
              ]
          }

equivalent hybrid queres are:

Query 1
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "term": {
                        "station.country_code": "JA"
                    }
                },
                {
                    "range": {
                        "TRANGE": {
                            "gte": 0,
                            "lte": 30
                        }
                    }
                },
                {
                  "range": {
                      "date": {
                          "gte": "2016-06-04",
                          "format":"yyyy-MM-dd"
                      }
                  }
                }
            ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                        "TRANGE": {
                          "gte": -100,
                          "lte": -50
                        }
                    }
                }
            ]
          }

Query 3
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                      "TRANGE": {
                        "gte": 1,
                        "lte": 35
                      }
                    }
                }
            ]
          }

Issues Resolved

https://github.com/opensearch-project/neural-search/issues/705

Check List

[x] All tests pass
[X] Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 58.82353% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 84.74%. Comparing base (cc6a6b2) to head (0376e9b). Report is 12 commits behind head on main.

:exclamation: Current head 0376e9b differs from pull request most recent head e34a7bc. Consider uploading reports for the commit e34a7bc to get more accurate results

Files	Patch %	Lines
...ensearch/neuralsearch/query/HybridQueryScorer.java	58.82%	4 Missing and 3 partials :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #706 +/- ## ============================================ + Coverage 84.04% 84.74% +0.69% - Complexity 744 774 +30 ============================================ Files 59 59 Lines 2313 2412 +99 Branches 374 405 +31 ============================================ + Hits 1944 2044 +100 + Misses 214 203 -11 - Partials 155 165 +10 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

navneet1v commented 5 months ago

This is an interesting improvement for such a small change. Really liked the deep-dive here.

Few things I would like you to do here just for bookkeeping purpose:

The numbers and details which are quoted in the description please update them in the Issue.
[Next step] Can you look at other places where we might be using streams in the hybrid query clause which may be adding some extra latency?

vibrantvarun commented 5 months ago

Great catch @martin-gaievski . LGTM.

chishui commented 5 months ago

This is a great improvement, curious to know if you have checked the profiling after this optimization to see if the CPU usage goes down and which one is the next culprit to CPU usage

opensearch-trigger-bot[bot] commented 5 months ago

The backport to 2.13 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.13 2.13
# Navigate to the new working tree
cd .worktrees/backport-2.13
# Create a new branch
git switch --create backport/backport-706-to-2.13
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b277b07e89e25f1abaa9de3a326fda3556dc8a77
# Push it to GitHub
git push --set-upstream origin backport/backport-706-to-2.13
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.13

Then, create a pull request where the base branch is 2.13 and the compare/head branch is backport/backport-706-to-2.13.

martin-gaievski commented 5 months ago

This is an interesting improvement for such a small change. Really liked the deep-dive here.

Few things I would like you to do here just for bookkeeping purpose:

The numbers and details which are quoted in the description please update them in the Issue.

[Next step] Can you look at other places where we might be using streams in the hybrid query clause which may be adding some extra latency?

Done
Sure, we need to scan the code for similar calls. I haven't noticed any other places in code with similar problem that significantly add to performance of the query. For now the next step is to exclude TopDocsCollector from the flow when there is a hybrid query, that section has second highest latency.

martin-gaievski commented 5 months ago

This is a great improvement, curious to know if you have checked the profiling after this optimization to see if the CPU usage goes down and which one is the next culprit to CPU usage

Next slowest section after Stream.findFirst is extra doc collector. It's added by core for all the queries it consumes resources (per my results it's from 40 to 75% of CPU time) but for hybrid query those results are just ignored. Current feasible approach requires changes in both core and the plugin. I need to check how CPU usage is after the fix.

opensearch-project / neural-search