opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosytem
Apache License 2.0
57 stars 58 forks source link

In hybrid query replace Java Stream calls with a faster alternative #705

Closed martin-gaievski closed 2 months ago

martin-gaievski commented 2 months ago

As part of performance optimization for Hybriq query slow Java Stream API calls need to be replaced by faster alternatives.

For instance if compared to Boolean it can be up to 12 times slower, depending on the dataset, query and index/cluster configuration. Check results of benchmark that I took for released 2.13 using noaa OSB workload, all times are in ms:

One sub-query that selects 11M documents

Bool: 98.1014
Hybrid: 973.683

One sub-query that selects 1.6K documents

Bool: 181.046
Hybrid: 90.1155

Three sub-query that select 15M documents

Bool: 117.505
Hybrid: 1458.8

Based on results of profiling most of the CPU time (35 to 40%) is taken by Stream.findFirst call in HybridQueryScorer.

That code is executed for each document returned by each of sub-query. That explains much longer execution time for queries that return larger sub-sets of a dataset.

That section of the code can be optimized to a plain for loop, plus the list of Integer is replaced by the plain array of ints. After optimization same code section takes 5 to 8% of overall execution time. Total time for clean hybrid query has been decreased 3-4 times for large sub-sets.

Below are detailed results for the same workload:

One sub-query that selects 11M documents

Bool: 84.7201
Hybrid: 256.799

One sub-query that selects 1.6K documents

Bool: 89.6258
Hybrid: 85.4563

Three sub-query that select 15M documents

Bool: 90.1481
Hybrid: 326.331

following were bool queries used in testing

Query 1
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "term": {
                          "station.country_code": "JA"
                      }
                  },
                  {
                      "range": {
                          "TRANGE": {
                              "gte": 0,
                              "lte": 30
                          }
                      }
                  },
                  {
                    "range": {
                        "date": {
                            "gte": "2016-06-04",
                            "format":"yyyy-MM-dd"
                        }
                    }
                  }
              ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                          "TRANGE": {
                              "gte": -100,
                              "lte": -50
                          }
                      }
                  }
              ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                        "TRANGE": {
                          "gte": 1,
                          "lte": 35
                        }
                      }
                  }
              ]
          }

equivalent hybrid queres are:

Query 1
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "term": {
                        "station.country_code": "JA"
                    }
                },
                {
                    "range": {
                        "TRANGE": {
                            "gte": 0,
                            "lte": 30
                        }
                    }
                },
                {
                  "range": {
                      "date": {
                          "gte": "2016-06-04",
                          "format":"yyyy-MM-dd"
                      }
                  }
                }
            ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                        "TRANGE": {
                          "gte": -100,
                          "lte": -50
                        }
                    }
                }
            ]
          }

Query 3
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                      "TRANGE": {
                        "gte": 1,
                        "lte": 35
                      }
                    }
                }
            ]
          }

Based on these benchmark results for 2.13 following calls need to be reworked:

Any other similar code should be changed to a faster alternative.

martin-gaievski commented 2 months ago

After first PR with optimization has been merged I take one more round of benchmarks and got following results. That is based on same 3 queries as in initial benchmark:

One sub-query that selects 11M documents

Bool: p50 88.2863 | p90 103.777
Hybrid: p50 299.427 | p90 319.797

One sub-query that selects 1.6K documents

Bool: p50 92.7222 | p90 98.6847
Hybrid: p50 94.5511 | p90 111.645

Three sub-query that select 15M documents

Bool: p50 98.0301 | p90 108.948
Hybrid: p50 475.319 | p90 515.999

326100530-a2beafc9-1deb-4f55-b85b-3c8260ab69bb

Most time (~28%) is taken by store and lookup of the index based on query as a key. Depending on the exact sub-query calculation of its hash code can be slow (hybrid query works with any type of OpenSearch query). That is a problem on large datasets as this is done for each doc by each sub-query.

We can avoid creation and usage of that query to index map by storing sub-query index at time we create collection of DISIWrapers.

I've done the change and got following results for the same 3 types of hybrid query. As per benchmark results that gives about 20% performance boost. I've run it on 2.13 using noaa OSB workload, all times are in ms:

One sub-query that selects 11M documents

Bool: p50 97.9306 | p90 116.299
Hybrid: p50 228.696 | p90 249.665

One sub-query that selects 1.6K documents

Bool: p50 87.3152 | p90 89.3061
Hybrid: p50 89.9654 | p90 92.349

Three sub-query that select 15M documents

Bool: p50 97.9891 | p90 114.396
Hybrid: p50 353.631 | p90 377.527

PR with corresponding change has been merged to main and 2.x branches https://github.com/opensearch-project/neural-search/pull/711