opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[BUG] Upgrade from OpenSearch 2.11 -> 2.13 breaks working boolean+hybrid query, hybrid+request_processor combinations #903

Closed sutsr closed 4 days ago

sutsr commented 1 week ago

What is the bug?

I have two query types that ran fine on OpenSearch 2.11, but after upgrading to 2.13 they fail with an illegal argument exception: hybrid query must be a top level query and cannot be wrapped into other queries.

Request processor used to combine a criterion with a hybrid search

POST /hybrid_search_index1/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "family"
            }
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "family",
              "model_id": "model_id_here",
              "k": 5
            }
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "request_processors": [
      {
        "filter_query": {
          "query": {
            "range": {
              "publish_date": {
                "lte": "2024-01-01"
              }
            }
          }
        }
      }
    ],
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [
                0.7,
                0.3
              ]
            }
          }
        }
      }
    ]
  }
}

Top level boolean used to combine an exclusion criterion with the results of a hybrid search

POST /hybrid_search_index1/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "is_flagged": true
          }
        }
      ],
      "must": [
        {
          "hybrid": {
            "queries": [
              {
                "match": {
                  "text": {
                    "query": "family"
                  }
                }
              },
              {
                "neural": {
                  "embedding": {
                    "query_text": "family",
                    "model_id": "model_id_here",
                    "k": 5
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [
                0.7,
                0.3
              ]
            }
          }
        }
      }
    ]
  }
}

How can one reproduce the bug?

Attempt to combine a hybrid query with a search pipeline request processor or use it as part of a boolean query in the fashion shown by the example queries.

The index used for these queries has the following fields:

What is the expected behavior?

In the first instance, I expect the hybrid query to run, only returning results that meet the additional criterion from the request processor.

In the second instance I expect the query to run and return results that meet both conditions.

What is your host/environment?

AWS Managed OpenSearch

Do you have any screenshots?

Not applicable

Do you have any additional context?

The model used for neural search is an external model that calls OpenAI's ada-002 embedding API, configured as per the blueprint.

yuye-aws commented 1 week ago

I think @martin-gaievski has more context on the hybrid query.

martin-gaievski commented 1 week ago

@sutsr the behavior you're seeing in 2.13 is expected: we have removed the possibility of nesting or wrapping a hybrid query inside another query. The change was introduced in 2.12, https://github.com/opensearch-project/neural-search/pull/498. The main reason is how the hybrid query currently runs: normalization happens as post-processing over all shard-level results. A regular query computes its scores at the shard level, and those scores would be changed by the hybrid query anyway.
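
To make the post-processing step concrete, here is a minimal Python sketch of L2 normalization followed by a weighted arithmetic mean, the two techniques configured in the pipeline from the issue description. It is an illustration only, not the plugin's code: the function names and scores are made up, and treating a document missing from one sub-query as scoring 0 is an assumption.

```python
import math


def l2_normalize(scores):
    """Divide each score by the L2 norm of all scores of one sub-query."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: s / norm for doc, s in scores.items()}


def combine_arithmetic_mean(per_query_scores, weights):
    """Weighted arithmetic mean across sub-queries.

    Assumption: a document absent from a sub-query contributes a score of 0.
    """
    total_weight = sum(weights)
    docs = set().union(*per_query_scores)
    return {
        doc: sum(w * q.get(doc, 0.0) for q, w in zip(per_query_scores, weights))
        / total_weight
        for doc in docs
    }


# Hypothetical scores, already merged from all shards for each sub-query.
match_scores = {"doc1": 3.0, "doc2": 4.0}
neural_scores = {"doc2": 0.6, "doc3": 0.8}

normalized = [l2_normalize(match_scores), l2_normalize(neural_scores)]
final = combine_arithmetic_mean(normalized, [0.7, 0.3])

for doc, score in sorted(final.items(), key=lambda item: -item[1]):
    print(doc, round(score, 4))
```

Each normalized score depends on the scores of every other hit for the same sub-query, so under this scheme any score an outer query assigned at the shard level would be overwritten once the results are merged.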

sutsr commented 1 week ago

Thanks for the clarification and explanation @martin-gaievski. Can you offer some guidance on how best to rework queries along the lines of my example queries?

Should a boolean query inside post_filter be used to replace any queries previously combined as in my examples (using bool wrapping hybrid or queries inside request_processors)? Essentially I am trying to provide result ranking from the hybrid query with knock-out exclusion on non-ranking-related fields.

e.g.

POST /hybrid_search_index1/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "family"
            }
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "family",
              "model_id": "model_id_here",
              "k": 5
            }
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [
                0.7,
                0.3
              ]
            }
          }
        }
      }
    ]
  },
  "post_filter": {
    "bool": {
      "filter": [
        {
          "range": {
            "publish_date": {
              "lte": "2024-01-01"
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "is_flagged": true
          }
        }
      ]
    }
  }
}

martin-gaievski commented 4 days ago

@sutsr right, post filter should solve your use case; it is not considered a wrapper for the hybrid query. As long as you don't need an exact number of results and are OK with post-processing of the main query results, post filter is a good choice.
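
For anyone reworking similar queries, the pattern discussed in this thread can be sketched as a small Python helper that keeps `hybrid` as the top-level query and moves exclusion criteria into `post_filter`. The helper name `build_hybrid_request` and its parameters are hypothetical; it only assembles the request body and does not call OpenSearch.

```python
import json


def build_hybrid_request(sub_queries, filters=None, must_not=None, pipeline=None):
    """Build a search body with `hybrid` at the top level.

    Filter and exclusion clauses go into `post_filter` instead of a
    `bool` query wrapped around the hybrid query.
    """
    body = {"query": {"hybrid": {"queries": list(sub_queries)}}}

    bool_clause = {}
    if filters:
        bool_clause["filter"] = list(filters)
    if must_not:
        bool_clause["must_not"] = list(must_not)
    if bool_clause:
        body["post_filter"] = {"bool": bool_clause}

    if pipeline:
        body["search_pipeline"] = pipeline
    return body


body = build_hybrid_request(
    sub_queries=[
        {"match": {"text": {"query": "family"}}},
        {
            "neural": {
                "embedding": {
                    "query_text": "family",
                    "model_id": "model_id_here",  # placeholder, as in the thread
                    "k": 5,
                }
            }
        },
    ],
    filters=[{"range": {"publish_date": {"lte": "2024-01-01"}}}],
    must_not=[{"term": {"is_flagged": True}}],
)

print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` also avoids the hand-editing slips that are easy to make in long request bodies, such as trailing commas or curly quotes.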