opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.61k stars 1.76k forks source link

[BUG] Poor performance of regex queries #5097

Open Jon-AtAWS opened 1 year ago

Jon-AtAWS commented 1 year ago

Describe the bug

The following query:

"simple_query_string": {
    "query": "/.*innerTerm.*/",
    "fields": [],
    "type": "best_fields",
    "default_operator": "or",
    "max_determinized_states": 10000,
    "enable_position_increments": true,
    "fuzziness": "AUTO",
    "fuzzy_prefix_length": 0,
    "fuzzy_max_expansions": 50,
    "phrase_slop": 0,
    "escape": false,
    "auto_generate_synonyms_phrase_query": true,
    "fuzzy_transpositions": true,
    "boost": 1
 }

Is inefficient and runs for many seconds, especially when there are many fields in the index. When there are also many indices in the query, this can take down the cluster.

Expected behavior

This is one clause of a larger query, but because of the way queries are processed, there's no way to force the filtering to occur first. In this case, we can't use a post-filter, because we need accurate values in the aggregations.

We'd like to reduce the overall cost of running the query.

dblock commented 1 year ago

@Jon-AtAWS Do you have any suggestions of how we could express the order of filtering we'd like in this query to let it happen both ways?

Jon-AtAWS commented 1 year ago

So, let's say this is part of a larger query, like (p-code, excuse the lack of syntax)

bool
  filter
    bool
      must
        some time range
        some field filter
        simple-query-string: regex

And, let's say I know that the time range and field filter will reduce the match set to a few hundred docs, vs. evaluating the cost of the regex as part of the query planning.

Just brainstorming...

  1. We could add an "execution_order": or similar to the clauses. This is likely too hard to guarantee
  2. We could add something like "weight": that gets passed in to Lucene to either substitute for the cost function or modify it
  3. We could add something time related like "stop_after":

We can also work on how Lucene approximates regex queries, to increase their weight an order of magnitude or 2.

msfroh commented 1 year ago

This may benefit from https://github.com/opensearch-project/OpenSearch/issues/7057.

Alternatively, it might have already improved in newer versions of Lucene, since (IIRC) it no longer precomputes a bitset of all matching docs across all fields.

Jon-AtAWS commented 1 year ago

The release of search pipelines fixes this problem in one way. We should evaluate the cost of splitting out filters like the above and managing pipeline flow vs. pushing them all into a single query and having Lucene optimize.

msfroh commented 4 months ago

We're also adding support for the wildcard field type to make these queries better (if you plan your mappings around wildcard matching): https://github.com/opensearch-project/OpenSearch/issues/5639