Support for Proximity Search with Wildcards in Query string queries

kwebergithub commented 2 years ago

Is your feature request related to a problem? Please describe. We are switching from the dtSearch engine to OpenSearch and are surprised to find that Query string queries do not support several types of proximity searches. For example: "a b"~10 works, but "a b*"~10 returns inaccurate results (because of the wildcard).

Similarly, "a (b OR c)"~10 and "a (b AND c)"~10 both return inaccurate results, while the dtSearch engine correctly finds the documents.

We work in the legal space, where proximity searches with wildcards are a must-have requirement. The user has no other way to express the search and still get correct results, i.e. there is no workaround for the end user.

Describe the solution you'd like We would like to see full support for proximity searches added to OpenSearch.

Describe alternatives you've considered We are working on parsing the query string using ANTLR and dynamically generating span JSON queries to run these proximity searches in Elasticsearch. This approach seems complex and error-prone.

Additional context We are working with AWS on OpenSearch and the team expressed their support for the legal industry. They requested that we file a feature request here as the best first step.

Thanks! Let us know if you need any additional information.

rishabhmaurya commented 2 years ago

I believe your requirement is similar to what Lucene supports in ComplexPhraseQuery. Here are some examples of using wildcards in a proximity phrase query. Unfortunately, I do not see support for it in OpenSearch (related issue in Elasticsearch). As recommended, it should be fairly simple to implement a plugin that uses ComplexPhraseQuery. Also, the MultiPhrase query in OpenSearch does not support wildcards today.

Another option to explore is Lucene's SpanNearQuery, which is supported in OpenSearch as span_near. For example:

curl -XGET 'http://localhost:9200/test/_search?pretty' -H "Content-Type: application/json" -d '{
    "query": {
        "span_near" : {
            "clauses" : [
                { "span_term" : { "content" : "brown" } },
                { "span_term" : { "content" : "quick" } },
                { "span_multi" : { "match": { "wildcard": { "content" : "jum*" } } } }
            ],
            "slop" : 2,
            "in_order" : true
        }
    }
}'

You can adjust slop and in_order. However, this needs some parsing logic on the client to split the query into terms and use span_multi with a wildcard wherever necessary. This logic can also be implemented as a plugin. Whichever solution you choose, be mindful of the following:

- Performance of queries with wildcards. OpenSearch treats wildcard queries as expensive, so they run only while "search.allow_expensive_queries" is true; that is the default, but if your cluster sets it to false you will need to turn it back on explicitly. When you use a PhraseQuery with wildcards, Lucene internally expands all wildcard terms and the query could explode. Also, SpanNear or similar phrase queries with non-zero slop load positional data for each hit, which is an additional, expensive overhead.
- Phrase queries don't work well with analyzers that use a synonym filter or similar, and you may see unexpected results. There is an open issue in Lucene.
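To make the client-side splitting mentioned above a bit more concrete, here is a minimal Python sketch, not a definitive implementation. It turns a proximity string such as "a b*"~10 into a span_near body, using span_multi for wildcard terms and span_or for a parenthesised OR group. The function names, the tiny regex grammar, and the choice of in_order=False are my own assumptions for illustration; a real parser (ANTLR or otherwise) would need to cover far more syntax, and span slop is only roughly comparable to phrase slop.

```python
import json
import re

# Hypothetical mini-grammar: "<terms and (x OR y) groups>"~<slop>
PROXIMITY_RE = re.compile(r'^\s*"(?P<body>[^"]+)"\s*~\s*(?P<slop>\d+)\s*$')
TOKEN_RE = re.compile(r"\(([^()]+)\)|(\S+)")


def term_clause(field, term):
    """Return span_multi for a wildcard term, span_term otherwise."""
    if "*" in term or "?" in term:
        return {"span_multi": {"match": {"wildcard": {field: term}}}}
    return {"span_term": {field: term}}


def group_clause(field, group):
    """Translate '(b OR c)' into span_or; anything else is rejected."""
    alternatives = [t for t in re.split(r"\s+OR\s+", group.strip()) if t]
    if any(re.search(r"\s", t) for t in alternatives):
        # e.g. '(b AND c)' has no single span equivalent in this sketch
        raise ValueError(f"unsupported group: ({group})")
    if len(alternatives) == 1:
        return term_clause(field, alternatives[0])
    return {"span_or": {"clauses": [term_clause(field, t) for t in alternatives]}}


def proximity_to_span_near(query, field, in_order=False):
    """Build a span_near search body from a proximity query string."""
    match = PROXIMITY_RE.match(query)
    if not match:
        raise ValueError(f"not a proximity query: {query!r}")
    clauses = []
    for token in TOKEN_RE.finditer(match.group("body")):
        group, term = token.group(1), token.group(2)
        if group is not None:
            clauses.append(group_clause(field, group))
        elif "(" in term or ")" in term:
            raise ValueError("unbalanced or nested parentheses")
        else:
            clauses.append(term_clause(field, term))
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": int(match.group("slop")),
                # Assumption: proximity should match in either order.
                "in_order": in_order,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(proximity_to_span_near('"a (b OR c*)"~10', "content"), indent=2))
```

The resulting dict can be sent as the body of a _search request like the curl call above. Note that span_term values are not analyzed, so the terms have to be normalized the same way the field's analyzer would (lowercasing, for instance) before they are passed in.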

A nicer way to solve this problem would be to use an automaton, such as TermAutomatonQuery. I'm not sure how efficient it is or whether it is still experimental; LUCENE-3843 is still open, and it may give unexpected results. There is a nice blog post on it. This is again something you would have to write a plugin for, as it's not supported in OpenSearch. If it works for your use cases, great! Given that several alternatives are available by writing a plugin, adding this to the core doesn't make a lot of sense at this point, unless it's needed by several other customers too.

kwebergithub commented 2 years ago

Thanks for the feedback. We are working with the span query, but the approach involves parsing the user query using ANTLR with what could be a never-ending amount of support/testing given the number of combinations of user inputs.

As mentioned earlier, this capability is critical in the legal space. The industry leader dtSearch had this functionality “out of the box”. I think the demand for this functionality is only going to increase. AWS has mentioned that eDiscovery was a strategic area for them. I think it could be a significant differentiator for OpenSearch over Elasticsearch and something we hope AWS will keep pushing for.

Much appreciated.

opensearch-project / OpenSearch

Support for Proximity Search with Wildcards in Query string queries #1893