vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Support grammar:weakAnd with userInput() for usage with weakAnd() operator #13076

Closed jobergum closed 2 years ago

jobergum commented 4 years ago

The Vespa WAND operator (weakAnd in YQL) is a powerful, cheap recall function that in many scenarios has perfect recall at a fraction of the cost of a brute-force any/OR query (https://towardsdatascience.com/learning-from-unlabelled-data-with-covid-19-open-research-dataset-cded4979f1cf). However, it is currently not straightforward to use because it does not handle userInput/userQuery easily, so middle tiers need to tokenize the user input themselves or write Java plugins that manipulate the parsed query tree inside the Vespa search container.

Consider that the user has entered the following query in a free text search form:

'the covid incubation period'

With simple whitespace tokenization we can write: where weakAnd(default contains "the", default contains "covid", default contains "incubation", default contains "period");

https://api.cord19.vespa.ai/search/?yql=select%20title,abstract%20from%20sources+*%20where%20weakAnd(default%20contains%20%22the%22,default%20contains%20%22covid%22,default%20contains%20%22incubation%22,%20default%20contains%20%22period%22)%3B&hits=1&tracelevel=3

"YQL+ query parsed: [select abstract, title from sources * where weakAnd(default contains "the", default contains "covid", default contains "incubation", default contains "period") limit 1 timeout 2000;]"
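The middle-tier tokenization described above can be sketched in a few lines of Python. This is only an illustration of the workaround, not a recommended approach: the function name `build_weakand_yql` and its parameters are hypothetical, and the tokenizer is a plain whitespace split with no stemming or normalization, which is exactly what makes it prone to diverging from Vespa's own linguistic processing.

```python
def build_weakand_yql(user_text, field="default", select="title,abstract"):
    """Build a weakAnd YQL query from free text (hypothetical helper).

    Assumption: tokens are produced by a naive whitespace split, so the
    result may not match the tokens Vespa produced at indexing time.
    """
    terms = user_text.split()
    if not terms:
        raise ValueError("query text contains no terms")

    def quote(term):
        # Escape backslashes and double quotes so the term is a valid YQL string.
        return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # One 'contains' clause per token, each a separate weakAnd argument.
    clauses = ", ".join("{} contains {}".format(field, quote(t)) for t in terms)
    return "select {} from sources * where weakAnd({});".format(select, clauses)


if __name__ == "__main__":
    print(build_weakand_yql("the covid incubation period"))
```

Each token becomes its own weakAnd argument, which is the query shape shown in the parsed trace above.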

The above approach works but is error-prone: the linguistic processing (tokenization) happens outside of Vespa, which could lead to asymmetric behaviour. Using userInput() lets Vespa parse the query and do the linguistic processing itself, with less chance of asymmetric tokenization.

https://api.cord19.vespa.ai/search/?yql=select%20title,abstract%20from%20sources+*%20where%20weakAnd(%5B%7B%22grammar%22:%22any%22%7D%5DuserInput(@inputQuery))%3B&hits=1&tracelevel=3&inputQuery=the+covid+incubation+period

However, the parsed weakAnd expression takes the entire input as a single first argument (one OR over all terms), and totalCount becomes 2x larger than when weakAnd is used properly with multiple arguments.

"YQL+ query parsed: [select abstract, title from sources * where weakAnd((default contains "the" OR default contains "covid" OR default contains "incubation" OR default contains "period")) limit 1 timeout 2000;]"
},
query=[WAND(100) (OR default:the default:covid default:incubation default:period)]

Having a weakAnd grammar for userInput/userQuery would produce the same query tree as when the input is pre-tokenized by the middle tier:

https://api.cord19.vespa.ai/search/?yql=select%20title,abstract%20from%20sources+*%20where%20weakAnd(%5B%7B%22grammar%22:%22weakAnd%22%7D%5DuserInput(@inputQuery))%3B&hits=1&tracelevel=3&inputQuery=the+covid+incubation+period
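For comparison, a request using the proposed grammar could be assembled as in the sketch below. The helper name `build_userinput_url` is hypothetical, and the YQL shape simply mirrors the URL above; the raw user text is passed untouched in the `inputQuery` parameter so that all tokenization is left to Vespa.

```python
from urllib.parse import urlencode


def build_userinput_url(endpoint, user_text, grammar="weakAnd", hits=1, tracelevel=3):
    """Build a search URL that passes raw user text via userInput() (hypothetical helper).

    Assumption: the YQL form follows the request shown in this issue; the
    syntax accepted by a given Vespa version may differ.
    """
    yql = (
        "select title,abstract from sources * where "
        'weakAnd([{"grammar":"%s"}]userInput(@inputQuery));' % grammar
    )
    params = {
        "yql": yql,
        "hits": hits,
        "tracelevel": tracelevel,
        # The unmodified user text; no tokenization happens in the middle tier.
        "inputQuery": user_text,
    }
    # urlencode percent-encodes the YQL and encodes spaces as '+'.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_userinput_url("https://api.cord19.vespa.ai/search/", "the covid incubation period"))
```

Switching `grammar` to `"any"` reproduces the earlier request that wraps all terms in a single OR.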

jobergum commented 4 years ago

Also, this would support running multiple weakAnd query operators over different fieldsets/defaultIndex values using the same userInput().

bratseth commented 2 years ago

weakAnd YQL userQuery grammar and parser option added in https://github.com/vespa-engine/vespa/pull/20716

bratseth commented 2 years ago

Documentation of weakAnd option: https://github.com/vespa-engine/documentation/pull/1744