Closed jobergum closed 2 years ago
This would also support running multiple weakAnd query operators over different fieldsets/defaultIndex values using the same userInput().
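A rough sketch of what that could look like from a middle tier, assuming the `defaultIndex` annotation can be combined with the `weakAnd` grammar on `userInput()`. The helper name `multi_field_weak_and`, the field names `title` and `abstract`, and the choice to join the operators with `or` are illustrative assumptions, not something confirmed in this issue:

```python
import json


def multi_field_weak_and(fields, input_param="inputQuery"):
    """Build one weakAnd(userInput()) clause per field, all reading the same input.

    Hypothetical helper: the annotation layout mirrors the
    [{"grammar": ...}] form used in the URLs of this issue.
    """
    if not fields:
        raise ValueError("at least one field is required")
    clauses = []
    for field in fields:
        # Compact JSON so the annotation looks like the hand-written ones.
        annotation = json.dumps(
            [{"grammar": "weakAnd", "defaultIndex": field}],
            separators=(",", ":"),
        )
        clauses.append(f"weakAnd({annotation}userInput(@{input_param}))")
    # Joining with "or" is an assumption; other combinations are possible.
    return "select title,abstract from sources * where " + " or ".join(clauses) + ";"


if __name__ == "__main__":
    print(multi_field_weak_and(["title", "abstract"]))
```

The same `inputQuery` request parameter then feeds both operators, so the user text is only sent once.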
weakAnd YQL userQuery grammar and parser option added in https://github.com/vespa-engine/vespa/pull/20716
Documentation of weakAnd option: https://github.com/vespa-engine/documentation/pull/1744
The Vespa WAND (weakAnd in YQL) is a powerful, cheap recall function which in many scenarios has perfect recall at a fraction of the cost of doing a brute-force any/or query (https://towardsdatascience.com/learning-from-unlabelled-data-with-covid-19-open-research-dataset-cded4979f1cf). However, it is currently not straightforward to use because it does not handle userInput/userQuery easily, so middle tiers need to tokenize the user input or write Java plugins to manipulate the parsed query tree inside the Vespa search container.
Consider that the user has entered the following query in a free text search form:
'the covid incubation period'
With simple whitespace tokenization, the middle tier can build the query as: where weakAnd(default contains "the", default contains "covid", default contains "incubation", default contains "period");
https://api.cord19.vespa.ai/search/?yql=select%20title,abstract%20from%20sources+*%20where%20weakAnd(default%20contains%20%22the%22,default%20contains%20%22covid%22,default%20contains%20%22incubation%22,%20default%20contains%20%22period%22)%3B&hits=1&tracelevel=3
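For illustration, a minimal sketch of how a middle tier might build that pre-tokenized request. The helper names `weak_and_yql` and `search_url` are made up for this example, and the quote escaping is an assumption about what a careful client would do rather than a documented requirement:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.cord19.vespa.ai/search/"


def weak_and_yql(user_text, field="default"):
    """Turn whitespace-separated user input into a weakAnd YQL statement."""
    terms = user_text.split()  # naive whitespace tokenization, done outside Vespa
    if not terms:
        raise ValueError("user input contains no terms")

    def quote(term):
        # Assumed escaping: backslashes and double quotes inside a term.
        return term.replace("\\", "\\\\").replace('"', '\\"')

    clauses = ", ".join(f'{field} contains "{quote(t)}"' for t in terms)
    return f"select title,abstract from sources * where weakAnd({clauses});"


def search_url(yql, hits=1, tracelevel=3):
    """Build the full search URL with the YQL statement percent-encoded."""
    return SEARCH_ENDPOINT + "?" + urlencode(
        {"yql": yql, "hits": hits, "tracelevel": tracelevel}
    )


if __name__ == "__main__":
    print(search_url(weak_and_yql("the covid incubation period")))
```

Note that the split on whitespace is the only linguistic processing here; nothing guarantees it matches what Vespa did to the documents.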
The above approach works but is error-prone: the linguistic processing (tokenization) happens outside of Vespa, so it can differ from the processing applied to the documents at indexing time and lead to asymmetric behaviour. Using userInput() allows Vespa to parse the query and do the linguistic processing itself, with less chance of asymmetric tokenization.
https://api.cord19.vespa.ai/search/?yql=select%20title,abstract%20from%20sources+*%20where%20weakAnd(%5B%7B%22grammar%22:%22any%22%7D%5DuserInput(@inputQuery))%3B&hits=1&tracelevel=3&inputQuery=the+covid+incubation+period
However, the parsed weakAnd expression uses the entire input as the first argument of weakAnd, and the totalCount becomes 2x larger than when using weakAnd properly with multiple arguments.
A weakAnd grammar for userInput/userQuery would produce the same query tree as when the input is pre-tokenized by the middle tier:
https://api.cord19.vespa.ai/search/?yql=select%20title,abstract%20from%20sources+*%20where%20weakAnd(%5B%7B%22grammar%22:%22weakAnd%22%7D%5DuserInput(@inputQuery))%3B&hits=1&tracelevel=3&inputQuery=the+covid+incubation+period
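A small sketch of how a client could build both userInput() variants shown in this issue, passing the raw text in a separate `inputQuery` parameter so no client-side tokenization is needed. The function name `user_input_url` is hypothetical; the YQL shape is copied from the URLs above:

```python
import json
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.cord19.vespa.ai/search/"


def user_input_url(user_text, grammar="weakAnd", hits=1, tracelevel=3):
    """Build a search URL that lets Vespa parse the raw user text.

    grammar is passed through as the userInput() annotation,
    for example "any" or "weakAnd".
    """
    annotation = json.dumps([{"grammar": grammar}], separators=(",", ":"))
    yql = (
        "select title,abstract from sources * "
        f"where weakAnd({annotation}userInput(@inputQuery));"
    )
    params = {
        "yql": yql,
        "hits": hits,
        "tracelevel": tracelevel,
        "inputQuery": user_text,  # raw text, tokenized by Vespa
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    for grammar in ("any", "weakAnd"):
        print(user_input_url("the covid incubation period", grammar))
```

Running both URLs with tracelevel=3 and comparing the traced query trees and totalCount is one way to check the behaviour described above.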