If an index has stop_words set in the settings, then any phrase search containing at least one stop word will return an empty response.
Fix
Minimal test to add in /milli/src/tests/search/phrase_search.rs
Below is a gist link to the file phrase_search.rs, which contains a test that showcases the bug: https://gist.github.com/ManyTheFish/f840e37cb2d2e029ce05396b4d540762
Explanation
The bug occurs because a phrase search requires every word of the phrase to be present. However, stop words are not kept during the indexing process, so documents that originally contained stop words cannot be retrieved via a stop word at search time. As a result, a phrase search containing a stop word acts as if it were searching for an unknown word and does not retrieve any document.
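The failure mode described above can be reproduced with a toy model. The sketch below is only an illustration: build_index and phrase_candidates are hypothetical helpers, not milli functions, and the index is a plain word-to-document-ids map that ignores word positions.

```rust
use std::collections::{HashMap, HashSet};

/// Toy inverted index: word -> ids of the documents containing it.
/// Hypothetical stand-in for illustration; this is not milli's index.
fn build_index(docs: &[&str], stop_words: &HashSet<&str>) -> HashMap<String, HashSet<usize>> {
    let mut index: HashMap<String, HashSet<usize>> = HashMap::new();
    for (id, doc) in docs.iter().enumerate() {
        for word in doc.split_whitespace() {
            // Stop words are dropped at indexing time.
            if stop_words.contains(word) {
                continue;
            }
            index.entry(word.to_string()).or_default().insert(id);
        }
    }
    index
}

/// Candidate documents for a phrase when every word must be present.
fn phrase_candidates(index: &HashMap<String, HashSet<usize>>, phrase: &[&str]) -> HashSet<usize> {
    let mut candidates: Option<HashSet<usize>> = None;
    for word in phrase {
        // A word that is missing from the index contributes an empty set,
        // which empties the whole intersection.
        let docids = index.get(*word).cloned().unwrap_or_default();
        candidates = Some(match candidates {
            Some(acc) => acc.intersection(&docids).copied().collect(),
            None => docids,
        });
    }
    candidates.unwrap_or_default()
}
```

With "the" configured as a stop word and the documents "the quick fox" and "quick fox", the phrase "quick fox" matches both documents in this model, while "the quick fox" matches none, because "the" was never indexed.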
Technical approach
Change the Phrase Operation of the query tree from a Vec<String> into a Vec<Option<String>>, where None corresponds to a word classified as a stop word by create_primitive_query(). Then ignore it with a simple filter_map everywhere except in resolve_phrase. resolve_phrase should take into account the distance between words separated by a stop word and should add 1 to the computed distance for each stop word between phrase words.
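A minimal sketch of this approach, under stated assumptions: to_phrase, real_words, and word_pairs_with_distance are hypothetical names introduced for illustration, and the real create_primitive_query() and resolve_phrase in milli have different signatures and work on the actual query tree and index. The sketch only shows the data shape and the distance rule for consecutive phrase words.

```rust
use std::collections::HashSet;

/// Hypothetical helper: build the proposed phrase representation,
/// where None stands in for each stop word.
fn to_phrase(words: &[&str], stop_words: &HashSet<&str>) -> Vec<Option<String>> {
    words
        .iter()
        .map(|&w| if stop_words.contains(w) { None } else { Some(w.to_string()) })
        .collect()
}

/// Everywhere except phrase resolution: drop the None entries with filter_map.
fn real_words(phrase: &[Option<String>]) -> Vec<&str> {
    phrase.iter().filter_map(|w| w.as_deref()).collect()
}

/// For phrase resolution: consecutive real words with their expected distance,
/// which is 1 for adjacent words plus 1 for each stop word between them.
fn word_pairs_with_distance(phrase: &[Option<String>]) -> Vec<(&str, &str, u32)> {
    let mut pairs = Vec::new();
    let mut previous: Option<&str> = None;
    let mut skipped = 0;
    for word in phrase {
        match word.as_deref() {
            Some(current) => {
                if let Some(prev) = previous {
                    pairs.push((prev, current, 1 + skipped));
                }
                previous = Some(current);
                skipped = 0;
            }
            // A stop word widens the gap to the next real word.
            None => skipped += 1,
        }
    }
    pairs
}
```

With "to" and "the" as stop words, the phrase "how to train the dragon" becomes [Some("how"), None, Some("train"), None, Some("dragon")]. The filtered view keeps how, train, dragon, and the pairs (how, train) and (train, dragon) each carry a distance of 2 instead of 1.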