meilisearch / milli

Search engine library for Meilisearch ⚡️
MIT License

Phrase search containing stop words never retrieves any documents #661

Closed ManyTheFish closed 1 year ago

ManyTheFish commented 1 year ago

If an index has stop_words set in its settings, any phrase search containing at least one stop word returns an empty response.

Fix

Minimal test to add in /milli/src/tests/search/phrase_search.rs

The gist below contains a version of phrase_search.rs with a test that showcases the bug: https://gist.github.com/ManyTheFish/f840e37cb2d2e029ce05396b4d540762

Explanation

The bug occurs because a phrase search requires every word of the phrase to be present. However, stop words are not kept during the indexing process, so a document that originally contained a stop word cannot be retrieved through that stop word at search time. As a result, the phrase search behaves as if it were searching for an unknown word and retrieves no documents.
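The mechanism can be reproduced with a toy model. The sketch below is not milli's code: `build_index` and `phrase_candidates` are hypothetical helpers that only mimic the two behaviours described above, namely dropping stop words at indexing time and requiring every phrase word to be present at search time.

```rust
use std::collections::{HashMap, HashSet};

/// Toy inverted index (an assumption for illustration, not milli's structure):
/// word -> ids of the documents containing it.
/// Stop words are dropped at indexing time.
fn build_index(docs: &[&str], stop_words: &HashSet<&str>) -> HashMap<String, HashSet<usize>> {
    let mut index: HashMap<String, HashSet<usize>> = HashMap::new();
    for (doc_id, text) in docs.iter().enumerate() {
        for word in text.split_whitespace() {
            let word = word.to_lowercase();
            if stop_words.contains(word.as_str()) {
                continue; // stop words never reach the index
            }
            index.entry(word).or_default().insert(doc_id);
        }
    }
    index
}

/// Naive phrase candidate selection: every word of the phrase must be present,
/// so the result is the intersection of the per-word document sets.
fn phrase_candidates(index: &HashMap<String, HashSet<usize>>, phrase: &[&str]) -> HashSet<usize> {
    let mut candidates: Option<HashSet<usize>> = None;
    for word in phrase {
        // A word missing from the index (such as a stop word) yields an empty set.
        let docids = index.get(*word).cloned().unwrap_or_default();
        candidates = Some(match candidates {
            None => docids,
            Some(acc) => acc.intersection(&docids).copied().collect(),
        });
    }
    candidates.unwrap_or_default()
}
```

With "the" as a stop word and a single document "beware of the dog", the phrase ["the", "dog"] intersects an empty set with the documents containing "dog", so no candidate survives, while ["dog"] alone still matches.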

Technical approach

Change the Phrase operation of the query tree from a Vec<String> to a Vec<Option<String>>, where None corresponds to a word classified as a stop word by create_primitive_query(). Then ignore the None entries with a simple filter_map everywhere except in resolve_phrase. resolve_phrase should take into account the distance between words separated by stop words, adding 1 to the computed distance for each stop word between two phrase words.
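A minimal sketch of this approach, under stated assumptions: `to_phrase`, `indexed_words` and `word_pairs_with_distance` are hypothetical names, not functions from milli, and only consecutive kept words are paired. The real resolve_phrase may combine words differently; the point here is only the Option representation and the distance adjustment.

```rust
/// Hypothetical stand-in for what create_primitive_query() would produce:
/// stop words become `None`, other words are kept as `Some(word)`.
fn to_phrase(words: &[&str], stop_words: &[&str]) -> Vec<Option<String>> {
    words
        .iter()
        .map(|w| if stop_words.contains(w) { None } else { Some(w.to_string()) })
        .collect()
}

/// Everywhere except phrase resolution: drop the stop-word slots with filter_map.
fn indexed_words(phrase: &[Option<String>]) -> Vec<&str> {
    phrase.iter().filter_map(|w| w.as_deref()).collect()
}

/// For phrase resolution: pair each kept word with the next kept word and the
/// expected distance between them, i.e. 1 for adjacent words plus 1 for each
/// stop word sitting between them.
fn word_pairs_with_distance(phrase: &[Option<String>]) -> Vec<(&str, &str, u32)> {
    let mut pairs = Vec::new();
    let mut previous: Option<&str> = None;
    let mut skipped = 0; // stop words seen since the last kept word
    for slot in phrase {
        match slot.as_deref() {
            None => skipped += 1,
            Some(word) => {
                if let Some(prev) = previous {
                    pairs.push((prev, word, 1 + skipped));
                }
                previous = Some(word);
                skipped = 0;
            }
        }
    }
    pairs
}
```

For the phrase "beware of the dog" with "of" and "the" as stop words, the representation is [Some("beware"), None, None, Some("dog")], the words used outside phrase resolution are "beware" and "dog", and the only pair is ("beware", "dog") with an expected distance of 3.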