meilisearch / milli

Search engine library for Meilisearch ⚡️
MIT License

Enhance word splitting strategy #662

Closed akki1306 closed 1 year ago

akki1306 commented 1 year ago

Pull Request

Related issue

Fixes #648

What does this PR do?

PR checklist

Please check if your PR fulfills the following requirements:

Thank you so much for contributing to Meilisearch!

akki1306 commented 1 year ago

Hi @ManyTheFish, I have addressed the comments. Please take a look.

bors[bot] commented 1 year ago

Build succeeded:

meili-bot commented 1 year ago

This message is sent automatically

Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest, and you would like to receive some gift from Meilisearch too, please complete this form.

akki1306 commented 1 year ago

Thanks @ManyTheFish. I was trying to understand the code around the single use of the word_documents_count method. My understanding is that, when resolving the primitive query part, we consider the word pairs with the highest frequency, whereas when the TermMatchingStrategy is set to Frequency we use word_documents_count. Please let me know whether this is correct, or if you could shed some light on it; I am curious about how the query tree is constructed.

ManyTheFish commented 1 year ago

@akki1306, not really. Sometimes an end user skips a whitespace while typing a search query, probably by oversight. To catch this and spare the end user from rewriting the query, we try to split each searched word based on the frequency of its subwords. For instance, in the word bypassword the end user forgot a space, but where? The function you enhanced has exactly this job: split a word in two in order to find the most relevant results. More precisely, the previous version looked up the frequency of the two subwords independently. Imagine that bypass and word are each more frequent than by and password, but that bypass and word are rarely seen next to each other: the previous version would still choose bypass and word, even though this split has little chance of finding anything relevant. Your modification now takes into account the frequency of the two words appearing next to each other.
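To make the difference concrete, here is a minimal Rust sketch of the two strategies described above. It is an illustration only, not milli's actual code: `ToyIndex`, `from_counts`, `best_split`, `split_independent` and `split_by_pair` are hypothetical names, the counts are kept in plain hash maps instead of the real index, and scoring the independent strategy by the smaller of the two subword counts is an assumption.

```rust
use std::collections::HashMap;

/// Toy stand-in for index statistics (hypothetical, not milli's `Context`).
struct ToyIndex {
    /// Number of documents containing each word.
    word_docs: HashMap<String, u64>,
    /// Number of documents in which `left` is directly followed by `right`.
    pair_docs: HashMap<(String, String), u64>,
}

impl ToyIndex {
    /// Build the toy index from literal counts.
    fn from_counts(words: &[(&str, u64)], pairs: &[(&str, &str, u64)]) -> Self {
        ToyIndex {
            word_docs: words.iter().map(|(w, c)| (w.to_string(), *c)).collect(),
            pair_docs: pairs
                .iter()
                .map(|(l, r, c)| ((l.to_string(), r.to_string()), *c))
                .collect(),
        }
    }

    fn word_count(&self, word: &str) -> u64 {
        self.word_docs.get(word).copied().unwrap_or(0)
    }

    fn pair_count(&self, left: &str, right: &str) -> u64 {
        self.pair_docs
            .get(&(left.to_string(), right.to_string()))
            .copied()
            .unwrap_or(0)
    }
}

/// Try every split point and keep the one with the highest non-zero score.
fn best_split<'a>(
    word: &'a str,
    score: impl Fn(&str, &str) -> u64,
) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;
    // `skip(1)` avoids the empty left part; `char_indices` keeps UTF-8 boundaries valid.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        let s = score(left, right);
        if s > 0 && best.map_or(true, |(old, _, _)| s > old) {
            best = Some((s, left, right));
        }
    }
    best.map(|(_, left, right)| (left, right))
}

/// Independent strategy: score a split by the rarer of its two subwords (assumed scoring).
fn split_independent<'a>(index: &ToyIndex, word: &'a str) -> Option<(&'a str, &'a str)> {
    best_split(word, |l, r| index.word_count(l).min(index.word_count(r)))
}

/// Pair strategy: score a split by how often the two subwords appear side by side.
fn split_by_pair<'a>(index: &ToyIndex, word: &'a str) -> Option<(&'a str, &'a str)> {
    best_split(word, |l, r| index.pair_count(l, r))
}
```

With made-up counts where bypass (50) and word (60) are individually more frequent than by (40) and password (30), but the pair by password (25) is far more common than bypass word (1), `split_independent` returns `("bypass", "word")` for bypassword while `split_by_pair` returns `("by", "password")`.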

The TermMatchingStrategy is another story. It answers the question "In which order do we ignore query words to retrieve more documents?":

In the third case, the frequency that matters is that of the word on its own, not of the word appearing near another word.

akki1306 commented 1 year ago

Thanks for the detailed explanation @ManyTheFish, it really helped improve my understanding of the query tree construction functionality 👍