akki1306 commented 1 year ago

Pull Request

Related issue

Fixes #648

What does this PR do?

Changes split_best_frequency to use the frequency of word pairs that appear next to each other (proximity of 1) instead of the frequencies of the individual words. Among all candidate splits, the word pair with the highest frequency is chosen.

PR checklist

Please check if your PR fulfills the following requirements:

[x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
[x] Have you read the contributing guidelines?
[x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

akki1306 commented 1 year ago

Hi @ManyTheFish, I have addressed the comments. Please take a look.

bors[bot] commented 1 year ago

Build succeeded:

meili-bot commented 1 year ago

This message is sent automatically

Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest, and you would like to receive some gift from Meilisearch too, please complete this form.

akki1306 commented 1 year ago

Thanks @ManyTheFish. I was trying to understand the code around the single remaining use of the word_documents_count method. My understanding is that while resolving a primitive query part we consider the word pair with the highest frequency, whereas if the TermMatchingStrategy is set to Frequency, we use word_documents_count. Please let me know whether this understanding is correct, or if you could shed some light on it. I was curious about how the query tree is constructed.

ManyTheFish commented 1 year ago

@akki1306, not really. Sometimes an end user skips a whitespace while typing a search query, probably by oversight. To catch this oversight and spare the end user from rewriting the query, we try to split each searched word based on the frequency of its subwords. For instance, in the word bypassword, the end user forgot a space, but where? The function you enhanced has this job: split a word in two in order to find the most relevant results. More precisely, the previous version looked up the frequency of the two subwords independently. Let's imagine that bypass and word were both more frequent than by and password, but the frequency of seeing them next to each other is really low: the previous version would choose bypass and word despite the low chance of finding anything relevant with this split. Your modification now takes into account the frequency of the words appearing next to each other.
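As a rough illustration of the idea described above, here is a minimal sketch, not milli's actual implementation. The PairCounts map and the function name split_best_pair_frequency are hypothetical stand-ins for the index lookup and for the enhanced function; the sketch only assumes that some lookup returns how many documents contain the left part immediately followed by the right part.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the index lookup: number of documents in which
/// `left` is immediately followed by `right` (proximity of 1).
type PairCounts = HashMap<(String, String), u64>;

/// Try every split point of `word` and keep the split whose two halves
/// appear next to each other in the most documents.
/// Returns `None` when no split is found in `pairs`.
fn split_best_pair_frequency(word: &str, pairs: &PairCounts) -> Option<(String, String)> {
    let mut best: Option<(u64, &str, &str)> = None;
    // Skip the first char boundary so that the left half is never empty.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        let count = pairs
            .get(&(left.to_string(), right.to_string()))
            .copied()
            .unwrap_or(0);
        // Keep only splits that exist, and prefer the most frequent pair.
        if count > 0 && best.map_or(true, |(max, _, _)| count > max) {
            best = Some((count, left, right));
        }
    }
    best.map(|(_, left, right)| (left.to_string(), right.to_string()))
}
```

With made-up counts where the pair (by, password) occurs in 40 documents and (bypass, word) in only 1, the function returns by and password, even if bypass and word are individually more frequent.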

The TermMatchingStrategy is another story. It answers the question "In which order do we ignore query words to retrieve more documents?":

from the last typed word to the first?
from the first typed one to the last?
from the most frequent word to the least?
from the shortest word to the longest?

In the third case, the relevant frequency is that of the word alone, not of the word next to another word.
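To make the four orderings concrete, here is a small sketch under stated assumptions. The DropOrder enum and the drop_order function are hypothetical names invented for this example and do not mirror milli's actual TermMatchingStrategy definition; word_counts is a made-up stand-in for a per-word document count such as the one word_documents_count provides.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Hypothetical enum mirroring the four orderings listed above.
#[derive(Clone, Copy)]
enum DropOrder {
    LastToFirst,
    FirstToLast,
    MostFrequentFirst,
    ShortestFirst,
}

/// Return the query words in the order in which they would be ignored.
/// `word_counts` maps a single word to its (made-up) document count.
fn drop_order<'a>(
    words: &[&'a str],
    order: DropOrder,
    word_counts: &HashMap<String, u64>,
) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = words.to_vec();
    match order {
        DropOrder::LastToFirst => out.reverse(),
        DropOrder::FirstToLast => {}
        // Frequency of each word alone; unknown words count as 0.
        DropOrder::MostFrequentFirst => {
            out.sort_by_key(|w| Reverse(word_counts.get(*w).copied().unwrap_or(0)));
        }
        // Stable sort: words of equal length keep their typed order.
        DropOrder::ShortestFirst => out.sort_by_key(|w| w.chars().count()),
    }
    out
}
```

Only the MostFrequentFirst branch consults the per-word counts, which is why the frequency of a word on its own still matters there even though the split function works on pair frequencies.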

akki1306 commented 1 year ago

Thanks for the detailed explanation @ManyTheFish, it really helped improve my understanding of the query tree construction functionality 👍

meilisearch / milli

Enhance word splitting strategy #662
