olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Odd results when using Lunr v2.3.0 with lunr-languages #358

Closed yeraydiazdiaz closed 6 years ago

yeraydiazdiaz commented 6 years ago

Hi Oliver,

While porting Lunr.py to include the recent changes in Lunr v2.3.0, I noticed a big difference in the results when using lunr-languages on the same corpus compared to v2.2.1 and below.

I put together a couple of CodeSandboxes to showcase the issue; in both I'm using the same corpus of Spanish documents and searching for the same term, "imperio":
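For reference, a minimal sketch of the kind of setup used in the sandboxes; the documents and field names here are placeholders standing in for the real corpus, not the actual data:

```javascript
const lunr = require("lunr");
require("lunr-languages/lunr.stemmer.support")(lunr);
require("lunr-languages/lunr.es")(lunr);

const idx = lunr(function () {
  this.use(lunr.es); // Spanish trimmer, stop word filter and stemmer
  this.ref("id");
  this.field("title");
  this.field("text");

  // Placeholder documents standing in for the real corpus (d, e, f, g).
  [
    { id: "d", title: "...", text: "..." },
    { id: "e", title: "...", text: "..." },
  ].forEach((doc) => this.add(doc));
});

console.log(idx.search("imperio"));
```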

Note both the order and the scores are different. The difference in the scores is to be expected, but the order of the documents is what's interesting. Here's a summary of the term presence in the corpus:

v2.2.1 returns e, d, f, g. v2.3.0 returns d, f, g, e.

Note e scores last on v2.3.0, which is unexpected.

The results in English between the JavaScript and Python versions still match closely in v2.3.0; it's only when using language support that they diverge. Lunr.py targeting v2.3.0 with language support returns results closer to v2.2.1:

{'ref': 'd', 'score': 1.047, 'match_data': <MatchData "imperi">},
{'ref': 'e', 'score': 0.738, 'match_data': <MatchData "imperi">},
{'ref': 'f', 'score': 0.732, 'match_data': <MatchData "imperi">},
{'ref': 'g', 'score': 0.679, 'match_data': <MatchData "imperi">}

Though I'm not sure why e scores so low.

/cc @MihaiValentin

olivernn commented 6 years ago

This is the kind of thing that always makes me hesitant to modify how documents are scored! It's so difficult to actually test relevancy.

I took a look at the results and I don't think "imperio" is being stemmed the way you think it is. If you look at the match data for result "e" you'll see that the term "imperi" is only found in the title. Whether that means it should appear last is debatable, though. At a guess I would say that the title should carry more weight, but that should probably be handled by field boosts.
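A sketch of what that could look like, using a build-time field boost (the boost value is arbitrary), plus how to inspect where a term matched via a result's matchData:

```javascript
const idx = lunr(function () {
  this.use(lunr.es);
  this.ref("id");
  this.field("title", { boost: 10 }); // matches in the title count for more
  this.field("text");
  // ...add documents as before
});

const results = idx.search("imperio");
// matchData.metadata is keyed by matched term, then by the field it was found in,
// e.g. { imperi: { title: {} } } for a title-only match like "e".
console.log(results[0].matchData.metadata);
```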

I'm pretty sure nothing stemmer-related changed in the newest Lunr release, so the difference must be down to the scoring changes, which make scores less sensitive to the length of a field. I'm not yet sure if that is good or bad.
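If that length sensitivity turns out to matter, the builder exposes BM25-style tuning parameters; a sketch, assuming the values shown are the defaults:

```javascript
const idx = lunr(function () {
  this.b(0.75); // field length normalisation: 0 ignores field length, 1 applies it fully
  this.k1(1.2); // controls how quickly repeated occurrences of a term stop adding score
  this.ref("id");
  this.field("title");
  this.field("text");
});
```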

There are some tests in Lunr that try to ensure results are ranked correctly, but testing search relevancy is quite hard. For a while now I've been looking for a better dataset to use for these tests; a dataset that includes queries and expected rankings is not easy to come by, however.

P.S. Thank you for providing an interesting set of test data, I enjoyed trying to recall my high school Spanish when reading them!

yeraydiazdiaz commented 6 years ago

Glad you liked the test data 😄 I figured it'd be easier to understand the text if it was something familiar.

You're right. It turns out I expected the stemmer to turn imperial and imperiales, the two instances in e, into imperi, which is the stem of imperio. It doesn't, hence e scoring lower.
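A quick way to sanity-check this, as a sketch, assuming lunr-languages exposes the Spanish stemmer as the pipeline function lunr.es.stemmer (the comments reflect the behaviour described above, not verified output):

```javascript
// Run a single word through the Spanish stemmer pipeline function.
const stem = (word) => lunr.es.stemmer(new lunr.Token(word)).toString();

console.log(stem("imperio"));    // "imperi", matching the match data above
console.log(stem("imperiales")); // does not reduce to "imperi", hence the lower score for "e"
```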

After tracing both versions it appears the changes in Vector.similarity account for all of the difference. Given that the results in v2.3.0 are actually more coherent, I'm going to close this.

Thanks for your help 🙂

olivernn commented 6 years ago

Yeah, I think the way the similarity method worked before was probably wrong. I mean, it worked, but it made boosts, either at query time or build time, mostly ineffective.
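For completeness, a sketch of the two kinds of boost being referred to (the weights are arbitrary); note that the programmatic query API does not run terms through the pipeline, so the term there should already be the stemmed form:

```javascript
// Query-time boost using the query string syntax:
idx.search("imperio^10");

// Or programmatically; idx.query() bypasses the pipeline, so pass the stemmed term:
idx.query(function (q) {
  q.term("imperi", { fields: ["title"], boost: 10 });
});

// Build-time boost is set when defining fields, e.g. this.field("title", { boost: 10 }).
```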

I'm glad the new scoring/ranking seems to be more coherent, let me know if anything else crops up while you're upgrading Lunr.py.

yeraydiazdiaz commented 6 years ago

That was the only "issue" actually, everything else went smoothly. Just released 0.4.0 targeting Lunr.js 2.3.0 🙂