olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Odd results when using Lunr v2.3.0 with lunr-languages #358

Closed yeraydiazdiaz closed 6 years ago

yeraydiazdiaz commented 6 years ago

Hi Oliver,

While porting Lunr.py to include the recent changes in Lunr v2.3.0, I noticed a big difference in the results when using lunr-languages on the same corpus compared to v2.2.1 and below.

I put together a couple of CodeSandboxes to showcase the issue; in both I'm using the same corpus of Spanish documents and searching for the same term, "imperio":
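For reference, a minimal sketch of the kind of setup used in the sandboxes; the documents and field names here are placeholders standing in for the real corpus, not the actual data:

```javascript
const lunr = require("lunr");
require("lunr-languages/lunr.stemmer.support")(lunr);
require("lunr-languages/lunr.es")(lunr);

const idx = lunr(function () {
  this.use(lunr.es); // Spanish trimmer, stop word filter and stemmer
  this.ref("id");
  this.field("title");
  this.field("text");

  // Placeholder documents standing in for the real corpus (d, e, f, g).
  [
    { id: "d", title: "...", text: "..." },
    { id: "e", title: "...", text: "..." },
  ].forEach((doc) => this.add(doc));
});

console.log(idx.search("imperio"));
```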

Note both the order and the scores are different. The difference in the scores is to be expected, but the order of the documents is what's interesting. Here's a summary of the term presence in the corpus:

v2.2.1 returns e, d, f, g. v2.3.0 returns d, f, g, e.

Note e scores last on v2.3.0, which is unexpected.

The results in English between the JavaScript and Python versions still match closely in v2.3.0; it's only when using language support that they diverge. Lunr.py targeting v2.3.0 with language support returns results closer to v2.2.1:

{'ref': 'd', 'score': 1.047, 'match_data': <MatchData "imperi">},
{'ref': 'e', 'score': 0.738, 'match_data': <MatchData "imperi">},
{'ref': 'f', 'score': 0.732, 'match_data': <MatchData "imperi">},
{'ref': 'g', 'score': 0.679, 'match_data': <MatchData "imperi">}

Though I'm not sure why e scores so low.

/cc @MihaiValentin

olivernn commented 6 years ago

This is the kind of thing that always makes me hesitant to modify how documents are scored! It's so difficult to actually test relevancy.

I took a look at the results and I don't think "imperio" is being stemmed the way you think it is. If you look at the match data for result "e" you'll see that the term "imperi" is only found in the title. Whether that means it should appear last is debatable, though. At a guess I would say that the title should carry more weight, but that should probably be handled by field boosts.
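A sketch of what that could look like, using a build-time field boost (the boost value is arbitrary), plus how to inspect where a term matched via a result's matchData:

```javascript
const idx = lunr(function () {
  this.use(lunr.es);
  this.ref("id");
  this.field("title", { boost: 10 }); // matches in the title count for more
  this.field("text");
  // ...add documents as before
});

const results = idx.search("imperio");
// matchData.metadata is keyed by matched term, then by the field it was found in,
// e.g. { imperi: { title: {} } } for a title-only match like "e".
console.log(results[0].matchData.metadata);
```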

I'm pretty sure nothing stemmer-related changed in the newest Lunr release, so the difference must be down to the scoring changes, which make scores less sensitive to the length of a field. I'm not yet sure if that is good or bad.
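If that length sensitivity turns out to matter, the builder exposes BM25-style tuning parameters; a sketch, assuming the values shown are the defaults:

```javascript
const idx = lunr(function () {
  this.b(0.75); // field length normalisation: 0 ignores field length, 1 applies it fully
  this.k1(1.2); // controls how quickly repeated occurrences of a term stop adding score
  this.ref("id");
  this.field("title");
  this.field("text");
});
```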

There are some tests in Lunr that try to ensure results are ranked correctly, but testing search relevancy is quite hard. For a while now I've been looking for a better dataset to use for these tests; a dataset that includes queries and expected rankings is not easy to come by, however.

P.S. Thank you for providing an interesting set of test data, I enjoyed trying to recall my high school Spanish when reading them!

yeraydiazdiaz commented 6 years ago

Glad you liked the test data 😄 I figured it'd be easier to understand the text if it was something familiar.

You're right. It turns out I expected the stemmer to turn imperial and imperiales, the two instances in e, into imperi, which is the stem of imperio. It doesn't, hence e scoring lower.
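A quick way to sanity-check this, as a sketch, assuming lunr-languages exposes the Spanish stemmer as the pipeline function lunr.es.stemmer (the comments reflect the behaviour described above, not verified output):

```javascript
// Run a single word through the Spanish stemmer pipeline function.
const stem = (word) => lunr.es.stemmer(new lunr.Token(word)).toString();

console.log(stem("imperio"));    // "imperi", matching the match data above
console.log(stem("imperiales")); // does not reduce to "imperi", hence the lower score for "e"
```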

After tracing both versions it appears the changes in Vector.similarity account for all of the difference. Given that the results in v2.3.0 are actually more coherent, I'm going to close this.

Thanks for your help 🙂

olivernn commented 6 years ago

Yeah, I think the way the similarity method worked before was probably wrong. I mean, it worked, but it made boosts, either at query time or build time, mostly ineffective.
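For completeness, a sketch of the two kinds of boost being referred to (the weights are arbitrary); note that the programmatic query API does not run terms through the pipeline, so the term there should already be the stemmed form:

```javascript
// Query-time boost using the query string syntax:
idx.search("imperio^10");

// Or programmatically; idx.query() bypasses the pipeline, so pass the stemmed term:
idx.query(function (q) {
  q.term("imperi", { fields: ["title"], boost: 10 });
});

// Build-time boost is set when defining fields, e.g. this.field("title", { boost: 10 }).
```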

I'm glad the new scoring/ranking seems to be more coherent, let me know if anything else crops up while you're upgrading Lunr.py.

yeraydiazdiaz commented 6 years ago

That was the only "issue" actually, everything else went smoothly. Just released 0.4.0 targeting Lunr.js 2.3.0 🙂