@therji What's going on here is that the word "Carrot" is treated as less significant by the algorithm, because it occurs in more documents. It's kind of like searching for the word "the" (ignoring the fact that "the" is an English stopword): it occurs so frequently within English documents that matching "the" in a query contributes very little towards a document's score.
That was my guess too, but even if Carrot is discounted, shouldn't it still be a net positive? Put another way, id 3 is a superset of id 2, so I still don't understand why it ranks lower, unless Apple and Banana were discounted in addition to Carrot, but only for id 3.
I believe @hoelzro is correct: the IDF score for carrot is quite low because it takes into consideration the term's presence in all the fields of the documents. In this case documentsWithTerm would be 4, since it's found in the title of one document and the body of all three:
(3 - 4 + 0.5) / (4 + 0.5) = -0.11111
log(1 + abs(-0.1111)) = 0.10536051565782635
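To make the arithmetic above easy to check, here is a minimal sketch of that IDF calculation in JavaScript. The function name `idfSketch` and its parameter names are my own for illustration and are not taken from lunr's source; it only reproduces the formula as quoted in this thread.

```javascript
// Hypothetical helper: reproduces the IDF formula quoted in this thread.
// documentCount     - total number of documents in the index (3 here)
// documentsWithTerm - number of (field, document) postings containing the
//                     term, summed across fields (4 here: 1 title + 3 bodies)
function idfSketch(documentCount, documentsWithTerm) {
  const x = (documentCount - documentsWithTerm + 0.5) / (documentsWithTerm + 0.5);
  // The absolute value keeps the argument of log above 1 even when x is negative.
  return Math.log(1 + Math.abs(x));
}

// "carrot" in the example: 3 documents, 4 postings across title and body.
console.log(idfSketch(3, 4)); // ≈ 0.105, matching the figure above
```

Because of the absolute value, the result stays positive even though the term's posting count exceeds the document count, so a match on "carrot" still adds to the score, just very little.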
I believe the difference in versions is a product of this change.
You have to consider that Lunr is built for natural language documents. Your example is quite specific, so this apparently unintuitive result is not a reflection on the quality of the library.
@therji I believe lunr also scales scores based on the length of the matching field - longer fields get a lower score.
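As a rough illustration of how field length can pull a score down, here is the textbook BM25 term-frequency weighting. I'm assuming lunr uses something of this shape; the function name `tfWeightSketch` is hypothetical, and `k1 = 1.2` and `b = 0.75` are the commonly cited BM25 defaults rather than values confirmed anywhere in this thread.

```javascript
// Hypothetical helper: textbook BM25 term-frequency weighting with length
// normalization. Not copied from lunr; k1 and b are assumed defaults.
function tfWeightSketch(tf, fieldLength, avgFieldLength, k1 = 1.2, b = 0.75) {
  // Fields longer than average inflate the denominator and lower the weight.
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  return ((k1 + 1) * tf) / (k1 * lengthNorm + tf);
}

// Same term frequency, different field lengths (average length assumed to be 5).
const shortField = tfWeightSketch(1, 3, 5);
const longField = tfWeightSketch(1, 10, 5);
console.log(shortField > longField); // true: the longer field scores lower
```

Under these assumptions, a document that matches one extra low-IDF term but has a longer field can lose more from length normalization than it gains from the extra match.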
Thanks, makes more sense now.
Here is the example:
I would expect id 3 to be higher than id 2. However, the order I get back is 2,3,1.
I did some digging, and it seems like it was ordered 3,2,1 back in version 2.2.1, but changed to 2,3,1 in version 2.3.0. I don't understand the algorithm well enough, but it seems hard to justify ranking id 2 higher than id 3.
JSFiddle for v2.3.6 (latest): https://jsfiddle.net/7rbx3asu
JSFiddle for v2.2.1: https://jsfiddle.net/7rbx3asu/3/