olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Strange results in exact match #394

Closed: therji closed this issue 5 years ago

therji commented 5 years ago

Here is the example:

var idx = lunr(function () {
  this.field('title', {boost: 10})
  this.field('body', {boost: 1})

  this.add({
    "title": "Apple",
    "body": "Apple Banana Carrot",
    "id": "1"
  });
  this.add({
    "title": "Apple Banana",
    "body": "Apple Banana Carrot",
    "id": "2"
  });
  this.add({
    "title": "Apple Banana Carrot",
    "body": "Apple Banana Carrot",
    "id": "3"
  });
})

var results = idx.search('Apple Banana Carrot');

I would expect id 3 to rank higher than id 2. However, the order I get back is 2, 3, 1.

I did some digging, and it seems the results were ordered 3, 2, 1 back in version 2.2.1 but changed to 2, 3, 1 in version 2.3.0. I don't understand the algorithm well enough to say for sure, but it seems hard to justify ranking id 2 higher than id 3.

JSFiddle for v2.3.6 (latest): https://jsfiddle.net/7rbx3asu
JSFiddle for v2.2.1: https://jsfiddle.net/7rbx3asu/3/
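(For reference, the scores behind that ordering can be printed directly. This is a minimal sketch assuming lunr 2.3.x, where each search result exposes a ref and a score.)

var results = idx.search('Apple Banana Carrot');

results.forEach(function (result) {
  // result.ref is the document id, result.score its relevance score;
  // results come back sorted by descending score, i.e. 2, 3, 1 here
  console.log(result.ref, result.score);
});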

hoelzro commented 5 years ago

@therji What's going on here is that the word "Carrot" is treated as less significant by the algorithm because it occurs in more documents. It's a bit like searching for the word "the" (ignoring the fact that "the" is an English stopword): it occurs so frequently in English documents that matching "the" in a query contributes very little towards a document's score.

therji commented 5 years ago

That was my guess too, but even if Carrot is discounted, shouldn't it still be a net positive? To put it another way, id 3 is a superset of id 2, so I still don't understand the ordering, unless Apple and Banana were discounted in addition to Carrot, but only for id 3?

yeraydiazdiaz commented 5 years ago

I believe @hoelzro is correct: the IDF score for carrot is quite low because it takes into account the term's presence in every field of every document. In this case documentsWithTerm would be 4, since carrot is found in the title of one document and in the body of all three:

(3 - 4 + 0.5) / (4 + 0.5) = -0.11111
log(1 + abs(-0.1111)) = 0.10536051565782635
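For illustration, here is a minimal sketch of that IDF calculation as a function. The formula is just the one worked through above; idf, documentCount and documentsWithTerm are illustrative names rather than necessarily lunr's internals:

// IDF as computed above: log(1 + |(N - n + 0.5) / (n + 0.5)|),
// where N is the number of documents and n the number of fields containing the term
function idf(documentCount, documentsWithTerm) {
  var x = (documentCount - documentsWithTerm + 0.5) / (documentsWithTerm + 0.5);
  return Math.log(1 + Math.abs(x));
}

idf(3, 4); // 0.10536... -- "carrot": the title of one document plus the body of all three
idf(3, 1); // 0.98083... -- a term found in only one field would score much higher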

I believe the difference in versions is a product of this change.

You have to consider that Lunr is built for natural language documents; your example is quite specific, so this apparently unintuitive result is not a reflection of the quality of the library.

hoelzro commented 5 years ago

@therji I believe lunr also scales scores based on the length of the matching field - longer fields get a lower score.
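For illustration, here is a rough sketch of that length scaling under BM25-style scoring, using the usual k1 and b parameters (lunr's builder exposes k1 and b; defaults of 1.2 and 0.75 are assumed here). This is not lunr's exact code path, just the shape of the effect:

// The (1 - b + b * fieldLength / averageFieldLength) factor grows with field
// length, so longer fields shrink a term's contribution to the score.
function termScore(idf, tf, fieldLength, averageFieldLength, k1, b) {
  k1 = k1 === undefined ? 1.2 : k1;
  b = b === undefined ? 0.75 : b;
  var lengthNorm = 1 - b + b * (fieldLength / averageFieldLength);
  return idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Same term, same frequency, but the longer title scores lower:
termScore(0.105, 1, 2, 2); // title "Apple Banana" (length 2, the average title length)
termScore(0.105, 1, 3, 2); // title "Apple Banana Carrot" (length 3) -- smaller result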

therji commented 5 years ago

Thanks, that makes more sense now.