lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License
4.67k stars 133 forks source link

Exact Match Scoring #204

Closed oasiz closed 1 year ago

oasiz commented 1 year ago

Take the following example:

let ms = new MiniSearch({
  fields: ['title', 'artist'], // fields to index for full-text search
  storeFields: ['title', 'artist'] // fields to return with search results
})

let docs = [{
    id: 1,
    artist: "Queen",
    title: "Don't Stop Me Now"
  },
  {
    id: 2,
    artist: "ABBA",
    title: "Dancing Queen"
  }
]

ms.addAll(docs)

let result = ms.search('Queen');

console.log(result);

Results:

[{
  artist: "ABBA",
  id: 2,
  match: {
    queen: ["title"]
  },
  score: 1.1753365235581683,
  terms: ["queen"],
  title: "Dancing Queen"
}, {
  artist: "Queen",
  id: 1,
  match: {
    queen: ["artist"]
  },
  score: 1.0397207708399179,
  terms: ["queen"],
  title: "Don't Stop Me Now"
}]

Any reason why "Dancing Queen" scores higher than the exact match "Queen", and is there a way to favour exact matches?

As always, thanks for your help!

lucaong commented 1 year ago

Hi @oasiz , the reason is due to the BM25+ scoring algorithm taking into consideration the field length in the matching document versus the average length of the field in all documents. In this case, the average length of the artist field is 1 term, while the average length of the title field is 3.5. Simplifying, "queen" matches 1 out of 2 terms in "Dancing Queen", but on average the artist field 3.5 terms long, so this match gets a boost, while the match in title does not get such boost, as the length is the same as the average.

The formula is more complex than that (you can read about it here), but the essential point is that a match in a field that is shorter than the average gets a boost (see the |D|/avgdl factor in the denominator of the formula).

If you add more documents, changing the average length of fields, you'll see that the order of results changes, even if the results are the same:

ms.addAll([
  { id: 3, artist: 'Red Hot Chili Peppers', title: 'Californication' },
  { id: 4, artist: 'Last Shadow Puppet', title: 'Miracle Aligner' }
])

ms.search('Queen')
// =>
// [
//   {
//     id: 1,
//     score: 2.1301057307305027,
//     terms: [ 'queen' ],
//     match: { queen: [Array] },
//     title: "Don't Stop Me Now",
//     artist: 'Queen'
//   },
//   {
//     id: 2,
//     score: 1.905500265114277,
//     terms: [ 'queen' ],
//     match: { queen: [Array] },
//     title: 'Dancing Queen',
//     artist: 'ABBA'
//   }
// ]

Note that in this case the effect of adding documents is more pronounced because there are so few documents in the index. On larger indices, the impact of adding more documents would be less prominent.

In theory, the parameters of the BM25+ scoring algorithm can be tweaked. For example, setting the b parameter to a lower value would give less importance to the field length. In practice though, I would not advise to change them, unless one really understands the implications, because it’s not trivial to optimize such parameters without making the results worse in other cases.

lucaong commented 1 year ago

Hi @oasiz I will close this issue for now as I think the question was answered, but feel free to comment on it, and I will reopen it if necessary.