Closed oasiz closed 1 year ago
Hi @oasiz ,
the reason is due to the BM25+ scoring algorithm taking into consideration the field length in the matching document versus the average length of the field in all documents. In this case, the average length of the artist
field is 1 term, while the average length of the title
field is 3.5. Simplifying, "queen"
matches 1 out of 2 terms in "Dancing Queen"
, but on average the artist
field 3.5 terms long, so this match gets a boost, while the match in title
does not get such boost, as the length is the same as the average.
The formula is more complex than that (you can read about it here), but the essential point is that a match in a field that is shorter than the average gets a boost (see the |D|/avgdl
factor in the denominator of the formula).
If you add more documents, changing the average length of fields, you'll see that the order of results changes, even if the results are the same:
ms.addAll([
{ id: 3, artist: 'Red Hot Chili Peppers', title: 'Californication' },
{ id: 4, artist: 'Last Shadow Puppet', title: 'Miracle Aligner' }
])
ms.search('Queen')
// =>
// [
// {
// id: 1,
// score: 2.1301057307305027,
// terms: [ 'queen' ],
// match: { queen: [Array] },
// title: "Don't Stop Me Now",
// artist: 'Queen'
// },
// {
// id: 2,
// score: 1.905500265114277,
// terms: [ 'queen' ],
// match: { queen: [Array] },
// title: 'Dancing Queen',
// artist: 'ABBA'
// }
// ]
Note that in this case the effect of adding documents is more pronounced because there are so few documents in the index. On larger indices, the impact of adding more documents would be less prominent.
In theory, the parameters of the BM25+ scoring algorithm can be tweaked. For example, setting the b
parameter to a lower value would give less importance to the field length. In practice though, I would not advise to change them, unless one really understands the implications, because it’s not trivial to optimize such parameters without making the results worse in other cases.
Hi @oasiz I will close this issue for now as I think the question was answered, but feel free to comment on it, and I will reopen it if necessary.
Take the following example:
Results:
Any reason why "Dancing Queen" scores higher than the exact match "Queen", and is there a way to favour exact matches?
As always, thanks for your help!