olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Odd bug in results sorting #74

Closed scottlet closed 7 years ago

scottlet commented 10 years ago

Hi, I've got about 9000 food items in an array. I'm wanting to use lunr to match results and order them. So far so good.

Having tried it in Node, I'm getting an error. I thought I'd try it using the front end and I get the same error. Namely, searching for "bread" brings back "seafood breader" first, then "breadfruit" and then finally "bread". I'd expect "bread" to be first...

I've uploaded my test case including the data here: http://03sq.net/lunr-test/ as I'm not sure if I've done something obviously wrong or if this is a bug or how to debug it :)

olivernn commented 10 years ago

Firstly, thanks for a great bug report; having a test case like the one you have put together makes it so much easier to diagnose the issue.

I'll try and describe what is happening here, hopefully it makes sense!

When you search for bread, lunr treats that as a term with an implicit wildcard at the end, i.e. bread*. This term is expanded into the following terms: ["bread", "breadcrumb", "breadstick", "breader", "breadfruit"]. You can see this for yourself by calling idx.tokenStore.expand('bread'). These expanded terms are then the ones used to find matching documents.
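To make the expansion step concrete, here is a minimal sketch in plain JavaScript. It does not use lunr's token store: expandPrefix and the tokens array are hypothetical stand-ins, and a simple filter takes the place of whatever structure lunr uses internally.

```javascript
// Hypothetical stand-in for the tokens held in an index.
const tokens = ['bread', 'breadcrumb', 'breadstick', 'breader', 'breadfruit', 'seafood', 'butter']

// Return every known token that starts with the given prefix,
// which is the effect of treating "bread" as "bread*".
function expandPrefix(prefix, allTokens) {
  return allTokens.filter(function (token) {
    return token.startsWith(prefix)
  })
}

console.log(expandPrefix('bread', tokens))
```

Every token returned here becomes a search term in its own right, which is how "seafood breader" ends up competing with plain "bread".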

lunr uses TF-IDF to rank how similar a document and a search term are. The IDF part of this, inverse document frequency, penalises tokens that are common in the corpus (the total collection of documents). In the case of your index the token bread appears a total of 87 times, whereas the token breader appears only once. Again, you can check this using the following snippet: Object.keys(idx.tokenStore.get('bread')).length. You can see the effect this has on each term's IDF score: bread has a value of 6.4740077904950954 whilst breader does much better at 10.93991590914968, calculated using idx.idf(token).
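The sketch below shows why a rare token outscores a common one. It assumes the common formulation idf = 1 + ln(N / df), which may not match lunr's exact implementation. The corpus size is hypothetical; only the counts 87 and 1 come from the comment above.

```javascript
// Assumed formula: idf = 1 + ln(totalDocs / docsContainingToken).
// A common TF-IDF variant, used here for illustration only.
function idf(totalDocs, docsContainingToken) {
  if (docsContainingToken <= 0) return 1
  return 1 + Math.log(totalDocs / docsContainingToken)
}

// Hypothetical corpus size; 87 and 1 are the counts quoted in the thread.
const totalDocs = 10000
const commonIdf = idf(totalDocs, 87) // "bread"
const rareIdf = idf(totalDocs, 1)    // "breader"

console.log(commonIdf < rareIdf) // true: the rarer token gets the larger weight
```

Under this assumption, a token that appears in every document gets the minimum weight of 1, and the weight grows as the token becomes rarer.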

There are measures in place to try to ensure that exact matches get a score boost; however, the boost isn't large enough in your use case.

This is an issue that has cropped up before. I think in your case it may be amplified by the small size of the documents.

As for a solution, at the moment I'm not sure. There are a couple of issues like this that have prompted me to take a closer look at how scoring/ranking of search results is calculated. I don't have anything definitive yet, but these are problems I'd like to solve.

A potential work-around for you is to disable the IDF calculation. This can be achieved fairly simply (though via monkey freedom patching):

idx.idf = function () { return 1 }

I'll definitely keep this in mind for upcoming releases, though. Perhaps there could be a simpler way to disable the IDF calculation; I'll have a think.

micahbolen commented 10 years ago

+1 for highly educational discourse

scottlet commented 10 years ago

Thanks for this! Perhaps there might be a way to intelligently guess after indexing the data what the shape of the data is and enable or disable the IDF calculation accordingly as well as creating some kind of option to turn it on and off.

I’ll have a try with your monkeyfreedompatch when I get back to the office tomorrow :)

Very impressed with Lunr so far!

olivernn commented 7 years ago

The latest version of Lunr (v2) no longer automatically inserts wildcards at the end of queries. A search for "bread" will not return any results for "seafood breader" or "breadfruit". Wildcards are still supported, but must be explicit. To re-create the behaviour in this issue you would have to search for "bread*".
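As a rough illustration of the difference, the sketch below treats a trailing * as an explicit prefix wildcard and otherwise requires an exact token match. It is plain JavaScript, not lunr's query parser: matchTokens and indexedTokens are hypothetical, and only a trailing wildcard is handled.

```javascript
// Hypothetical token list; not taken from a real index.
const indexedTokens = ['bread', 'breader', 'breadfruit', 'butter']

// A trailing "*" means prefix match; without it, only an exact match counts.
function matchTokens(query, allTokens) {
  if (query.endsWith('*')) {
    const prefix = query.slice(0, -1)
    return allTokens.filter(function (token) { return token.startsWith(prefix) })
  }
  return allTokens.filter(function (token) { return token === query })
}

console.log(matchTokens('bread', indexedTokens))  // [ 'bread' ]
console.log(matchTokens('bread*', indexedTokens)) // [ 'bread', 'breader', 'breadfruit' ]
```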

So, it only took me 37 months to fix this issue, not bad!