Open DannyNemer opened 11 years ago
What version of lunr are you using, and what browser are you seeing these results in? If you could provide some sample code to reproduce these, that would be great! I tried to recreate these results and failed: http://jsfiddle.net/ZDk8g/
I think in this example "yes" and "ye" are actually stored in the index in the exact same way, since "yes" is stemmed by the default stemmer to "ye".
I am using v0.3.3 with node.js. I apologize, the issue only occurs for the particular example I used above when the stemmer is disabled. Here are two better examples of the issue occurring when lunr is in its default configuration: http://jsfiddle.net/z6Dm6/. For these two examples, the results are the same regardless of whether the stemmer and the stop word filter are enabled or disabled.
Thanks for these examples, makes it much easier for me to understand the problem you are having. I'll take a look and see what is going on…
I've taken a closer look into this and it is down to how the similarity score is calculated.
Firstly, I think there is a possible change to how non-exact matches are scored: it should take into account how dissimilar the token is to the word it is being expanded into, e.g. "hell" is more similar to "hello" than "he" is.
The other problem is in calculating the similarity score. It is done by treating the search query and the documents as vectors and calculating the angle between them. As in your example, if each word differs from the document word by the same amount it is seen as being more similar than a query where only one word differs from the result. This currently means that there are cases, as you've found, where the scores seem slightly off.
I think this is more of a problem with shorter documents, like in your example. I need to do a bit more investigation, though, and see what, if anything, can be done to resolve the problem.
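To make the vector explanation above concrete, here is a minimal sketch of a cosine similarity calculation. The weights are made up for illustration (1 for an exact term, 0.5 for a partially matched term) and are not the values lunr actually computes; the function and variable names are likewise hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as "no similarity".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumed weights for a two-word document: both terms present in full.
const documentVector = [1, 1];

// Query with a partial first word and a complete second word.
const oneTermPartial = [0.5, 1];

// Query where both words are partial by the same amount.
const bothTermsPartial = [0.5, 0.5];

const scoreOnePartial = cosineSimilarity(oneTermPartial, documentVector);
const scoreBothPartial = cosineSimilarity(bothTermsPartial, documentVector);

console.log(scoreOnePartial);  // roughly 0.95
console.log(scoreBothPartial); // roughly 1
```

Because cosine similarity only measures the angle between vectors, scaling every component by the same factor leaves the score unchanged. Under these assumed weights, the query in which both words are partial points in the same direction as the document and outscores the query with one exact word.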
I am experiencing a very similar, yet new issue in v0.4 (the issue above persists). The new issue is demonstrated here: http://jsfiddle.net/7eGpQ
As shown, my index has the documents 'photo' and 'photograph'. When searching with the queries 'p', 'ph', 'pho', 'phot', 'photo', and 'photogr', I receive inconsistent and unexpected scores (which I describe in my comments in the Fiddle). Finally, when searching for 'photo', not only do I not receive a perfect score, I receive a score lower than all previous queries.
Thank you very much for your excellent work. Lunr is fantastic.
I found another, similar instance that produces inconsistent and unexpected scores: http://jsfiddle.net/FZREx
As noted in the comments in the Fiddle, shortening the text of other documents in the index yields a lower score for the document being searched for. This makes sense because the search query is now more similar to other documents in the index. However, this results in the score being lowered too far, as shown.
Thanks again for your investigation into this issue.
I've taken a look at the photo example you posted; this again looks like an issue with the automatic wildcard that is currently used when you do a search.
You can see this for yourself at lib/index.js:301, where it expands the query term. In the example index, 'photo' expands to ['photo', 'photograph'].
So you get the following vectors:
var queryVector = [1.6931471805599454, 1.052011492633005],
photoVector = [1.6931471805599454, 0],
photographVector = [0, 1.6931471805599454]
And so the queryVector does not exactly match the photo vector, hence the score of less than 1. When doing a search for photograph this doesn't happen, because photograph doesn't expand to anything, so you get the following vectors:
var queryVector = [0, 1.6931471805599454],
photoVector = [1.6931471805599454, 0],
photographVector = [0, 1.6931471805599454]
Hence you get a score of 1 for the photograph search.
Without the automatic wildcarding, photo would not be expanded and you would get the result you expect.
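As a rough check of the reasoning above, the following sketch computes the plain cosine between the vectors quoted in this comment. It assumes the score is the cosine of the angle between the query and document vectors, as described earlier in the thread; the `cosine` helper and the variable names are illustrative and are not lunr's internals.

```javascript
// Cosine of the angle between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Document vectors over the dimensions [photo, photograph], as quoted above.
const photoVector = [1.6931471805599454, 0];
const photographVector = [0, 1.6931471805599454];

// Query 'photo' is expanded to both tokens, so both dimensions are non-zero.
const photoQuery = [1.6931471805599454, 1.052011492633005];

// Query 'photograph' expands to nothing else, so only one dimension is set.
const photographQuery = [0, 1.6931471805599454];

console.log(cosine(photoQuery, photoVector));           // about 0.85, not 1
console.log(cosine(photographQuery, photographVector)); // 1 (up to rounding)
console.log(cosine(photographQuery, photoVector));      // 0
```

The extra non-zero component added by the expansion tilts the 'photo' query away from the 'photo' document, which is why the exact match falls short of a perfect score.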
I'm not sure of the best way to progress this issue. I have opened a separate issue, #37, to discuss adding a lower-level query interface that would not have automatic wildcarding. I still think that in general use having the automatic wildcard at the end of each query term is useful, but perhaps there could be a way to disable this? E.g. idx.search('photo', false) or idx.search('photo', { autoWildcard: false }).
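To illustrate what such a switch would change, here is a small standalone sketch of prefix expansion with an on/off flag. `expandTerm` and its `autoWildcard` parameter are hypothetical names made up for this example; they are not part of lunr's API and do not reproduce the code in lib/index.js.

```javascript
// Hypothetical helper, not part of lunr: expands a query term against the
// tokens stored in an index. With autoWildcard enabled, every token that
// starts with the term is returned; with it disabled, only an exact match.
function expandTerm(term, indexedTokens, autoWildcard = true) {
  if (!autoWildcard) {
    return indexedTokens.includes(term) ? [term] : [];
  }
  return indexedTokens.filter((token) => token.startsWith(term));
}

const tokens = ['photo', 'photograph'];

console.log(expandTerm('photo', tokens));        // [ 'photo', 'photograph' ]
console.log(expandTerm('photo', tokens, false)); // [ 'photo' ]
console.log(expandTerm('photograph', tokens));   // [ 'photograph' ]
```

With expansion disabled, the query 'photo' touches only the 'photo' dimension, so its vector would line up with the 'photo' document in the same way the 'photograph' query already lines up with the 'photograph' document.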
I have the exact same problem. Unfortunately, even query terms that are an exact match are rated lower than partial matches.
E.g.: you have a data set of [foo, foobar] and if you search it for foo, lunr will return [foobar, foo].
Besides this little detail, thank you very much for this awesome project.
The following are steps to reproduce an issue I am experiencing. First, create an index and add a two-word String:
This first query is a portion of the first word followed by the complete second word:
This second query also begins with a portion of the first word (can be identical to the term in the previous query, or not) followed by only a portion of the second word:
Issue: Why does the second, less accurate query return a higher score than the first, more accurate query?
Note: For this particular example, the issue occurs only when the stemmer is disabled. See my comment below for better examples.
Thank you very much.