olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

No results for certain query lengths #38

Open mikeal opened 11 years ago

mikeal commented 11 years ago
> var idx = lunr(function () { this.field('title') })
> var doc = {"title": "Excavator", id:1}
{ title: 'Excavator', id: 1 }
> idx.add(doc)
undefined
> idx.search('exc')
[ { ref: '1', score: 1 } ]
> idx.search('exca')
[ { ref: '1', score: 1 } ]
> idx.search('excav')
[ { ref: '1', score: 1 } ]
> idx.search('excava')
[]
> idx.search('excavat')
[]
> idx.search('excavato')
[]
> idx.search('excavator')
[ { ref: '1', score: 1 } ]
olivernn commented 11 years ago

Thanks for reporting this, I've put together a fiddle to try and explain the issue a little more - http://jsfiddle.net/bt5yq/1/

This looks like an issue with the stemmer; in the fiddle I have shown what each query stems to. The term in the document, 'Excavator', stems to 'excav'. However, 'excavato', for example, stems to 'excavato', which doesn't exist in the index and therefore won't give any results.

The problem seems to come down to the fact that you are stemming partial words. The partial words are probably not matching any of the known word endings, so the stemmer gets confused and leaves them unchanged. I'm not sure this is something that can be fixed in the stemmer.

Other language processing may help here, for example metaphoning. It is a processor I want to have available for lunr, but I haven't got around to it yet. It reduces words to their phonetic sounds and is usually used to reduce the impact of spelling mistakes in searches.

edwardball commented 11 years ago

Also suffering from this. If it can't be fixed in the stemmer, perhaps it would be possible to do something along these lines:

If the most recently typed query ("excava") doesn't return any matches, but the previously typed query ("excav") did, then you can assume that "excava" has the same stem as "excav". Then, as long as any matches returned from the index are checked to make sure they include "excava", you would get the correct results and avoid returning false positives.

olivernn commented 11 years ago

@Aptary that would probably work, but I'm not sure it is the best approach for lunr to take. Making the search method stateful would introduce many more edge cases that could lead to subtle bugs and potentially very odd behaviour.

Your approach might be something that you want to try as a wrapper around your use of lunr, though. I can imagine having something that wraps the search method and caches the returned results for each fragment of the query, allowing you to handle these strange edge cases. You would probably want to expire this cache after a certain period of inactivity, or perhaps when the user hits the enter key in a search box. At this point, though, it would be fairly app-specific and something I think is best left outside of lunr.
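
A minimal sketch of such a wrapper, combining the fallback idea above with the reset suggested here. Nothing in it is part of lunr: `createFallbackSearch`, `getText` and `reset` are hypothetical names, `search` is assumed to be any function that returns an array of `{ ref, score }` results (for example `idx.search.bind(idx)`), and `getText` is assumed to return the original text of a document for a given ref. It only remembers the last successful query rather than caching every fragment.

```javascript
// Hypothetical helper, not part of lunr.
// search:  function (query) -> array of { ref, score }
// getText: function (ref)   -> original text of that document (assumed to
//          be available in the application, since lunr only returns refs)
function createFallbackSearch(search, getText) {
  var last = null // { query, results } of the most recent non-empty search

  function fallbackSearch(query) {
    var results = search(query)

    if (results.length > 0) {
      last = { query: query.toLowerCase(), results: results }
      return results
    }

    var needle = query.toLowerCase()

    // Only fall back when the user has extended the last successful query.
    if (last === null || needle.indexOf(last.query) !== 0) return results

    // Reuse the earlier results, keeping only documents that really contain
    // what was typed, so that no false positives are returned.
    return last.results.filter(function (result) {
      return getText(result.ref).toLowerCase().indexOf(needle) !== -1
    })
  }

  // Call this after inactivity or when the user submits the search box.
  fallbackSearch.reset = function () { last = null }

  return fallbackSearch
}
```

The state lives in the wrapper rather than in the index, so the search method itself stays stateless; when and how to call `reset` is left to the application.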

samchr commented 10 years ago

I hate to bring up this old issue again, but we're hitting the same bug, and I believe it's negatively affecting the user experience. We're trying to use lunr for a real-time search where the results update as the user types. The problem comes when a user types the word "genomic" and they get the expected results for both "genom" and "genomic" but "genomi" shows nothing. That just looks awkward and confusing. I quickly suspected that the stemmer was to blame, and it looks like I was right. I would love to try to find a way to make this work in lunr. Any thoughts?

DannyNemer commented 10 years ago

@samchr You might need to remove the stemmer from the pipeline:

var index = lunr(function () {
    this.ref('id')
    this.field('text')
})

// The stemmer is exposed as lunr.stemmer; it is not a property of the index
index.pipeline.remove(lunr.stemmer)

index.add({
    id: 1,
    text: 'genomic'
})

console.log(index.search('genom')) // [ { ref: '1', score: 1 } ]
console.log(index.search('genomi')) // [ { ref: '1', score: 1 } ]
console.log(index.search('genomic')) // [ { ref: '1', score: 1 } ]

Alternatively, you can construct an index directly. Its pipeline starts out empty, so it does not include the stemmer:

var index = new lunr.Index
index.field('text')

index.add({
    id: 1,
    text: 'genomic'
})