olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Longer words don't match shorter root words #391

Closed joshdhenry closed 5 years ago

joshdhenry commented 5 years ago

Using Lunr 2.3.5. Here is a small example:

    const data = [{name: 'competitive'}, {name: 'compete'}];

    const index = lunr(function () {
        this.ref('name');
        this.field('name');
        data.forEach((item) => this.add(item));
    });

    console.log(index.search('competitive'));
    // runString() returns 'competit'.
    // Result: [ { ref: 'competitive', score: 0.693, matchData: { metadata: { competit: { name: {} } } } } ]

    console.log(index.search('compete'));
    // runString() returns 'compet'.
    // Result: [ { ref: 'compete', score: 0.693, matchData: { metadata: { compet: { name: {} } } } } ]

I'm confused about why searching for 'competitive' returns 'competitive' but not 'compete'. I would have expected 'compet' to be the stem of both words.

Is this expected with the Porter stemmer algorithm? Is there any way to make a search for 'competitive' return two results, 'compete' and 'competitive'? This probably happens with many words, so manually fixing individual words is less desirable.

hoelzro commented 5 years ago

Hi @joshdhenry!

Judging from a cursory test using Python's NLTK framework, it looks like this is just the Porter stemmer in action, as you suggested:

    Python 3.7.2 (default, Jan 10 2019, 23:51:51)
    [GCC 8.2.1 20181127] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from nltk.stem.porter import PorterStemmer
    >>> s = PorterStemmer()
    >>> s.stem('compete')
    'compet'
    >>> s.stem('competitive')
    'competit'

To answer your question about making the search do the right thing for "compete" and "competitive", I'm afraid you'll either need to use a different stemming algorithm, or write a pipeline function that treats the two as synonyms.
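As a rough illustration of the synonym approach, here is a minimal sketch of a pipeline function that rewrites one stem to another after the stemmer has run. The names `STEM_ALIASES`, `canonicalStem`, `stemAliasFilter` and `buildIndex` are made up for this example and are not part of Lunr; the alias table contains only the one pair from this issue and would have to be maintained by hand. It assumes Lunr 2.x, where pipeline functions receive a `lunr.Token` and the builder exposes both `pipeline` and `searchPipeline`.

```javascript
// Hypothetical alias table: stem produced by the stemmer -> stem to index/search under.
// Only the pair from this issue is listed; extend it by hand as needed.
const STEM_ALIASES = { competit: 'compet' };

// Return the canonical stem for a stem, or the stem itself if no alias is defined.
function canonicalStem(stem) {
  return Object.prototype.hasOwnProperty.call(STEM_ALIASES, stem)
    ? STEM_ALIASES[stem]
    : stem;
}

// Pipeline function: Lunr 2.x passes a lunr.Token, and token.update()
// replaces the token's string with the callback's return value.
function stemAliasFilter(token) {
  return token.update((str) => canonicalStem(str));
}

// Build an index with the alias filter appended after the stemmer,
// both at indexing time and at search time. `lunr` is passed in so the
// sketch does not depend on how the library is loaded.
function buildIndex(lunr, docs) {
  // Registering the function lets a serialised index be loaded again later.
  lunr.Pipeline.registerFunction(stemAliasFilter, 'stemAliasFilter');

  return lunr(function () {
    this.ref('name');
    this.field('name');
    this.pipeline.add(stemAliasFilter);       // runs after trimmer, stop words, stemmer
    this.searchPipeline.add(stemAliasFilter); // runs after the search-time stemmer
    docs.forEach((doc) => this.add(doc));
  });
}
```

With this in place, `buildIndex(lunr, data).search('competitive')` should look up 'compet' rather than 'competit', so both documents from the original example would be expected to match. The filter has to be in both pipelines; adding it to only one would leave indexed terms and query terms under different stems.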