olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Query about stop words and the relationship to stemmer. #194

Closed rluiten closed 7 years ago

rluiten commented 8 years ago

Firstly just want to say lunr.js is very cool.

I have been implementing lunr in the Elm language and have run across something I found a bit odd. I thought I would check with you, as it affects lunr.js as well, as far as I know.

In a test I noticed that one of the stop words was not pre-stemmed. Given my reading of lunr.js, this means those stop words can never be applied?

Three examples of the 24 I found against my code base.

I can supply the word mismatches, but several won't apply to lunr because of the changes to my stemmer mentioned below.

I am pretty sure that lunr.js does not stem the stop words list before using it as the stop word filter. If you agree this is likely an issue, maybe running the default stop words and any user-supplied ones through the stemmer to build the filter would address the problem?

I don't believe we can pre-stem the words, in case users add additional transforms to the token processing.
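A minimal sketch of the idea above, assuming a pipeline that stems first and filters afterwards. The `stem` function is a hypothetical lookup-table stand-in for a real Porter stemmer, and `buildStemmedStopWordFilter` is an illustrative name, not part of the lunr.js API.

```javascript
// Hypothetical stand-in for a Porter stemmer. "becaus" is the stem quoted in
// this thread; "howev" is the Porter stem of "however". Other tokens pass through.
const STEMS = new Map([["because", "becaus"], ["however", "howev"]]);
const stem = (token) => (STEMS.has(token) ? STEMS.get(token) : token);

// Build the filter from the stemmed stop words when the pipeline is set up,
// so that it still matches when it runs after the stemmer.
function buildStemmedStopWordFilter(stopWords, stemFn) {
  const stemmed = new Set(stopWords.map(stemFn));
  return (token) => !stemmed.has(token);
}

const keep = buildStemmedStopWordFilter(["because", "however", "the"], stem);

// Stem first, then filter against the stemmed stop word set.
const tokens = ["because", "search", "however", "index"].map(stem).filter(keep);
console.log(tokens); // [ 'search', 'index' ]
```

Stemming the list at setup time, rather than shipping a pre-stemmed list, keeps the filter consistent with whatever stemmer the pipeline actually uses.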

I wrote a quick test to check the default stop words against their stemmed versions and found a bunch that don't match. Note that my stemmer has drifted away from the lunr one, because I decided to make it pass the tests available on the Porter stemmer page at http://tartarus.org/martin/PorterStemmer/, namely the voc.txt and output.txt files, which I turned into a big, slow test to run occasionally.

Cheers, Robin.
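The kind of check described here could look roughly like the following. This is only a sketch: `stem` is a hypothetical lookup-table stand-in for a real Porter stemmer, the short stop word list is illustrative rather than lunr's default list, and `stopWordsChangedByStemming` is a made-up helper name.

```javascript
// Hypothetical stand-in for a Porter stemmer. "becaus" is the stem quoted in
// this thread; "howev" is the Porter stem of "however". Other tokens pass through.
const STEMS = new Map([["because", "becaus"], ["however", "howev"]]);
const stem = (token) => (STEMS.has(token) ? STEMS.get(token) : token);

// Report every stop word whose stem differs from the word itself. These are
// the words an unstemmed filter would miss if it ran after the stemmer.
function stopWordsChangedByStemming(stopWords, stemFn) {
  return stopWords
    .map((word) => ({ word, stem: stemFn(word) }))
    .filter((pair) => pair.word !== pair.stem);
}

const mismatches = stopWordsChangedByStemming(
  ["and", "because", "however", "the"], // illustrative subset only
  stem
);
console.log(mismatches);
```

With a real stemmer plugged in for `stem`, the same function could be run over the full default list and any user-supplied stop words.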

olivernn commented 8 years ago

The list of stop words has not been stemmed. Lunr applies the stop word filter before stemming the tokens. This could mean that we would admit tokens into the index that share a stem with a stop word. In practise this isn't a problem: I don't think there is an issue with any common words sharing a stem with "because" or "however". "likely" seems more, ahem, likely to share a stem, but "like" is also in the stop word list.

I think, in general, the idea is that the stop words must match exactly, so I don't think this is an issue. If you can think of cases where it would cause problems please let me know.

With regards to the stemmer, lunr's stemmer is also an implementation (or copy) of the PorterStemmer you reference. There are tests too, which I assume I also got from the PorterStemmer project, though I can't remember. Do these tests not match what you have?

p.s. The elm version of lunr is really cool!

rluiten commented 8 years ago

I mentioned those 3 words as I ran into hiccups with all 3 of them.

Specifically, the Porter stem of "because" is "becaus". If the stemmer is not applied to the stop word list before the stop word filter is created, then any time the word "because" is found in an article being indexed it won't get blocked, because the word from the article is "becaus", which won't match the unstemmed stop word "because".
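To make the ordering question concrete, here is a small sketch comparing the two pipeline orders with an unstemmed stop word list. It assumes a hypothetical `stem` stand-in that only knows the "because" to "becaus" mapping quoted above, and an illustrative two-word stop list; neither is taken from the lunr.js source.

```javascript
// Hypothetical stand-in for a Porter stemmer: it only knows the stem quoted
// in this thread and passes every other token through unchanged.
const STEMS = new Map([["because", "becaus"]]);
const stem = (token) => (STEMS.has(token) ? STEMS.get(token) : token);

// Unstemmed stop words, matched exactly.
const STOP_WORDS = new Set(["because", "the"]);
const notStopWord = (token) => !STOP_WORDS.has(token);

const input = ["because", "the", "index"];

// Filter first, then stem: "because" is still in its original form and is removed.
const filterThenStem = input.filter(notStopWord).map(stem);

// Stem first, then filter: "because" has become "becaus" and slips through.
const stemThenFilter = input.map(stem).filter(notStopWord);

console.log(filterThenStem); // [ 'index' ]
console.log(stemThenFilter); // [ 'becaus', 'index' ]
```

The unstemmed list only misbehaves in the second order, which matches the description earlier in the thread that lunr applies the stop word filter before stemming.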

However, from your comment it is likely I just misread the lunr.js code and missed where you applied the stemmer to the stop words before creating the filter list.

As to the stemmer differences, this was the only one afaik; it is in "Step1 c".

rluiten commented 7 years ago

A brief follow-up: I just got a nice bug report that demonstrated a clear issue with running the stemmer before applying those stop words, so I have changed my behavior to apply filters before the stemmer.