olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

To stopword or not #404

Closed: chriscmuir closed this issue 5 years ago

chriscmuir commented 5 years ago

We're stuck on a challenge with stopwords, and we're hoping somebody can share how they've solved it.

We have a lunr index with a wide corpus of entries, containing a varied range of business domain knowledge. For most purposes the stopwords are working well removing lots of unnecessary words from the index.

However, one challenge we have is the stopword "on". In our corpus the term "single sign on" is an important one because it is a regular search term, but thanks to the "on" stopword it appears to be getting shortened to "single sign", and our results are getting skewed towards answers about signing documents rather than the single sign on result we're interested in.

So we're kind of stuck in the middle. The "on" stopword is useful for the majority of our corpus, but not for one of our important entries "single sign on".

How have others solved this? Is there some way to protect certain phrases from stopwording in all cases? We'd appreciate your advice on this.

For what it's worth, it looks like Elasticsearch has addressed this problem by introducing the Common Terms Query: https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-query

olivernn commented 5 years ago

That article on Common Terms Query is very interesting. It is almost possible to implement that for Lunr, save for one piece of data that a Lunr index has, but does not expose.

If I'm understanding the article correctly the basic algorithm is:

  1. At query time check which terms are 'stop words'
  2. Perform the query without those 'stop words'
  3. Depending on the number of results perform additional queries with those stop words
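The three steps above can be sketched roughly as follows. This is only an illustration of the control flow, not Lunr's or Elasticsearch's implementation: `commonTermsSearch`, `isStopWord`, `search` and `shouldRefine` are all hypothetical names, and the rule for when to run the second query is left to the caller because the article's exact thresholds are not reproduced here.

```javascript
// Hypothetical sketch of a two-pass "common terms" style search.
// - queryTerms:   array of query terms (strings)
// - isStopWord:   function(term) -> boolean, decides which terms are "common"
// - search:       function(terms) -> array of results (stand-in for a real index query)
// - shouldRefine: function(results) -> boolean, caller's rule for running a second query
function commonTermsSearch(queryTerms, isStopWord, search, shouldRefine) {
  // Step 1: work out which terms are stop words.
  const important = queryTerms.filter((term) => !isStopWord(term));

  // Nothing to split: either every term is a stop word or none is.
  if (important.length === 0 || important.length === queryTerms.length) {
    return search(queryTerms);
  }

  // Step 2: query without the stop words.
  const firstPass = search(important);

  // Step 3: depending on the results, query again with the stop words included.
  return shouldRefine(firstPass) ? search(queryTerms) : firstPass;
}
```

With Lunr, `search` would presumably wrap a call built with the lower level query API, and `isStopWord` could be based on the existing stop word list or on term statistics if those were exposed.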

It is already possible to manipulate queries in Lunr using the lower level query API; the tricky part is detecting which query terms are stop words. The smart way would be to check the index for how common a term is. This data is available on the index via the invertedIndex property, but it is not public. Alternatively, the stop word filter could be used or adapted. This approach might be worth investigating...

Alternatively, you could wrap the existing stop word filter and check for phrases before filtering stop words. This is possible because a pipeline function gets passed three arguments (similar to Array#map):

  1. The current token
  2. The current token's index within all the tokens in this document field
  3. The list of all tokens within the document field

Using these arguments it should be possible to detect a phrase such as "single sign on" and then not call the stop word filter with "on". This approach is less sophisticated than the common terms query but may be less work initially.

If you're interested in implementing something like the Common Terms Query then we could think up a public interface to expose statistics about terms within the index.
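A minimal sketch of that wrapper, assuming the wrapped filter behaves like Lunr's stop word filter (returns the token to keep it, `undefined` to drop it). `protectPhrases`, `demoStopWords` and `demoStopWordFilter` are hypothetical names used only for illustration; the stand-in filter is there so the example runs without Lunr. Tokens are converted with `String(...)` so that both plain strings and token objects with a `toString` method can be compared.

```javascript
// Wrap a stop word filter so that tokens inside a protected phrase bypass it.
function protectPhrases(stopWordFilter, phrases) {
  // Pre-split each protected phrase into lower-cased words.
  const protectedPhrases = phrases.map((p) => p.toLowerCase().split(/\s+/));

  // Pipeline-style function: (token, index of token, all tokens in the field).
  return function phraseAwareStopWordFilter(token, index, tokens) {
    for (const phrase of protectedPhrases) {
      // Try every position the current token could occupy within the phrase.
      for (let offset = 0; offset < phrase.length; offset++) {
        const start = index - offset;
        if (start < 0 || start + phrase.length > tokens.length) continue;
        const matches = phrase.every(
          (word, i) => String(tokens[start + i]).toLowerCase() === word
        );
        if (matches) return token; // part of a protected phrase: keep as-is
      }
    }
    // Not part of any protected phrase: defer to the wrapped filter.
    return stopWordFilter(token);
  };
}

// Stand-in stop word filter (an assumption for this demo, not Lunr's list).
const demoStopWords = new Set(["on", "the", "a"]);
const demoStopWordFilter = (token) =>
  demoStopWords.has(String(token).toLowerCase()) ? undefined : token;

const phraseAwareFilter = protectPhrases(demoStopWordFilter, ["single sign on"]);

// Simulate one pipeline stage over a whitespace-tokenised string.
const runFilter = (text) =>
  text.split(" ").filter((t, i, all) => phraseAwareFilter(t, i, all) !== undefined);
```

Here `runFilter("enable single sign on")` keeps all four tokens, while `runFilter("sign on the line")` still drops "on" and "the". In a real index, assuming Lunr 2.x, the wrapper would presumably be given `lunr.stopWordFilter`, registered with `lunr.Pipeline.registerFunction`, and swapped into the builder's pipeline in place of the original filter. The phrase check relies on the tokens not yet being stemmed at that stage, so the wrapper would need to sit before the stemmer.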

Please let me know if you have any more questions. I'd also be interested in how you end up solving this!

chriscmuir commented 5 years ago

Thanks for the detailed reply Oliver. I think we'll need to take the second route of stopping the stop word filter via pipeline functions.

Thanks for your work and support on lunrjs.