olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

lunr fails to find matches in entire words. #137

Closed MathieuDuponchelle closed 9 years ago

MathieuDuponchelle commented 9 years ago

Not sure how to best phrase this, here is an example :

  var index = new lunr.Index();

  index.ref('id');
  index.field('title', { boost: 10 });
  index.add({
    id: "ges.timelineelement",
    title: "TimelineElement"
  });

  // Query is the start of the indexed title.
  var results = index.search("TimelineElem");
  console.log("results length : " + results.length);

  // Query is the end of the indexed title.
  results = index.search("Element");
  console.log("results length : " + results.length);

this prints:

results length : 1
results length : 0

I may be missing something obvious here, but I would expect the index to match Element against TimelineElement with a low-ish score. What's going wrong here?

weixsong commented 9 years ago

Actually, what you want is not supported by lunr.js: it cannot match on the middle part of a token. I think even Google doesn't do this.

olivernn commented 9 years ago

@weixsong is right, this isn't currently available in lunr. The search is always a prefix search, so in your example a search for TimelineElem works because lunr is putting in wildcards at the end, e.g. TimelineElem becomes ^TimelineElem*. When searching for Element lunr is basically doing a search for ^Element* which does not match the documents you have.
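To make the prefix behaviour described above concrete, here is a minimal sketch in plain JavaScript. It does not use lunr at all: `prefixMatches` is a hypothetical helper, and the sketch assumes the title was indexed as a single lowercased token. It ignores stemming and scoring.

```javascript
// Hypothetical helper, not part of lunr: keeps the tokens that start with the
// query, which is roughly what a "^query*" search does.
function prefixMatches(indexedTokens, query) {
  var q = query.toLowerCase();
  return indexedTokens.filter(function (token) {
    return token.toLowerCase().indexOf(q) === 0;
  });
}

// Assumption: "TimelineElement" was indexed as one lowercased token.
var tokens = ["timelineelement"];

console.log(prefixMatches(tokens, "TimelineElem").length); // 1
console.log(prefixMatches(tokens, "Element").length);      // 0
```

"Element" appears inside the token but not at its start, so a prefix-only match cannot find it.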

It looks like, from your example, that you want lunr to be able to understand camel-cased tokens as separate tokens, e.g. TimelineElement is actually made of two tokens, timeline and element. You can add a custom tokeniser to the pipeline to split camel-cased words like this into separate tokens and then your search for Element will work.
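The splitting step suggested above could look something like the sketch below. Both `splitCamelCase` and `tokensFor` are hypothetical names used for illustration, not lunr functions, and how to hook such a function into the index depends on the lunr version in use, so that part is left out.

```javascript
// Hypothetical helper: split a camel-cased word into lowercased parts.
function splitCamelCase(word) {
  return word
    // lower-case letter or digit followed by an upper-case letter: "eE" -> "e E"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // run of capitals followed by a capitalised word: "GESTimeline" -> "GES Timeline"
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2")
    .split(/\s+/)
    .filter(function (part) { return part.length > 0; })
    .map(function (part) { return part.toLowerCase(); });
}

// Hypothetical helper: also keep the whole word, so that a prefix query such
// as "TimelineElem" can still match the full token.
function tokensFor(word) {
  var parts = splitCamelCase(word);
  return parts.length > 1 ? [word.toLowerCase()].concat(parts) : parts;
}

console.log(splitCamelCase("TimelineElement")); // [ 'timeline', 'element' ]
console.log(tokensFor("TimelineElement"));      // [ 'timelineelement', 'timeline', 'element' ]
```

With `element` indexed as its own token, a prefix search for Element has something to match against.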

MathieuDuponchelle commented 9 years ago

Yeah that's what I ended up doing, thanks

MathieuDuponchelle commented 8 years ago

@weixsong by the way, of course Google does this :)