olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Searching terms contain hyphen vs no hyphen #296

Closed myalgo closed 6 years ago

myalgo commented 7 years ago

I have strings in one of the fields that contain a hyphen, for example anti-virus. If the search query is anti-virus, the results returned contain the terms anti and virus. If the search query is antivirus, no results are returned.

Is there a way to update the pipeline (or any other function in lunr) so that antivirus is matched as well?

olivernn commented 7 years ago

This is caused by what characters Lunr considers to be word separators. It currently uses this regex /[\s\-]+/ so it sees "anti-virus" and converts that to the two tokens anti and virus. When you search for "antivirus" there is no hyphen and so Lunr considers it a single token, and it doesn't have a single token "antivirus", hence no results.
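
To make the splitting concrete, here is a minimal sketch that applies the separator regex quoted above to plain strings. It uses String.prototype.split as a stand-in for Lunr's tokenizer (an assumption made for illustration; the real tokenizer also does other work, such as lowercasing), and the variable names are hypothetical.

```javascript
// The separator regex quoted in this thread: whitespace or hyphen.
var defaultSeparator = /[\s\-]+/

// Stand-in for tokenizing a document field and a search query.
// Assumption: plain split() approximates how the separator is applied.
var indexedTokens = 'anti-virus'.split(defaultSeparator)
var queryTokens = 'antivirus'.split(defaultSeparator)

console.log(indexedTokens) // [ 'anti', 'virus' ]
console.log(queryTokens)   // [ 'antivirus' ]

// The query token is not among the indexed tokens, hence no results.
var queryMatches = indexedTokens.indexOf(queryTokens[0]) !== -1
console.log(queryMatches)  // false
```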

You can try changing what Lunr considers to be word separators to no longer include a hyphen e.g.

lunr.tokenizer.separator = /\s+/ // white space only

Then you can insert a pipeline function at the beginning of the pipeline. This will look for tokens that contain a hyphen and return the original token, plus tokens created by splitting on the hyphen, and maybe even without the hyphen for good luck. E.g. when we see "anti-virus" in the document, we will end up indexing "anti-virus", "anti", "virus", "antivirus".

var hyphenator = function (token) {
  // if there are no hyphens then skip this logic
  if (token.toString().indexOf('-') === -1) return token

  // split the token by hyphens, returning a clone of the original token with the split
  // e.g. 'anti-virus' -> 'anti', 'virus'
  var tokens = token.toString().split('-').map(function (s) {
    return token.clone(function () { return s })
  })

  // clone the token and replace any hyphens
  // e.g. 'anti-virus' -> 'antivirus'
  tokens.push(token.clone(function (s) { return s.replace(/-/g, '') }))

  // finally push the original token into the list
  // 'anti-virus' -> 'anti-virus'
  tokens.push(token)

  // send the tokens on to the next step of the pipeline
  return tokens
}

I've not tested the above, so you'll have to try it out and see how it works. Let me know how it works out or if you have any other questions.
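
For anyone who wants to check the expansion logic without a Lunr index, the following is a minimal sketch of the same idea on plain strings. The helper name expandHyphenated is hypothetical and not part of Lunr; it returns plain strings rather than lunr.Token clones, so it only illustrates which terms would end up indexed.

```javascript
// Hypothetical helper, not part of Lunr: mirrors the hyphenator's logic
// on plain strings instead of lunr.Token objects.
function expandHyphenated (term) {
  // no hyphen: pass the term through unchanged
  if (term.indexOf('-') === -1) return [term]

  // the parts around each hyphen, e.g. 'anti-virus' -> 'anti', 'virus'
  var parts = term.split('-')

  // the term with every hyphen removed, e.g. 'anti-virus' -> 'antivirus'
  var joined = term.replace(/-/g, '')

  // parts first, then the joined form, then the original term
  return parts.concat([joined, term])
}

console.log(expandHyphenated('anti-virus'))
// [ 'anti', 'virus', 'antivirus', 'anti-virus' ]
```

With all four terms indexed, a query for anti-virus, antivirus, anti or virus has something to match against.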

myalgo commented 7 years ago

Hi Oliver,

I am using the customPipeline in following way

var lunr =  require("lunr");
var customPipeline = function (builder) {
    var pipelineFunction = function (token) {...}; //hyphenator 
    lunr.Pipeline.registerFunction(pipelineFunction, 'customPipeline');
    builder.pipeline.before(lunr.stemmer, pipelineFunction);
    builder.searchPipeline.before(lunr.stemmer, pipelineFunction);
    };
var idx = lunr(function () {
    this.use(customPipeline);
//define fields, add data
});

var results = idx.search("anti-virus");
console.log(JSON.stringify(results));
results = idx.search("antivirus");
console.log(JSON.stringify(results));

The first one prints results, but the second does not.

I also plan to use predefined/prebuilt indexes on the client. Do I need to add something like lunr.use(customPipeline); on the client?

Please let me know.

olivernn commented 7 years ago

Could you put together an example in jsfiddle (or similar) so I can see what is happening?

Adding your pipeline function to searchPipeline implies that it is required for searching, so you will need to have your pipeline function available on the client side. The call that registers your pipeline function needs to happen both when building the index on the server side and before loading the serialised index on the client side. As such, it's probably best to separate the registering of the pipeline function from the plugin function, since you won't be calling use on the client side.

As a node module it might look something like this (untested):

var lunr = require('lunr')

var hyphenator = function (token) { ... }

// register the pipeline function, this needs to happen client & server side
lunr.Pipeline.registerFunction(hyphenator, 'hyphenator')

// define the plugin, this will only be used when building the index, not when loading
module.exports = function (builder) {
    builder.pipeline.before(lunr.stemmer, hyphenator);
    builder.searchPipeline.before(lunr.stemmer, hyphenator);
}

myalgo commented 7 years ago

Oliver,

This seems to be working. https://jsfiddle.net/msexkb0w/3/

I had missed the lunr.tokenizer.separator = /\s+/ setting earlier. I was also unsure how to use the custom code on the client side. Thanks :)

olivernn commented 6 years ago

Closing this as I think there is nothing more to do. Feel free to comment if you still have unanswered questions.

PowerMogli commented 5 years ago

We are using your solution with the custom pipeline, but we are now facing some issues with it, especially in combination with multi-language support:

import * as lunr from 'lunr';
import * as lunrDE from 'lunr-languages/lunr.de';
import * as lunrMulti from 'lunr-languages/lunr.multi';
import * as lunrStemmer from 'lunr-languages/lunr.stemmer.support';

var hyphenator = function (token) { ... }

lunrStemmer(lunr);
lunrMulti(lunr);
lunrDE(lunr);
const languagePlugin = lunr.multiLanguage('en', 'de');

lunr.Pipeline.registerFunction(hyphenator, 'hyphenator');
lunr.tokenizer.separator = /\s+/; // white space only

const _that = this;

this.index = lunr(function() {
      this.b(0);
      this.ref('_id');
      this.use(languagePlugin);
      this.use(_that._plugin);
      this.pipeline.before(lunr.stemmer, hyphenator);
      this.searchPipeline.before(lunr.stemmer, hyphenator);
      this.pipeline.remove(lunr.de.stemmer);
      this.pipeline.remove(lunr.stemmer);
      this.searchPipeline.remove(lunr.de.stemmer);
      this.searchPipeline.remove(lunr.stemmer);
    });

The problem is that the hyphenator function returns an array of tokens, but lunr.stemmer.support.js expects a single string token, not an array. We therefore get an exception in lunr.stemmer.support.js at line 136: in current.charCodeAt(...), current is an array rather than a single token.

olivernn commented 5 years ago

Can you put together a simple reproduction in something like JSFiddle?

As far as I understand, lunr-languages should work with both Lunr 1 and Lunr 2. From the README:

Lunr Languages is compatible with Lunr version 0.6, 0.7, 1.0 and 2.X

cc @MihaiValentin

PowerMogli commented 5 years ago

Forgive me, I had modified your hyphenator example, and that's why my code didn't behave as it should. I'm sorry. 🙏🏼

But I have another question: we are using UUIDs as the id values of our entities. We would like to search for entities by terms containing the id of an entity. With the help of your custom pipeline we can enrich the index with the original UUID and a version without hyphens. The problem is that QueryLexer has a termSeparator which splits our UUID into several terms, and this hinders us because when we search for

a9b830cb-078f-4edb-96ba-8e4c70de4b9b

QueryLexer is splitting the uuid into five terms which don't match the tokens inside the index (a9b830cb-078f-4edb-96ba-8e4c70de4b9b and a9b830cb078f4edb96ba8e4c70de4b9b).

What we already have is also this:

lunr.tokenizer.separator = /\s+/ 

Do you have any idea?

For now we also configure lunr.QueryLexer.termSeparator = /\s+/. Configuring lunr.tokenizer.separator alone does not do the trick, because the change is not reflected in QueryLexer.termSeparator.
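
The effect on the UUID can be sketched with plain strings. This is only an illustration: String.prototype.split stands in for the query lexer (an assumption, since the lexer scans the query character by character), and the default term separator is assumed to be the same /[\s\-]+/ regex quoted earlier in this thread.

```javascript
var uuid = 'a9b830cb-078f-4edb-96ba-8e4c70de4b9b'

// Assumed default: the whitespace-or-hyphen regex quoted earlier in the thread.
var defaultTermSeparator = /[\s\-]+/
// The whitespace-only separator used as the workaround.
var whitespaceOnly = /\s+/

var splitByDefault = uuid.split(defaultTermSeparator)
var splitByWhitespace = uuid.split(whitespaceOnly)

console.log(splitByDefault)
// [ 'a9b830cb', '078f', '4edb', '96ba', '8e4c70de4b9b' ]
console.log(splitByWhitespace)
// [ 'a9b830cb-078f-4edb-96ba-8e4c70de4b9b' ]
```

With the whitespace-only separator the query stays a single term, so it can match the unsplit UUID token that the hyphenator put into the index.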