olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Using invertedIndex for autocomplete #287

Closed · hungerburg closed this 6 years ago

hungerburg commented 7 years ago

Not so much an issue with lunr, which is great! More a quick try to get ideas going…

In a shell with the jq utility I pull my terms from the lunr index in advance:

   jq '[.index.invertedIndex[][0]|scan("^\\w{3,}")]|unique' index.json > iindex.json

I can then feed that list to the http://api.jqueryui.com/autocomplete/ widget like below:

   function normalize(str) {
      // fold German umlauts and ß to ASCII equivalents; leave other characters as-is
      var map = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };
      return str.replace(/[^A-Za-z0-9]/g,
         function(a) { return map[a]||a; }
      );
   }
   $.getJSON('iindex.json', function (tags) {
      $('#query').autocomplete({
         minLength: 3,
         source: function(inp, out) {
            var t = normalize(inp.term);
            var r = $.ui.autocomplete.filter(tags, t);
            out(r);
         }
      });
   });

Not entirely elegant, but it works acceptably so far. What would be truly nice is to build the autocomplete index on the client and have the term to match processed by the indexer's pipeline instead of by the crude normalizer above.

olivernn commented 7 years ago

Sorry for the late reply.

You could definitely wrap that normalise function up into a lunr plugin. There is a similar project, lunr-unicode-normalizer, but I don't think it has been updated for lunr 2.
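
Roughly, such a plugin might look like the sketch below (an illustration only, assuming lunr 2.x; the names are made up, and token.update is just lunr's helper for rewriting a token's string):

var charMap = { 'ä': 'a', 'ö': 'o', 'ü': 'u', 'ß': 'ss' }

// fold German umlauts and ß before the stemmer sees the token
var umlautNormaliser = function (token) {
  return token.update(function (str) {
    return str.replace(/[äöüß]/g, function (c) { return charMap[c] })
  })
}

var umlautPlugin = function (builder) {
  lunr.Pipeline.registerFunction(umlautNormaliser, 'umlautNormaliser')

  // run it both when indexing and when searching
  builder.pipeline.before(lunr.stemmer, umlautNormaliser)
  builder.searchPipeline.before(lunr.stemmer, umlautNormaliser)
}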

As for autocomplete, I need to get round to actually putting a demo of this together, but this is what I've been suggesting to people.

idx.query(function (q) {
  // exact matches should have the highest boost
  q.term(searchTerm, { boost: 100 })

  // prefix matches should be boosted slightly
  q.term(searchTerm, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })

  // finally, try a fuzzy search, without any boost
  q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
})

I disable the pipeline to prevent stemming from getting in the way; you would have to experiment to see whether this makes sense for your use case, especially if you want to add the unicode normalising plugin.

Additionally, when using the query method lunr won't do any tokenisation for you. You can either handle this yourself, borrow lunr.tokenizer directly, or use its separator regex to split the input into individual tokens.
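
If you borrow the tokenizer, a minimal sketch (assuming lunr 2.x, where lunr.tokenizer.separator is the regex lunr itself splits on) would be something like:

// split the raw input the way lunr's tokenizer would,
// then add query clauses for each piece
idx.query(function (q) {
  searchTerm.split(lunr.tokenizer.separator).forEach(function (term) {
    q.term(term, { boost: 100 })
    q.term(term, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })
  })
})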

hungerburg commented 7 years ago

Dear Oliver, no need to be sorry! With the help of your snippet I cobbled together a little more context. In fact, I now get around creating an extra iindex upfront and can feed the autocomplete from lunr itself, and by using the pipeline I can also get rid of my own normalizer:

   // LUNR query for autocomplete terms
   function acsearch(searchTerm) {
      var results = index.query(function (q) {
         // exact matches should have the highest boost
         q.term(searchTerm, { boost: 100 })
         // wildcard matches should be boosted slightly
         q.term(searchTerm, { boost: 10, usePipeline: true, wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING })
         // finally, try a fuzzy search, without any boost
         q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
      });
      if (!results.length) { return []; }
      return results.map(function(v, i, a) { // extract
         return Object.keys(v.matchData.metadata);
      }).reduce(function(a, b) { // flatten
         return a.concat(b);
      }).filter(function(v, i, a) { // uniq
         return a.indexOf(v) === i;
      });
   }

   // install jquery autocomplete widget
   $('#query').autocomplete({
      appendTo: '#dlg',
      minLength: 3,
      source: function(inp, out) {
         out(acsearch(inp.term.toLowerCase()));
      }
   });

The resulting list reads much the same, though now it is sorted by score and some new fuzzy terms appear. Performance on my desktop is also good. I think this shows all that can be pulled from an inverted index; anything more would need a mapping back to the unstemmed original text, and that is beyond the scope of this experiment, at least for now. Thank you very much indeed!

olivernn commented 7 years ago

@hungerburg glad you managed to make some progress. Indeed the stemming does make implementing really good autocomplete difficult.

Just thinking aloud, but you can get Lunr to create and store the mapping to the unstemmed words, though it will likely increase the index size considerably.

During indexing you can add a pipeline function that stores the unstemmed words as metadata for a token. The concept is discussed a bit in the guides.

You would need to add a pipeline function before the stemmer:

var storeUnstemmed = function (token) {
  // keep the original, unstemmed form alongside the stemmed term
  token.metadata['unstemmed'] = token.toString()
  return token
}
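
Wiring that up might look roughly like this (a sketch only, assuming lunr 2.x; the ref and field names are placeholders):

// register so the index can be serialised, then hook it in before the stemmer
lunr.Pipeline.registerFunction(storeUnstemmed, 'storeUnstemmed')

var idx = lunr(function () {
  this.pipeline.before(lunr.stemmer, storeUnstemmed)

  // the metadata key has to be whitelisted to show up in search results
  this.metadataWhitelist.push('unstemmed')

  this.ref('id')
  this.field('body')
  // this.add(doc) for each document
})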

The idea would be that you then get the unstemmed words from the search results; the guide has a more detailed example.
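
In outline, reading them back out might look like this (a sketch; the 'body' field name is a placeholder, and each whitelisted key comes back as an array of values):

var results = idx.search('example')

// matchData.metadata is keyed by (stemmed) term, then by field,
// with the whitelisted metadata underneath
results.forEach(function (result) {
  Object.keys(result.matchData.metadata).forEach(function (term) {
    var unstemmed = result.matchData.metadata[term]['body'].unstemmed
    console.log(term, '->', unstemmed)
  })
})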

First prize would be to have an algorithmic unstemmer, but I don't think it's possible.

olivernn commented 6 years ago

Closing this as it has gone stale; feel free to comment or re-open if you still have any unanswered questions.

chrisbartley commented 5 years ago

First, a HUGE thanks to both @hungerburg and @olivernn for this. I combined both suggestions and it's working great. For anyone wanting to do the same, this is what worked for me...

I'm indexing like this:

// Store unstemmed term in the metadata.  See:
// https://github.com/olivernn/lunr.js/issues/287#issuecomment-322573117
// https://lunrjs.com/guides/customising.html#token-meta-data
const storeUnstemmed = function(builder) {

   // Define a pipeline function that keeps the unstemmed word
   const pipelineFunction = function(token) {
      token.metadata['unstemmed'] = token.toString();
      return token;
   };

   // Register the pipeline function so the index can be serialised
   lunr.Pipeline.registerFunction(pipelineFunction, 'storeUnstemmed');

   // Add the pipeline function to the indexing pipeline, just before the stemmer
   builder.pipeline.before(lunr.stemmer, pipelineFunction);

   // Whitelist the unstemmed metadata key
   builder.metadataWhitelist.push('unstemmed');
};

const index = lunr(function() {
   this.use(storeUnstemmed);
   ...
});

And I modified the autocomplete function suggested by @hungerburg to use the unstemmed words like this:

autoComplete(searchTerm) {
   const results = this._index.query(function(q) {
      // exact matches should have the highest boost
      q.term(searchTerm, { boost : 100 })
      // wildcard matches should be boosted slightly
      q.term(searchTerm, {
         boost : 10,
         usePipeline : true,
         wildcard : lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
      })
      // finally, try a fuzzy search, without any boost
      q.term(searchTerm, { boost : 1, usePipeline : false, editDistance : 1 })
   });
   if (!results.length) {
      return [];
   }
   return results.map(function(v, i, a) { // extract unstemmed terms
      const unstemmedTerms = {};
      Object.keys(v.matchData.metadata).forEach(function(term) {
         Object.keys(v.matchData.metadata[term]).forEach(function(field) {
            v.matchData.metadata[term][field].unstemmed.forEach(function(word) {
               unstemmedTerms[word] = true;
            });
         });
      });
      return Object.keys(unstemmedTerms);
   }).reduce(function(a, b) { // flatten
      return a.concat(b);
   }).filter(function(v, i, a) { // uniq
      return a.indexOf(v) === i;
   });
}

Thanks!