olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Separate Pipelines on various fields #334

Closed ToLuSt closed 6 years ago

ToLuSt commented 6 years ago

New issue, following the discussion in #304.

Is it possible to have two pipelines that behave differently on different fields?

So, for example, I have two fields in the index: 'technical terms' and 'body'.

Now, I don't want 'technical terms' to be stemmed, because I want exact matches there to increase my precision. On the other hand, I want 'body' to be stemmed as normal for better recall.

I hope it's clear what I mean. Please have a look at my question. Thanks a lot!

@olivernn @hoelzro

olivernn commented 6 years ago

Having a specific pipeline for a specific field might be possible, but what I was thinking of was adding the field a term appeared in as metadata on the token.

This metadata is available to each pipeline function the token passes through, so it could decide whether to take action or not based on the field the token appears in.

For your specific case this would require removing the stemmer from the pipeline and replacing it with a wrapped stemmer that delegates, or not, to the original stemmer depending on the field. The wrapper would be fairly simple:

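// Wrap a pipeline function so that it only runs for tokens from the given fields
var fieldScoped = function (fields, delegate) {
  return function (token, i, tokens) {
    if (fields.indexOf(token.metadata['field']) === -1) {
      // token comes from a field outside the list: pass it through unchanged
      return token
    } else {
      return delegate(token, i, tokens)
    }
  }
}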

Using it would then look like this:

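var idx = lunr(function () {
  // Add the wrapped stemmer next to the original, then drop the original.
  // Listing 'body' means only 'body' tokens are stemmed; other fields pass through.
  this.pipeline.after(lunr.stemmer, fieldScoped(['body'], lunr.stemmer))
  this.pipeline.remove(lunr.stemmer)
})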

This would require a small change to lunr.Builder. I still need to think about what would happen in the search pipeline, as there will be no field there. But an approach like this might work.
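
To see what the wrapper does in isolation, here is a minimal sketch that runs it over stand-in tokens. The plain `{ str, metadata }` objects and `toyStemmer` are assumptions for illustration, not lunr's real `lunr.Token` or `lunr.stemmer`, and the `field` metadata key is the one proposed in this comment, which lunr did not set at the time.

```javascript
// Wrapper as proposed in the comment: delegate only for the listed fields.
var fieldScoped = function (fields, delegate) {
  return function (token, i, tokens) {
    if (fields.indexOf(token.metadata['field']) === -1) {
      return token
    }
    return delegate(token, i, tokens)
  }
}

// Hypothetical stand-in for a stemmer: strips one trailing "s".
var toyStemmer = function (token) {
  return { str: token.str.replace(/s$/, ''), metadata: token.metadata }
}

// Stem 'body' only, as asked for at the top of the thread.
var stemBodyOnly = fieldScoped(['body'], toyStemmer)

// Stand-in tokens; the field names are taken from the original question.
var bodyToken = { str: 'veneers', metadata: { field: 'body' } }
var termToken = { str: 'veneers', metadata: { field: 'technicalTerms' } }

var stemmedBody = stemBodyOnly(bodyToken, 0, [bodyToken])
var untouchedTerm = stemBodyOnly(termToken, 0, [termToken])

console.log(stemmedBody.str)   // veneer
console.log(untouchedTerm.str) // veneers
```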

ToLuSt commented 6 years ago

Thanks for your fast replies. This sounds like a good idea. My colleague suggested much the same.

olivernn commented 6 years ago

One thing to note: you will need to register any custom pipeline functions. There is more info in the guides.
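
Registration matters mainly when an index is serialised and loaded again, because a serialised pipeline refers to its functions by label. A minimal sketch of registering a wrapped stemmer follows. `lunr.Pipeline.registerFunction(fn, label)` is lunr's own API; the helper name `registerScopedStemmer`, the label, and passing `lunr` in as an argument (so the snippet does not depend on how lunr is loaded) are assumptions for illustration.

```javascript
// Hypothetical helper: wraps lunr.stemmer so it only runs for the given
// fields, then registers the wrapper under a label of our choosing.
function registerScopedStemmer (lunr, fields, label) {
  var scopedStemmer = function (token, i, tokens) {
    // 'field' is the metadata key proposed earlier in this thread
    if (fields.indexOf(token.metadata['field']) === -1) {
      return token
    }
    return lunr.stemmer(token, i, tokens)
  }

  // Registering ties the function to a label, so a serialised pipeline
  // can be rebuilt when the index is loaded again.
  lunr.Pipeline.registerFunction(scopedStemmer, label)
  return scopedStemmer
}
```

With lunr available as `lunr`, this would be called once before building or loading an index, for example `registerScopedStemmer(lunr, ['body'], 'bodyStemmer')`. The same registration has to run wherever a serialised index is loaded.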

ToLuSt commented 6 years ago

This is how I have done it for now (referring to my initial post, and using the German stemmer):

  1. Save the special term 'Veneer' in 'technicalTerms' with a prefix, e.g. m3g4sp3ci4lveneer.
  2. Wrap the stemmer, put it in a custom pipeline (CustPipeline), register the wrapped stemmer, and use CustPipeline in the index.
  3. When adding a document to the index, the customized stemmer checks whether the token includes the prefix and, if so, leaves it unstemmed.

For searching, I also wrapped the search function so that it works out which fields are searched with the stemmed query and which with the unstemmed one.
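
The steps above can be sketched roughly as follows, reduced to plain strings so that it runs without lunr. The marker value comes from the example in step 1; `markTerm`, `prefixAwareStemmer` and the toy stemmer are hypothetical names, and the real setup used the German stemmer inside a lunr pipeline.

```javascript
// Marker prepended to technical terms before indexing (value from step 1).
var MARKER = 'm3g4sp3ci4l'

// Step 1: tag a technical term so the stemmer can recognise it later.
// Lower-casing here mirrors the example term in step 1.
function markTerm (term) {
  return MARKER + term.toLowerCase()
}

// Steps 2 and 3: wrap a stemmer so that marked terms pass through unstemmed.
function prefixAwareStemmer (stem) {
  return function (term) {
    if (term.indexOf(MARKER) === 0) {
      return term // technical term: keep the exact form
    }
    return stem(term)
  }
}

// Hypothetical stand-in stemmer: strips one trailing "s".
var toyStem = function (term) { return term.replace(/s$/, '') }
var stemUnlessMarked = prefixAwareStemmer(toyStem)

var markedResult = stemUnlessMarked(markTerm('Veneers'))
var plainResult = stemUnlessMarked('veneers')

console.log(markedResult) // m3g4sp3ci4lveneers
console.log(plainResult)  // veneer
```

At search time a query term aimed at the technical field would need the same marker, which is presumably what the wrapped search function takes care of.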

olivernn commented 6 years ago

I have just published version 2.2.0, which includes changes that should make it possible to run different pipeline functions on different fields.

Specifically, when indexing and searching, the instances of lunr.Token will have an additional entry in the metadata named fields. When building the index this will contain a single string indicating the name of the field the token appeared in. When searching this will contain all the fields that this search term is targeted at.

Hopefully this provides enough context to achieve what you want. Please take it for a spin and let me know if you run into any issues. I'll update the docs/guides just now.
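
As a minimal sketch of how the new metadata could be used, the wrapper below skips a pipeline function for chosen fields. It assumes `token.metadata.fields` may be either a single field name or a list of names and normalises both; `skipForFields`, the stand-in tokens and the toy stemmer are hypothetical, not part of lunr.

```javascript
// Hypothetical wrapper: run `fn` unless every field the token belongs to
// (or, when searching, is targeted at) is listed in `skipFields`.
function skipForFields (skipFields, fn) {
  return function (token, i, tokens) {
    // Normalise: accept a single field name or an array of names.
    var fields = [].concat(token.metadata['fields'] || [])
    var allSkipped = fields.length > 0 && fields.every(function (name) {
      return skipFields.indexOf(name) !== -1
    })
    return allSkipped ? token : fn(token, i, tokens)
  }
}

// Hypothetical stand-in for a stemmer: strips one trailing "s".
var toyStemmer = function (token) {
  return { str: token.str.replace(/s$/, ''), metadata: token.metadata }
}

var stemExceptTerms = skipForFields(['technicalTerms'], toyStemmer)

// Stand-in tokens for indexing (one field) and searching (one or more fields).
var indexedBody = stemExceptTerms({ str: 'veneers', metadata: { fields: 'body' } })
var indexedTerm = stemExceptTerms({ str: 'veneers', metadata: { fields: ['technicalTerms'] } })
var searchAll = stemExceptTerms({ str: 'veneers', metadata: { fields: ['body', 'technicalTerms'] } })

console.log(indexedBody.str) // veneer
console.log(indexedTerm.str) // veneers
console.log(searchAll.str)   // veneer
```

Skipping only when every targeted field is in the list is a design choice, not something lunr prescribes. With it, an unscoped query is stemmed, so it only matches a technical term whose indexed form happens to equal the stem, while a query scoped to the technical field is left exact. For this to affect searches as well as indexing, the wrapper would have to replace the stemmer in both `this.pipeline` and `this.searchPipeline` of the builder.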

ToLuSt commented 6 years ago

Thanks a lot, I'm going to test this out. But that could take a little while. I don't have much time at the moment because I'm moving :-/