olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.89k stars 545 forks

Replacing trimmer has problems adding to pipeline and with multilanguage #348

Closed crystalfp closed 6 years ago

crystalfp commented 6 years ago

I want to replace the standard trimmer with an "improved" one. Following the example on the customization page I defined a plugin this way:

const improvedTrimmer = function (builder) {

  const pipelineFunction = function (token) {
    return token.update(function (s) {
      return s
        .replace(/\\_/, "_")
        .replace(/\\!/, "!")
        .replace(/www\./, "")
        .replace(/^\W+/, "")
        .replace(/\W+$/, "");
    });
  };

  // Register the pipeline function so the index can be serialized
  lunr.Pipeline.registerFunction(pipelineFunction, 'improvedTrimmer');

  // Add the pipeline function to both the indexing pipeline and the searching pipeline
  builder.pipeline.before(lunr.trimmer, pipelineFunction);
  builder.pipeline.remove(lunr.trimmer);
  builder.searchPipeline.before(lunr.trimmer, pipelineFunction);
  builder.searchPipeline.remove(lunr.trimmer);
};

The line builder.searchPipeline.before(lunr.trimmer, pipelineFunction); fails because lunr.trimmer is not defined in the searchPipeline. It works if I use lunr.stemmer (and do not remove it) as in the example, but I'm not sure the result is the same. My question: is it still necessary to add my function to the searchPipeline?

Adding to lunr this way:

fullTextIndex = lunr(function() {
    this.use(improvedTrimmer);
    this.use(lunr.multiLanguage("en", "it"));
        ...

This works, but I'm not sure my trimmer is used. BTW, if I switch the two this.use lines, my plugin fails because lunr.trimmer is no longer in the pipeline.

I'm using version 2.2.1 under Node.js and have to say lunr is really a fantastic tool that needs only a little more documentation and examples.

olivernn commented 6 years ago

The default search pipeline does not include lunr.trimmer; the error thrown could probably include a little more detail.

A search is done either via lunr.Index#search or lunr.Index#query. In the first case, the query parser handles the basic trimming that lunr.trimmer performs. For the latter, it is assumed that the term being searched for has already been pre-processed in whichever way makes sense for the index. lunr.Index#query is intentionally more low level, in the hope that it offers enough flexibility.

As for your specific case, is there any need to replace the existing trimmer? Could you not just have your custom trimmer in addition to the default trimmer?

You can certainly add your custom trimmer to the search pipeline. The removal of leading/trailing whitespace should make no difference to lunr.Index#search, as by the time a search term enters the search pipeline it will already be trimmed of whitespace.

If you do add it to the search pipeline I would recommend adding it before the stemmer so that the stemmer sees as clean a token as possible.
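A minimal sketch of that suggestion, assuming the default English pipelines are still in place (that is, no language plugin has replaced them yet). The names addCustomTrimmer and 'customTrimmer' are hypothetical, and lunr is passed in as a parameter so the snippet makes no assumption about how it is loaded:

```javascript
// Hypothetical plugin factory: keeps lunr.trimmer and adds a custom
// pipeline function next to it instead of replacing it.
function addCustomTrimmer(lunr, pipelineFunction) {
  return function (builder) {
    // Register the function so a serialized index can be loaded again
    lunr.Pipeline.registerFunction(pipelineFunction, 'customTrimmer');

    // Indexing pipeline: run right after the default trimmer
    builder.pipeline.after(lunr.trimmer, pipelineFunction);

    // Search pipeline: lunr.trimmer is not part of it by default,
    // so anchor on the stemmer and run before it
    builder.searchPipeline.before(lunr.stemmer, pipelineFunction);
  };
}
```

Inside the lunr configuration function this would be used as this.use(addCustomTrimmer(lunr, myPipelineFunction)), where myPipelineFunction is whatever token function you want to add.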

"BTW, if I switch the two this.use lines, my plugin fails because lunr.trimmer is no longer in the pipeline."

I think this is because the lunr-languages plugins replace the trimmer with their own language-aware trimmers. The default trimmer is pretty basic and might not work well with non-ASCII characters.

crystalfp commented 6 years ago

Thanks for the explanation. I added to the trimmer the removal of some strings that are not, strictly speaking, single characters, but more a combination of separator and stop word. For example, it removes .pdf extensions without adding the '.' to the separator list. So it is really not needed in the search pipeline.

Anyway, querying my small database (composed of markdown texts) continues to improve after I updated the trimmer and the lunr.tokenizer.separator.
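A rough sketch of the string clean-up described here, using the patterns quoted in this thread. The helper name cleanTokenString and the ordering of the steps are assumptions (stripping the edges first lets the .pdf pattern still match when punctuation follows the extension):

```javascript
// Hypothetical helper: the pure string transformation, kept separate
// from lunr so it can be checked on its own.
function cleanTokenString(s) {
  return s
    .replace(/\\_/, "_")     // markdown-escaped underscore
    .replace(/\\!/, "!")     // markdown-escaped exclamation mark
    .replace(/www\./, "")    // drop a "www." prefix
    .replace(/^\W+/, "")     // leading non-word characters
    .replace(/\W+$/, "")     // trailing non-word characters
    .replace(/\.pdf$/, "");  // file extension, without making "." a separator
}

// As a lunr pipeline function, applied through token.update
const improvedTrimmerFunction = function (token) {
  return token.update(cleanTokenString);
};

console.log(cleanTokenString("report.pdf,"));        // "report"
console.log(cleanTokenString("(www.example.com)"));  // "example.com"
```

Keeping the string logic in a plain function makes it easy to check which tokens should reach the index before wiring it into a pipeline.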

Thanks again!

crystalfp commented 6 years ago

A few more tests...

It seems my trimmer is ignored. In the index (invertedIndex) there are still tokens that end in .pdf even though I explicitly remove it in the trimmer (.replace(/\.pdf$/, "")). It seems the multilanguage extension overwrites everything in the pipeline.

BTW, I removed lunr.trimmer from the pipeline trying to outsmart multilanguage...

So it seems the only solution is to go back to this.pipeline.before(this.pipeline._stack[0], improvedTrimmerPipelineFunctionOnly); after loading multilanguage. Ugly, but at least it does not ignore me.

olivernn commented 6 years ago

Ah, that's annoying. It does look like the multi language plugin resets the pipeline, which seems a bit hostile!

crystalfp commented 6 years ago

I confirm that if I put the loading of my trimmer after the multilanguage plugin, it is called.

Maybe allowing .before() and .after() to take the reference function (1st argument) by label would alleviate this problem.

That is, instead of my horrid solution this.pipeline.before(this.pipeline._stack[0], myfunc) I can just say this.pipeline.before("lunr-multi-trimmer-en-it", myfunc). Just a suggestion. Thanks again!
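Until something like that exists, the lookup by label could be sketched as a small helper. This is only an illustration: beforeLabel is a hypothetical name, it still relies on the private _stack array, and it assumes every function in the pipeline carries the label property that lunr.Pipeline.registerFunction sets:

```javascript
// Hypothetical helper: insert newFn before the pipeline function
// that was registered under the given label.
// ASSUMPTION: reads the private pipeline._stack array, so it may break
// if lunr changes its internals.
function beforeLabel(pipeline, label, newFn) {
  const existing = pipeline._stack.find(function (fn) {
    return fn.label === label;
  });
  if (!existing) {
    throw new Error("No pipeline function with label: " + label);
  }
  pipeline.before(existing, newFn);
}
```

It would be called after the multilanguage plugin has been loaded, with whatever label that plugin registered for its trimmer, for example beforeLabel(this.pipeline, "lunr-multi-trimmer-en-it", myfunc).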

olivernn commented 6 years ago

I'm going to close this now as I don't think there is anything to be done on the original topic of this issue.