different pipelines for indexing and searching

zs-zs commented 10 years ago

Hi,

I'm using lunr.js in a project where autocomplete is a requirement in addition to client-side search. I was able to make autocomplete work by introducing an additional processing step in the pipeline, just before the stemmer does its work. It's good to see that there is now a way to extend the package with the new use() API on the index, so I started to extract my solution into a lunr extension; you can check my approach here.

I'm using a radix tree to store n-grams of the tokens (right before stemming). Later, this tree can be used for efficient autocomplete over the stored n-grams.
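To make the idea concrete, here is a minimal sketch of such a pre-stemming step. It is not the code from the extension: it assumes a plain, uncompressed trie in place of a radix tree, and it stores whole tokens and looks them up by prefix rather than storing n-grams. The names `Trie`, `autocompleteTrie` and `recordForAutocomplete` are made up for illustration.

```javascript
// Hypothetical helper: a plain trie standing in for the radix tree.
function Trie() {
  this.root = { children: {}, isWord: false }
}

// Insert a word one character at a time.
Trie.prototype.insert = function (word) {
  var node = this.root
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i)
    if (!node.children[ch]) {
      node.children[ch] = { children: {}, isWord: false }
    }
    node = node.children[ch]
  }
  node.isWord = true
}

// Return every stored word that starts with the given prefix,
// in alphabetical order.
Trie.prototype.complete = function (prefix) {
  var node = this.root
  for (var i = 0; i < prefix.length; i++) {
    node = node.children[prefix.charAt(i)]
    if (!node) return []
  }
  var results = []
  var walk = function (n, acc) {
    if (n.isWord) results.push(acc)
    Object.keys(n.children).sort().forEach(function (ch) {
      walk(n.children[ch], acc + ch)
    })
  }
  walk(node, prefix)
  return results
}

var autocompleteTrie = new Trie()

// Pipeline-style step: records the unstemmed token and passes it
// through unchanged, so a later stemmer step still sees it.
function recordForAutocomplete(token) {
  autocompleteTrie.insert(token)
  return token
}

// Simulate indexing three tokens.
var indexedTokens = ['search', 'searching', 'seat'].map(recordForAutocomplete)
var suggestions = autocompleteTrie.complete('sea')
```

Because the step returns the token untouched, it could in principle sit in front of the stemmer without changing what ends up in the search index.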

Unfortunately, there is a problem with this solution: lunr.js runs the indexing pipeline on each search. I can easily understand why you chose this approach, because the query string usually needs the same processing steps (stemming, stop-word filtering, etc.) that were used for indexing.

But in this case, I would need an "indexing pipeline" that is different from the "search/query string pipeline".

Previously I solved this in my application by setting a flag to true right after indexing to indicate that indexing has finished, but IMO that's a very hackish solution. If I want to extract this into a lunr extension, we would need a way to configure different pipelines for search and indexing.
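For reference, the flag workaround looks roughly like the sketch below. This is a simplified illustration, not the application code: `runSharedPipeline` is a hypothetical stand-in for the single pipeline that lunr runs for both indexing and searching, and `recorded` stands in for the autocomplete store.

```javascript
var recorded = []            // stand-in for the autocomplete store
var indexingFinished = false // flipped by hand once indexing is done

// The extra step only records tokens while indexing is in progress.
function recordUnlessSearching(token) {
  if (!indexingFinished) {
    recorded.push(token)
  }
  return token
}

// Hypothetical stand-in for the one pipeline shared by indexing and search.
function runSharedPipeline(tokens) {
  return tokens.map(recordUnlessSearching)
}

runSharedPipeline(['radix', 'tree']) // called while indexing
indexingFinished = true
runSharedPipeline(['rad'])           // called again on every search
```

The step has to know about global state that has nothing to do with token processing, which is why it does not fit well into a reusable extension.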

What is your opinion?

olivernn commented 10 years ago

Your extension looks really interesting. I'll take a look in more detail when I get a bit more time.

As you mention, lunr currently uses the same pipeline for both indexing and searching, which is very convenient for the current use case. I'm currently overhauling the way wildcards in search terms work, and it has also made me think about different pipelines for indexing and searching. I haven't quite reached a conclusion yet based on my own requirements, but I'll definitely consider your use case when trying to come up with a solution.

I guess the real issue is how to provide hooks that extensions can use during search. There are currently hooks for adding/updating/removing documents but nothing on the search side.

The work I'm doing is still quite exploratory; I'll try to get it into a state where I can push a branch here. It'd be great to allow some extensions for searching, and it's definitely something I'll take into account.

zs-zs commented 10 years ago

In my opinion, allowing different pipelines for indexing and searching is a feature that could be useful in other scenarios as well. It would create extension points for synonym expansion or more elaborate query parsing methods. Lucene / Solr also allow this via their Analyzer API.

I'd have to take a look to see how this is currently connected to the wildcard handling in lunr, but at first sight it doesn't seem to be a big change to allow different pipelines.

We would have to create two different pipelines in the index:

this.indexPipeline = new lunr.Pipeline  // the old pipeline
this.searchPipeline = new lunr.Pipeline   // the search pipeline

And then in the query method we would call the search pipeline:

var queryTokens = this.searchPipeline.run(lunr.tokenizer(query))

The new pipeline has to be initialized; by default it would contain the same steps as the indexPipeline:

idx.searchPipeline.add(
   lunr.trimmer,
   lunr.stopWordFilter,
   lunr.stemmer
)

The serialization / deserialization functionality of the index also has to be modified to handle the new pipeline.
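Putting the pieces above together, a self-contained sketch of the proposal could look like this. It does not use lunr itself: `Pipeline`, `Index`, `register` and the three step functions are hypothetical stand-ins, and the serialization format (an array of step labels per pipeline) is only an assumption about how the two pipelines might be stored.

```javascript
// Registry of named steps, so a pipeline can be serialized as labels.
var registry = {}
function register(label, fn) {
  fn.label = label
  registry[label] = fn
  return fn
}

// Minimal stand-in for a pipeline; NOT lunr's implementation.
function Pipeline() { this.steps = [] }
Pipeline.prototype.add = function () {
  for (var i = 0; i < arguments.length; i++) this.steps.push(arguments[i])
}
// Apply each step to all tokens in turn; a step drops a token
// by returning undefined.
Pipeline.prototype.run = function (tokens) {
  return this.steps.reduce(function (acc, step) {
    return acc.map(step).filter(function (t) { return t !== undefined })
  }, tokens)
}
Pipeline.prototype.toJSON = function () {
  return this.steps.map(function (step) { return step.label })
}
Pipeline.load = function (labels) {
  var pipeline = new Pipeline()
  labels.forEach(function (label) { pipeline.add(registry[label]) })
  return pipeline
}

// Index stand-in holding one pipeline per purpose.
function Index() {
  this.indexPipeline = new Pipeline()
  this.searchPipeline = new Pipeline()
}
Index.prototype.toJSON = function () {
  return {
    indexPipeline: this.indexPipeline.toJSON(),
    searchPipeline: this.searchPipeline.toJSON()
  }
}
Index.load = function (data) {
  var idx = new Index()
  idx.indexPipeline = Pipeline.load(data.indexPipeline)
  idx.searchPipeline = Pipeline.load(data.searchPipeline)
  return idx
}

// Illustrative steps; "record" is the index-only autocomplete hook.
var seen = []
register('lowercase', function (t) { return t.toLowerCase() })
register('dropThe', function (t) { return t === 'the' ? undefined : t })
register('record', function (t) { seen.push(t); return t })

var idx = new Index()
idx.indexPipeline.add(registry.lowercase, registry.dropThe, registry.record)
idx.searchPipeline.add(registry.lowercase, registry.dropThe)

var indexed = idx.indexPipeline.run(['The', 'Radix', 'Tree'])

// Round-trip through JSON, then search with the restored index.
var restored = Index.load(JSON.parse(JSON.stringify(idx)))
var queried = restored.searchPipeline.run(['The', 'RADIX'])
```

Only the index pipeline contains the recording step, so running a query through `searchPipeline` leaves `seen` untouched, and both pipelines survive the serialization round trip.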

This is just an overview, but it shows my approach. I don't think this change would be too problematic. What is your opinion?

Maybe I will try it out when I have time and create a pull request, so you can check the code.

olivernn commented 7 years ago

This is now supported in version 2 of Lunr which is available on npm now. Please take a look and let me know if you have any feedback, and open an issue if you run into any problems.

olivernn / lunr.js

different pipelines for indexing and searching #80