Created an initial pluggable tokenizer with ngram support in order to allow using lunr to drive autocomplete style search boxes.

olivernn / lunr.js

A bit like Solr, but much smaller and not as bright

http://lunrjs.com

MIT License

Created an initial pluggable tokenizer with ngram support in order to allow using lunr to drive autocomplete style search boxes. #63

Open wballard opened 10 years ago

wballard commented 10 years ago

I use this library all the time; thanks for making it available. One use case we keep coming back to is client-side autocomplete, and we have found that ngram indexing on the server (usually Elasticsearch) gives us the best results. I just need that functionality client side and in node.js, and would rather not go out of process to Elasticsearch if I can avoid it.

I tried to follow along with your style and formatting, and hopefully did so to your satisfaction.

This sets up an index-level tokenizer. I didn't go as far as #21, since that implies field-level pipelines and tokenizers, which would then need either an extension to the pipeline so that it 'starts' with a tokenizer and streams through multiple filters, or some other field object that combines a tokenizer and a pipeline.

wballard commented 10 years ago

Thanks -- I can see how I copy-pasted that same doc error.

olivernn commented 10 years ago

Many thanks for taking the time to look into this.

I think that an ngram tokeniser would make a great plugin for lunr. As part of the changes I am making for better i18n support, I am adding a very simple plugin system that I think you could take advantage of. It's great to have another potential use case for a plugin, so that I can make sure the API is flexible enough.

Let me take a closer look through your changes and see if I can make some suggestions on how to extract this as a plugin.

Thanks again!

hugovincent commented 10 years ago

Any update on this?

rowanoulton commented 10 years ago

Hey, is there an ETA for merging this or the plugin system mentioned? Would love to use it!

cvan commented 10 years ago

@olivernn can this be merged in or is the plugin system ready yet?

missinglink commented 8 years ago

I would also like to contribute ngram analyzers for autocomplete. What is the status of this? It's been open for a year now, so I'm hesitant to do any more work on it.

olivernn commented 8 years ago

The means to add plugins to lunr already exists. The main extension point is to modify an index's text processing pipeline. Each index has its own pipeline, so a plugin can safely modify the pipeline of the index it is being applied to.

I think, though, that in these cases the tokenizer needs to be modified. This is possible, but the tokenizer is global rather than per index, so all indexes will be forced to use the replacement tokenizer. This may or may not be a problem.

An example:

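var myNgramTokenizer = function () {
  // Replaces the global tokenizer, so every index is affected
  lunr.tokenizer = function (obj) {
    // ngram implementation
  }
}

// Apply the plugin to an existing index
idx.use(myNgramTokenizer)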

I'm not sure why the tokenizer is not a property of the lunr index instance; I will take a look at this.
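
For illustration, a minimal sketch of what the `// ngram implementation` placeholder above could contain, assuming edge n-grams (word prefixes) are what an autocomplete box needs. The names `edgeNgrams` and `makeNgramPlugin`, the gram-size parameters, and the word-splitting rule are hypothetical and not part of lunr; the only lunr pieces relied on are the global `lunr.tokenizer` and `idx.use` shown in the example.

```javascript
// Hypothetical helper, not part of lunr: split the input into lowercase
// words and emit the edge n-grams (prefixes) of each word.
function edgeNgrams(obj, minGram, maxGram) {
  if (obj === null || obj === undefined) return []

  // Assumption: words are separated by whitespace or hyphens.
  var words = String(obj).toLowerCase().split(/[\s-]+/).filter(Boolean)
  var grams = []

  words.forEach(function (word) {
    // Words shorter than minGram produce no tokens at all.
    var longest = Math.min(maxGram, word.length)
    for (var n = minGram; n <= longest; n++) {
      grams.push(word.slice(0, n))
    }
  })

  return grams
}

// Hypothetical plugin factory. `lunr` is passed in explicitly so the sketch
// does not depend on a global; the returned function is what would be
// handed to idx.use(), mirroring the example above.
function makeNgramPlugin(lunr, minGram, maxGram) {
  return function () {
    // The tokenizer is global, so this replaces it for every index.
    lunr.tokenizer = function (obj) {
      return edgeNgrams(obj, minGram, maxGram)
    }
  }
}
```

With lunr loaded, this would be applied as `idx.use(makeNgramPlugin(lunr, 2, 10))`. The gram sizes are arbitrary here, and how the resulting tokens interact with the rest of the pipeline (the stemmer, for example) would need checking against the lunr version in use.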

natcohen commented 4 years ago

@olivernn Great work! Any chance this could be merged? ngram and edgengram are must-haves nowadays... I'd love to see it built in or as a plugin.

tienne commented 2 years ago

Is there anything we can do?