wballard opened this 10 years ago
Thanks -- I can see how I totally copy-pasta'd that same doc error.
Many thanks for taking the time to look into this.
I think that an ngram tokenizer would make a great plugin for lunr. As part of the changes I am making for better i18n support, I am adding a very simple plugin system that I think you could take advantage of. It's great to have another potential use case for a plugin, so that I can make sure the API is flexible enough.
Let me take a closer look through your changes and see if I can make some suggestions of how to extract this as a plugin.
Thanks again!
Any update on this?
Hey, is there an ETA for merging this or the plugin system mentioned? Would love to use it!
@olivernn can this be merged in or is the plugin system ready yet?
I would also like to contribute ngram analyzers for autocomplete. What is the status of this? It's been open for a year now, so I'm hesitant to do any more work on it.
The means to add plugins to lunr already exists. The main extension point is modifying an index's text processing pipeline. Each index has its own pipeline, so a plugin can safely modify the pipeline of the index it is being applied to.
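For illustration, a pipeline plugin might look like the following sketch. The `lowerCaseFilter` function and its label are made up for this example; `lunr.Pipeline.registerFunction`, `pipeline.add`, and `idx.use` are existing lunr APIs.

```javascript
// Hypothetical pipeline function: receives each token and returns the
// (possibly transformed) token; returning undefined drops the token.
var lowerCaseFilter = function (token) {
  return token.toLowerCase()
}

// Registering by name lets a serialised index rebuild its pipeline.
lunr.Pipeline.registerFunction(lowerCaseFilter, 'lowerCaseFilter')

// The plugin receives the index it is applied to and can safely
// modify that index's own pipeline.
var myPlugin = function (idx) {
  idx.pipeline.add(lowerCaseFilter)
}

var idx = lunr(function () {
  this.field('title')
  this.field('body')
})

idx.use(myPlugin)
```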
I think, though, that in these cases the tokenizer itself needs to be replaced. This is possible, but the tokenizer is currently global rather than per index, so every index will be forced to use the replacement tokenizer. This may or may not be a problem.
An example:
```javascript
var myNgramTokenizer = function () {
  lunr.tokenizer = function (obj) {
    // ngram implementation
  }
}

idx.use(myNgramTokenizer)
```
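For concreteness, the "ngram implementation" comment above could be filled in along these lines; the gram size (3) and the whitespace splitting are my assumptions, not part of the thread:

```javascript
var myNgramTokenizer = function () {
  lunr.tokenizer = function (obj) {
    if (obj == null) return []
    if (Array.isArray(obj)) {
      return obj.map(function (t) { return t.toString().toLowerCase() })
    }

    var grams = []
    obj.toString().trim().toLowerCase().split(/\s+/).forEach(function (word) {
      if (word.length <= 3) {
        // keep short words whole so they remain searchable
        grams.push(word)
      } else {
        // emit every overlapping trigram of the word
        for (var i = 0; i + 3 <= word.length; i++) {
          grams.push(word.slice(i, i + 3))
        }
      }
    })
    return grams
  }
}

idx.use(myNgramTokenizer)
```

Because the tokenizer is global, every index built after `idx.use(myNgramTokenizer)` runs will tokenize into trigrams, not just `idx`.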
I'm not sure why the tokenizer is not a property of the index instance; I will take a look at this.
@olivernn Great work! Any chance this could be merged? ngram and edge-ngram are must-haves nowadays... I'd love to see this built in or as a plugin.
Is there anything we can do?
I use this library all the time, thanks for making it available. One use case we keep running into is client-side autocomplete, and we have found that ngram indexing on the server -- usually ElasticSearch -- gives us the best results. I just need that functionality client side, and in node.js, and don't care to fuss with going out of process to ElasticSearch if I can avoid it.
I tried to follow along with your style and formatting, and hopefully did so to your satisfaction.
This sets up an index-level tokenizer. I didn't dive as far in as #21, since that implies field-level pipelines and tokenizers -- which would really require extending the pipeline to 'start' with a tokenizer and then stream through multiple filters, or some other field object that combines a tokenizer and a pipeline.
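To make that last idea concrete, here is a rough sketch of what such a field object could look like. `FieldAnalyzer` and its `run` method are hypothetical names; nothing like this exists in lunr today, but `lunr.Pipeline`, `pipeline.run`, and the stock pipeline functions do.

```javascript
// Hypothetical: pair a tokenizer with a pipeline so each field owns
// its whole analysis chain, from raw text to filtered tokens.
function FieldAnalyzer (tokenizer, pipeline) {
  this.tokenizer = tokenizer
  this.pipeline = pipeline
}

FieldAnalyzer.prototype.run = function (text) {
  // 'start' with the tokenizer, then stream through the filters
  return this.pipeline.run(this.tokenizer(text))
}

// Example wiring with lunr's stock pipeline functions:
var pipeline = new lunr.Pipeline()
pipeline.add(lunr.trimmer, lunr.stopWordFilter, lunr.stemmer)

var titleAnalyzer = new FieldAnalyzer(lunr.tokenizer, pipeline)
titleAnalyzer.run('Client side full text search')
```

A design like this would also remove the global-tokenizer problem discussed above, since each field (and therefore each index) would carry its own tokenizer.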