mattico / elasticlunr-rs

A partial port of elasticlunr to Rust. Intended to be used for generating compatible search indices.
Apache License 2.0

Allow custom tokenizers #32

Closed aconradi closed 3 years ago

aconradi commented 3 years ago

I'd like to see an API for building the index with a custom tokenizer. In my case I index documentation of (among other things) commands for a command-line interface. Those commands often have names like foo-bar, and users expect to be able to search for such a name and find its documentation. With the default tokenizer, the index instead contains the two separate tokens foo and bar. I did not find any API to change the default tokenizer.
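To illustrate the difference, here is a minimal, self-contained sketch. It does not use the elasticlunr-rs API: both functions are hypothetical stand-ins, and the assumption that the default tokenizer splits on whitespace and hyphens is inferred from the behavior described above.

```rust
/// Hypothetical stand-in for a default-style tokenizer.
/// Assumption: tokens are split on whitespace and hyphens, then lowercased.
fn tokenize_default(text: &str) -> Vec<String> {
    text.split(|c: char| c.is_whitespace() || c == '-')
        .filter(|s| !s.is_empty()) // drop empty pieces between adjacent separators
        .map(|s| s.to_lowercase())
        .collect()
}

/// Hypothetical custom tokenizer: split on whitespace only,
/// so hyphenated command names stay intact.
fn tokenize_keep_hyphens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|s| s.to_lowercase())
        .collect()
}
```

Under these assumptions, tokenize_default("Run foo-bar") yields run, foo, bar, while tokenize_keep_hyphens("Run foo-bar") yields run, foo-bar, which is what a search for foo-bar needs to match.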

I have a local patch to add an Index::add_doc_with_tokenizer method. I'd be happy to share if I can figure out the company policy for doing so. But I'm also not sure if that is the best way to add such functionality.

mattico commented 3 years ago

Yeah, it would be nice to allow custom tokenizers. It's already gotten a bit disorganized with e.g. tokenize_chinese. It's probably about time to consider a 3.0 with a better API that doesn't necessarily mirror the JS one so much.

I would take a patch that adds a hook somewhere for a custom tokenizer in the meantime. Index::add_doc_with_tokenizer seems fine given the limited flexibility of the Pipeline currently.
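As a rough sketch of what such a hook could look like, the following uses a toy inverted index. ToyIndex, its fields, and the method signatures are all hypothetical and only illustrate the idea of passing the tokenizer as a parameter; they are not the actual elasticlunr-rs Index or the patch that was eventually merged.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical toy index; NOT the elasticlunr-rs `Index` type.
#[derive(Default)]
struct ToyIndex {
    /// Maps each token to the set of document refs that contain it.
    postings: HashMap<String, BTreeSet<String>>,
}

impl ToyIndex {
    /// Default-style tokenizer (assumption: splits on whitespace and hyphens).
    fn default_tokenizer(text: &str) -> Vec<String> {
        text.split(|c: char| c.is_whitespace() || c == '-')
            .filter(|s| !s.is_empty())
            .map(|s| s.to_lowercase())
            .collect()
    }

    /// Adds a document using the default tokenizer.
    fn add_doc(&mut self, doc_ref: &str, body: &str) {
        self.add_doc_with_tokenizer(doc_ref, body, Self::default_tokenizer);
    }

    /// Adds a document using a caller-supplied tokenizer.
    fn add_doc_with_tokenizer<F>(&mut self, doc_ref: &str, body: &str, tokenizer: F)
    where
        F: Fn(&str) -> Vec<String>,
    {
        for token in tokenizer(body) {
            self.postings
                .entry(token)
                .or_default()
                .insert(doc_ref.to_string());
        }
    }

    /// Returns the refs of all documents containing `token`, in sorted order.
    fn lookup(&self, token: &str) -> Vec<&str> {
        self.postings
            .get(token)
            .map(|refs| refs.iter().map(String::as_str).collect::<Vec<_>>())
            .unwrap_or_default()
    }
}
```

In this sketch, a document added with add_doc is found under foo and bar but not under foo-bar, while one added through add_doc_with_tokenizer with a whitespace-only closure is found under foo-bar. Keeping add_doc as a thin wrapper around the new method means existing callers would be unaffected.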

mattico commented 3 years ago

Fixed by #37