Open mlvzk opened 3 years ago
I implemented this in #1073
Another common use case is synonym expansion. A tokenizer might emit synonyms too.
It is possible to emit synonyms at index time (faster search, larger index, less flexible) or at search time (faster indexing, smaller index, more flexible), but doing both is redundant.
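The search-time side of that trade-off can be sketched in plain Rust (the function name and synonym table here are hypothetical illustrations, not tantivy's tokenizer API):

```rust
use std::collections::HashMap;

// Hypothetical sketch of search-time synonym expansion: each query
// token is kept and any configured synonyms are appended, so the
// index itself never has to store the expanded terms.
fn expand_synonyms(tokens: &[&str], synonyms: &HashMap<&str, Vec<&str>>) -> Vec<String> {
    let mut out = Vec::new();
    for &tok in tokens {
        out.push(tok.to_string());
        if let Some(alts) = synonyms.get(tok) {
            out.extend(alts.iter().map(|s| s.to_string()));
        }
    }
    out
}
```

Running the same expansion at index time would instead store every synonym as a posting, which is why applying it on both sides would just match the same variants twice.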
@PSeitz can you have a look at this PR and eventually merge it?

@mlvzk In order to ease the release process, we try to keep CHANGELOG.md up to date. Can you add a new entry in the changelog for Tantivy 0.16 and add your contribution in one line as part of this PR?
I added an entry to the changelog and appended `Closes ...` to the commit message.
Is your feature request related to a problem? Please describe.
I'd like to use a different tokenizer for search than for indexing. I have a field tokenized into prefix-only ngrams and then filtered through lowercasing. For search I want only the lowercasing step, so that the tokens for my query look like ["sample"] instead of ["s", "sa", "sam", "samp", "sampl", "sample"]: I don't want results for "Samuel", just "sample" or words that start with "sample".
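The two pipelines described above can be sketched as standalone Rust functions (for illustration only, not tantivy's tokenizer API):

```rust
// Edge-ngram tokenizer used at index time: emits every prefix of the
// lowercased word, e.g. "Sample" -> ["s", "sa", ..., "sample"].
fn edge_ngrams(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.to_lowercase().chars().collect();
    (1..=chars.len()).map(|n| chars[..n].iter().collect()).collect()
}

// Lowercase-only tokenizer wanted at search time: emits the word itself.
fn lowercase_only(word: &str) -> Vec<String> {
    vec![word.to_lowercase()]
}
```

With only the lowercase step applied to the query, "sample" still matches the indexed prefix token "sample", but no longer matches "Samuel" through its short prefixes like "s" and "sa".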
Describe the solution you'd like
I'd like to be able to define a search tokenizer on a field in an index, similarly to how I can define a tokenizer right now, and have it used in all queries. My PR #1073 does it like this:
This feature is called "search_analyzer" in Elasticsearch, where it is defined on a field mapping: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html In Elasticsearch, "analyzer" is the indexing tokenizer, which is also used for searching when "search_analyzer" is not specified.
That link also explains why it's useful: