Allow defining a different tokenizer for search than indexing

mlvzk commented 3 years ago

Is your feature request related to a problem? Please describe. I'd like to use a different tokenizer for search than for indexing. I have a prefix-only ngram field whose tokens are then lowercased. For search I only want the lowercasing step, so that the tokens for my search query look like ["sample"] instead of ["s", "sa", "sam", "samp", "sampl", "sample"]. I don't want results for "Samuel", just for "sample" or words that start with "sample".
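To make the problem concrete, here is a minimal std-only Rust sketch of the two analysis chains. It does not use tantivy's API: the function names (`index_tokens`, `search_tokens`, `matches_any`) are hypothetical, and it assumes a query matches a document when any query token is among the document's indexed tokens.

```rust
/// Index-time analysis: lowercase each word, then emit every prefix
/// of it (edge n-grams). Hypothetical helper, not a tantivy function.
fn index_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .flat_map(|word| {
            let chars: Vec<char> = word.to_lowercase().chars().collect();
            (1..=chars.len())
                .map(|n| chars[..n].iter().collect::<String>())
                .collect::<Vec<String>>()
        })
        .collect()
}

/// Search-time analysis: lowercase only, no prefix expansion.
fn search_tokens(query: &str) -> Vec<String> {
    query.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Assumption: a document matches when ANY query token was indexed for it.
fn matches_any(doc: &str, query_tokens: &[String]) -> bool {
    let indexed = index_tokens(doc);
    query_tokens.iter().any(|t| indexed.contains(t))
}

fn main() {
    let query = "Sample";
    let same_analyzer = index_tokens(query); // query expanded into prefixes
    let separate_analyzer = search_tokens(query); // query kept whole

    for doc in ["Samuel", "samples"].iter() {
        println!(
            "{}: same analyzer = {}, separate search analyzer = {}",
            doc,
            matches_any(doc, &same_analyzer),
            matches_any(doc, &separate_analyzer)
        );
    }
}
```

With the index-time chain applied to the query, the short prefixes "s", "sa" and "sam" are enough to match "Samuel". With the lowercase-only chain, only documents that contain a word starting with "sample" match.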

Describe the solution you'd like. I'd like to be able to define a search tokenizer on a field in an index, similar to how I can define a tokenizer right now, and have it used in all queries. My PR #1073 does it like this:

TextFieldIndexing::default()
    .set_tokenizer("default")
    .set_search_tokenizer("raw")

This feature is called "search_analyzer" in Elasticsearch, where it is defined on a field mapping: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html. In Elasticsearch, "analyzer" is the indexing tokenizer; it is also used for searching when "search_analyzer" is not specified.

That link also explains why it's useful:

Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete or when using search-time synonyms.

mlvzk commented 3 years ago

I implemented this in #1073

fulmicoton commented 3 years ago

Another common use case is synonym expansion. A tokenizer might emit synonyms too.

It is possible to emit synonyms at index time (faster search, larger index, less flexible) or at search time (faster indexing, smaller index, more flexible), but doing both is useless.
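The trade-off can be sketched in a few lines of std-only Rust. This is an illustration under stated assumptions, not tantivy code: the synonym table and the helpers `synonyms`, `plain`, `expanded` and `hit` are all hypothetical, and a hit is assumed to mean that the indexed and query token sets share at least one token.

```rust
use std::collections::HashSet;

/// Hypothetical synonym table, for illustration only.
fn synonyms(token: &str) -> &'static [&'static str] {
    match token {
        "car" => &["automobile"],
        "automobile" => &["car"],
        _ => &[],
    }
}

/// Analysis without synonyms: split on whitespace and lowercase.
fn plain(text: &str) -> HashSet<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Analysis with synonyms: the plain tokens plus their synonyms.
fn expanded(text: &str) -> HashSet<String> {
    let mut out = plain(text);
    let extra: Vec<String> = out
        .iter()
        .flat_map(|t| synonyms(t).iter().map(|s| s.to_string()))
        .collect();
    out.extend(extra);
    out
}

/// Assumption: a hit means the two token sets share at least one token.
fn hit(indexed: &HashSet<String>, query: &HashSet<String>) -> bool {
    !indexed.is_disjoint(query)
}

fn main() {
    let (doc, query) = ("red car", "automobile");

    // Index-time expansion: more tokens stored, query left as-is.
    let at_index = hit(&expanded(doc), &plain(query));
    // Search-time expansion: fewer tokens stored, query expanded.
    let at_search = hit(&plain(doc), &expanded(query));

    println!("index-time: {} ({} tokens stored)", at_index, expanded(doc).len());
    println!("search-time: {} ({} tokens stored)", at_search, plain(doc).len());
}
```

Either side alone is enough to match "red car" against "automobile". Expanding on both sides finds the same document again while still paying the larger index.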

fulmicoton commented 3 years ago

@PSeitz can you have a look at this PR and, if appropriate, merge it? @mlvzk In order to ease the release process, we try to keep CHANGELOG.md up to date. Can you add a new entry in the changelog for Tantivy 0.16 and describe your contribution in one line as part of this PR?

mlvzk commented 3 years ago

I added an entry to the changelog and added Closes ... to the commit.

quickwit-oss/tantivy, issue #1074