Closed: AndreasMeier12 closed this issue 8 months ago.
In this case you would need a custom tokenizer: https://github.com/quickwit-oss/tantivy/blob/main/examples/custom_tokenizer.rs, which emits the different variants. I don't think the tokenizer in your example does what you want.
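Sketched minimally (VariantFoldingFilter, VariantFoldingTokenizer, VariantFoldingStream, and the toy fold_char table below are hypothetical illustration names, not tantivy API; the wrapper assumes tantivy 0.21's tokenizer traits), such a tokenizer could wrap an inner token stream and re-emit any token that contains diacritics a second time in folded form, at the same position:

```rust
use tantivy::tokenizer::{Token, TokenFilter, TokenStream, Tokenizer};

/// Hypothetical filter: emits each token containing diacritics twice,
/// once as-is and once with the diacritics folded away.
#[derive(Clone)]
pub struct VariantFoldingFilter;

impl TokenFilter for VariantFoldingFilter {
    type Tokenizer<T: Tokenizer> = VariantFoldingTokenizer<T>;

    fn transform<T: Tokenizer>(self, tokenizer: T) -> VariantFoldingTokenizer<T> {
        VariantFoldingTokenizer { inner: tokenizer }
    }
}

#[derive(Clone)]
pub struct VariantFoldingTokenizer<T> {
    inner: T,
}

impl<T: Tokenizer> Tokenizer for VariantFoldingTokenizer<T> {
    type TokenStream<'a> = VariantFoldingStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        VariantFoldingStream {
            inner: self.inner.token_stream(text),
            current: Token::default(),
            pending: None,
        }
    }
}

pub struct VariantFoldingStream<S> {
    inner: S,
    current: Token,
    pending: Option<Token>,
}

/// Toy folding table for illustration only; a real implementation would
/// cover the full mapping (compare tantivy's AsciiFoldingFilter).
fn fold_char(c: char) -> char {
    match c {
        'ä' => 'a',
        'ö' => 'o',
        'ü' => 'u',
        _ => c,
    }
}

impl<S: TokenStream> TokenStream for VariantFoldingStream<S> {
    fn advance(&mut self) -> bool {
        // Emit a queued folded variant before pulling the next token.
        if let Some(variant) = self.pending.take() {
            self.current = variant;
            return true;
        }
        if !self.inner.advance() {
            return false;
        }
        self.current = self.inner.token().clone();
        let folded: String = self.current.text.chars().map(fold_char).collect();
        if folded != self.current.text {
            // Queue the folded form at the same position as the original,
            // so a query for either form hits the same spot.
            let mut variant = self.current.clone();
            variant.text = folded;
            self.pending = Some(variant);
        }
        true
    }

    fn token(&self) -> &Token {
        &self.current
    }

    fn token_mut(&mut self) -> &mut Token {
        &mut self.current
    }
}
```

An analyzer using it could then be built with TextAnalyzer::builder(SimpleTokenizer::default()).filter(VariantFoldingFilter).build() and registered on the index's tokenizer manager.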
Thank you for your quick answer! This solves my problem.
To document the solution: I used https://github.com/quickwit-oss/tantivy/blob/main/examples/stop_words.rs as a reference for adding filters. The resulting code is here
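For readers landing here, the core of that pattern, with AsciiFoldingFilter swapped in for the stop-word filter, looks roughly like this (the tokenizer name "folding" is an arbitrary choice):

```rust
use tantivy::tokenizer::{AsciiFoldingFilter, LowerCaser, SimpleTokenizer, TextAnalyzer};

// Lowercase first, then strip diacritics, so "Öld", "öld" and "old"
// all end up indexed as the same term "old".
let folding_analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(LowerCaser)
    .filter(AsciiFoldingFilter)
    .build();

// `index` is a tantivy::Index; the name must match the tokenizer name
// configured on the field in the schema.
index.tokenizers().register("folding", folding_analyzer);
```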
I have problems setting up text search that matches characters with diacritics to their "base" characters, e.g. "ö" to "o", and vice versa. I started by modifying the example code (https://tantivy-search.github.io/examples/basic_search.html). How do I address this?
My ultimate aim is to evaluate Tantivy for a hobby project.
I would like to have an index of recipe names/ingredients. It's not a matter of life and death.
I'm using Tantivy 0.21.1
What do I want to accomplish?
I would like searches with and without diacritics to match words with and without diacritics, e.g. searching for 'old' or 'öld' should each match both 'old' and 'öld'. Let me illustrate with an example.
I have a matrix of titles and queries, and I would like the query to find the title in each case. Titles are the top row and queries the left-most column; "x" marks success.

[matrix of titles and queries; "x" marks a successful match]

The observed behavior is that the lower-left quadrant, presumably queries with diacritics run against titles without them, does not yield any result. My guess is that the query is not tokenized in the same way as the title.
What did I try?
I tried adding the AsciiFoldingFilter to a TokenizerManager. The TokenizerManager is set on the index, so it should affect both reading and writing.
Code is also available on pastebin for easier copying.
Setting up a schema:
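A minimal sketch of such a schema, assuming the custom tokenizer is registered under the hypothetical name "folding":

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

let mut schema_builder = Schema::builder();
// The field has to name its tokenizer explicitly; otherwise it uses the
// "default" tokenizer and a registered filter chain never runs.
let text_indexing = TextFieldIndexing::default()
    .set_tokenizer("folding")
    .set_index_option(IndexRecordOption::WithFreqsAndPositions);
let text_options = TextOptions::default()
    .set_indexing_options(text_indexing)
    .set_stored();
let title = schema_builder.add_text_field("title", text_options);
let schema = schema_builder.build();
```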
Creating the index:
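A sketch of the index setup, assuming folding_analyzer is the LowerCaser + AsciiFoldingFilter chain shown in the comment further up:

```rust
use tantivy::{doc, Index};

let index = Index::create_in_ram(schema.clone());
// Registering on the index's TokenizerManager makes the chain available
// to both the IndexWriter and the QueryParser.
index.tokenizers().register("folding", folding_analyzer);

let mut writer = index.writer(50_000_000)?;
writer.add_document(doc!(title => "old"))?;
writer.add_document(doc!(title => "öld"))?;
writer.commit()?;
```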
And finally, querying:
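A sketch of the query side. QueryParser::for_index picks up the tokenizer configured for the field, so the query text is folded the same way as the indexed titles:

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;

let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = QueryParser::for_index(&index, vec![title]);
// "öld" is lowercased and folded to "old" before the term lookup,
// so it should match both "old" and "öld" titles.
let query = query_parser.parse_query("öld")?;
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
for (_score, doc_address) in top_docs {
    println!("{}", schema.to_json(&searcher.doc(doc_address)?));
}
```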
I would expect to hit a breakpoint in fold_non_ascii_char (tantivy-0.21.1/src/tokenizer/ascii_folding_filter.rs:72). I never reach this breakpoint. As a control, I set a breakpoint in add_text_field() to check whether I can debug into Tantivy at all; that one is hit. This sounds like I did not register the tokenizer correctly.