quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust.
MIT License

Using AsciiFoldingFilter #2290

Closed · AndreasMeier12 closed this issue 8 months ago

AndreasMeier12 commented 8 months ago

I'm having trouble setting up a text search that matches characters with diacritics to their "base" characters, e.g. "ö" to "o" and vice versa. I started by modifying the example code (https://tantivy-search.github.io/examples/basic_search.html). How do I address this?

My ultimate aim is to evaluate Tantivy for a hobby project.
I would like to have an index of recipe names/ingredients. It's not a matter of life and death.

I'm using Tantivy 0.21.1

What do I want to accomplish?

I would like searches with and without diacritics to match words with and without diacritics, e.g. searching for either 'old' or 'öld' should match both 'old' and 'öld'. Let me illustrate with an example.

I have a matrix of titles and queries. I would like to find the title in each case. Titles are the top row and queries the left-most column. x marks success.

| title (→) / query (↓) | Old | old | Öld | öld |
|---|---|---|---|---|
| Old | x | x | x | x |
| old | x | x | x | x |
| Öld |   |   | x | x |
| öld |   |   | x | x |

The observed behavior is that the lower-left quadrant (queries with diacritics against titles without) does not yield any results. My guess is that the query is not tokenized in the same way as the title.

What did I try?

I tried adding the AsciiFoldingFilter to a TokenizerManager. The TokenizerManager is set on the index, so it should affect both indexing and querying.

Code is also available on pastebin for easier copying.

Setting up the schema:

    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT | STORED);
    let schema = schema_builder.build();

Creating the index and adding a document:

    let manager = TokenizerManager::default();
    let mut tokenizer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(AsciiFoldingFilter)
        .build();
    let index = Index::builder().tokenizers(manager).schema(schema.clone()).create_from_tempdir()?;
    // Writer with a 50 MB indexing budget, as in the basic_search example.
    let mut index_writer = index.writer(50_000_000)?;
    let title = schema.get_field("title").unwrap();
    let body = schema.get_field("body").unwrap();
    let mut old_man_doc = Document::default();
    old_man_doc.add_text(title, "The öld man");
    old_man_doc.add_text(
        body,
        "He was an old man who fished alone in a skiff in the Gulf Stream and \
         he had gone eighty-four days now without taking a fish.",
    );
    index_writer.add_document(old_man_doc)?;
    index_writer.commit()?;

And finally, querying:

    let query_parser = QueryParser::for_index(&index, vec![title, body]);
    let query = query_parser.parse_query("öld")?;
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let searcher = reader.searcher();

    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{}", schema.to_json(&retrieved_doc));
    }

    Ok(())

I would expect to hit a breakpoint in fold_non_ascii_char (/tantivy-0.21.1/src/tokenizer/ascii_folding_filter.rs:72), but I never reach this breakpoint.

As a control, I set a breakpoint in add_text_field() to check whether I'm able to debug into tantivy at all. This worked.

This suggests that I did not register the tokenizer correctly.
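For reference, a minimal sketch of what I believe the missing wiring would look like: register the analyzer under a name and declare the fields with that tokenizer name. The name "ascii_folding" and the extra LowerCaser (for the case-insensitive rows in the matrix above) are my own choices, not something the docs prescribe:

    use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
    use tantivy::tokenizer::{
        AsciiFoldingFilter, LowerCaser, SimpleTokenizer, TextAnalyzer, TokenizerManager,
    };
    use tantivy::Index;

    // Register the folding analyzer under a name ("ascii_folding" is arbitrary).
    let manager = TokenizerManager::default();
    manager.register(
        "ascii_folding",
        TextAnalyzer::builder(SimpleTokenizer::default())
            .filter(LowerCaser)
            .filter(AsciiFoldingFilter)
            .build(),
    );

    // Declare the fields so they use this analyzer instead of the "default" one.
    let text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("ascii_folding")
                .set_index_option(IndexRecordOption::WithFreqsAndPositions),
        )
        .set_stored();
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", text_options.clone());
    let body = schema_builder.add_text_field("body", text_options);
    let schema = schema_builder.build();

    // Hand the manager to the index so both indexing and the QueryParser use it.
    let index = Index::builder()
        .tokenizers(manager)
        .schema(schema.clone())
        .create_from_tempdir()?;

The rest of the indexing and querying code would stay as above.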

PSeitz commented 8 months ago

In this case you would need a custom tokenizer that emits the different variants: https://github.com/quickwit-oss/tantivy/blob/main/examples/custom_tokenizer.rs. I don't think the tokenizer in your example does what you want.

AndreasMeier12 commented 8 months ago

Thank you for your quick answer! This solves my problem.

To document the solution: I used https://github.com/quickwit-oss/tantivy/blob/main/examples/stop_words.rs as a reference for adding filters. The resulting code is here.
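For anyone landing here later, a quick way to sanity-check the folding is to fetch the registered analyzer back from the index and tokenize a sample string. This is only a sketch and assumes the analyzer was registered under the name "ascii_folding" as in the snippet further up:

    // Fetch the analyzer from the index's TokenizerManager and run it on a sample.
    let mut analyzer = index
        .tokenizers()
        .get("ascii_folding")
        .expect("analyzer was registered above");
    let mut stream = analyzer.token_stream("The Öld man");
    while stream.advance() {
        // With LowerCaser + AsciiFoldingFilter this prints "the", "old", "man".
        println!("{}", stream.token().text);
    }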