quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust.

Natural language queries exhibit unexpected behavior when processing Chinese text. #2472

Open MochiXu opened 1 month ago

MochiXu commented 1 month ago

Describe the bug

Currently, the natural language query feature works well for English. For example, given a query like "(Who is Obama) OR (good boy)", Tantivy parses it into a BooleanQuery whose subqueries are composed of TermQuerys:

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "good"))), 
                (Should, TermQuery(Term(field=1, type=Str, "boy")))
            ] })
    ] 
}

This looks quite reasonable. However, with Chinese text the behavior is unexpected. For example, when parsing the query "(Who is Obama) OR 伊文斯隐瞒秘密", Tantivy turns the Chinese part into a PhraseQuery:

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, PhraseQuery { 
             field: Field(1), phrase_terms: [
                 (0, Term(field=1, type=Str, "伊文")), 
                 (1, Term(field=1, type=Str, "伊文斯")), 
                 (2, Term(field=1, type=Str, "隐瞒")), 
                 (3, Term(field=1, type=Str, "秘密"))], slop: 0 
         })
] }
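
As far as I can tell, this happens because the query parser treats 伊文斯隐瞒秘密 as a single word, and when the field's tokenizer splits one word into several tokens, the parser builds a PhraseQuery from them. A minimal sketch to inspect those tokens (assuming `index` is an Index whose Chinese tokenizer is registered under the name "lang_zh"; both names are assumptions for illustration):

// Print the tokens the registered Chinese tokenizer produces for the fragment.
// "lang_zh" is a placeholder for whatever name the Cang-jie/ICU tokenizer was registered under.
let mut analyzer = index.tokenizers().get("lang_zh").expect("tokenizer not registered");
let mut stream = analyzer.token_stream("伊文斯隐瞒秘密");
while stream.advance() {
    let token = stream.token();
    println!("position {}: {}", token.position, token.text);
}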

This behavior differs from what we expect. When parsing Chinese, we expect it to also use Should to combine each individual token, as shown in the expected output below.

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, BooleanQuery { 
             subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "伊文"))), 
                (Should, TermQuery(Term(field=1, type=Str, "伊文斯"))), 
                (Should, TermQuery(Term(field=1, type=Str, "隐瞒"))),
                (Should, TermQuery(Term(field=1, type=Str, "秘密")))
             ]
         })
] }

Which version of tantivy are you using? Our tantivy-search is based on Tantivy 0.21.1.

To Reproduce

The current Tantivy code may not ship a Chinese tokenizer out of the box: with the default tokenizer, "伊文斯隐瞒秘密" is treated as a single token. We have integrated the Cang-jie and ICU tokenizers into tantivy-search, which tokenize Chinese text properly.

To reproduce the unexpected parsing behavior for Chinese natural language queries, you may need to first integrate a simple Cang-jie tokenizer into Tantivy, then use the following code to recreate the scenario:

  // `parser` is a QueryParser over a text field indexed with the Chinese tokenizer.
  let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
  let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
  println!("{:?}", text_query);
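
For completeness, here is a self-contained sketch of that setup (the field name, the "lang_zh" tokenizer name, and the commented-out registration line are assumptions; an actual Chinese tokenizer such as Cang-jie has to be registered under that name before the snippet will run):

use tantivy::query::{Query, QueryParser};
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

fn main() {
    // Index a text field whose tokenizer is registered under "lang_zh".
    let mut schema_builder = Schema::builder();
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("lang_zh")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default().set_indexing_options(indexing);
    let body = schema_builder.add_text_field("body", options);
    let index = Index::create_in_ram(schema_builder.build());

    // Register the Chinese tokenizer (Cang-jie, ICU, ...) under the same name:
    // index.tokenizers().register("lang_zh", chinese_tokenizer);

    let parser = QueryParser::for_index(&index, vec![body]);
    let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
    let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
    println!("{:?}", text_query);
}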
MochiXu commented 1 month ago

Would it be possible to add a new subquery type (such as a TermsQuery) to LogicalLiteral, and to introduce a special character in the natural language query to mark languages such as Chinese and Japanese? These are the only potential solutions I can think of at the moment.
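
In the meantime, a possible workaround on the caller side is to tokenize the CJK fragment yourself and OR the resulting terms, instead of letting the parser turn the multi-token word into a phrase. A rough sketch, assuming the field's Chinese tokenizer is registered under "lang_zh"; the helper function and names here are hypothetical, only the tantivy calls shown are real API:

use tantivy::query::{BooleanQuery, Occur, Query, TermQuery};
use tantivy::schema::{Field, IndexRecordOption};
use tantivy::{Index, Term};

// Tokenize `text` with the analyzer registered under `tokenizer_name` and
// combine every resulting token with Should, mirroring the expected output above.
fn should_query_for_text(
    index: &Index,
    field: Field,
    tokenizer_name: &str,
    text: &str,
) -> Option<BooleanQuery> {
    let mut analyzer = index.tokenizers().get(tokenizer_name)?;
    let mut stream = analyzer.token_stream(text);
    let mut subqueries: Vec<(Occur, Box<dyn Query>)> = Vec::new();
    while stream.advance() {
        let term = Term::from_field_text(field, &stream.token().text);
        let term_query: Box<dyn Query> =
            Box::new(TermQuery::new(term, IndexRecordOption::WithFreqs));
        subqueries.push((Occur::Should, term_query));
    }
    Some(BooleanQuery::new(subqueries))
}

For example, should_query_for_text(&index, body, "lang_zh", "伊文斯隐瞒秘密") would yield the Should-combined TermQuerys shown in the expected output, rather than a PhraseQuery.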