quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust.

Natural language queries exhibit unexpected behavior when processing Chinese text. #2472

Open MochiXu opened 1 month ago

MochiXu commented 1 month ago

Describe the bug

Currently, the natural language query feature works well for English. For example, given a query like "(Who is Obama) OR (good boy)", Tantivy parses it into a BooleanQuery whose subqueries are composed of TermQuerys:

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery {
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "good"))), 
                (Should, TermQuery(Term(field=1, type=Str, "boy")))
            ] })
    ] 
}

This looks quite reasonable. However, with Chinese text the behavior is unexpected. For example, when parsing the query "(Who is Obama) OR 伊文斯隐瞒秘密", Tantivy turns the Chinese part into a PhraseQuery:

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, PhraseQuery { 
             field: Field(1), phrase_terms: [
                 (0, Term(field=1, type=Str, "伊文")), 
                 (1, Term(field=1, type=Str, "伊文斯")), 
                 (2, Term(field=1, type=Str, "隐瞒")), 
                 (3, Term(field=1, type=Str, "秘密"))], slop: 0 
         })
] }
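
As far as I can tell, this happens because the query parser treats 伊文斯隐瞒秘密 as a single word, and when the field's tokenizer splits one word into several tokens, the parser builds a PhraseQuery from them. A minimal sketch to inspect those tokens (assuming `index` is an Index whose Chinese tokenizer is registered under the name "lang_zh"; both names are assumptions for illustration):

// Print the tokens the registered Chinese tokenizer produces for the fragment.
// "lang_zh" is a placeholder for whatever name the Cang-jie/ICU tokenizer was registered under.
let mut analyzer = index.tokenizers().get("lang_zh").expect("tokenizer not registered");
let mut stream = analyzer.token_stream("伊文斯隐瞒秘密");
while stream.advance() {
    let token = stream.token();
    println!("position {}: {}", token.position, token.text);
}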

This behavior differs from what we expect. When parsing Chinese, we expect it to also use Should to combine each individual token, as shown in the expected output below.

BooleanQuery {
    subqueries: [
        (Should, BooleanQuery { 
            subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "who"))), 
                (Should, TermQuery(Term(field=1, type=Str, "is"))), 
                (Should, TermQuery(Term(field=1, type=Str, "obama")))
            ] 
        }), 
        (Should, BooleanQuery { 
             subqueries: [
                (Should, TermQuery(Term(field=1, type=Str, "伊文"))), 
                (Should, TermQuery(Term(field=1, type=Str, "伊文斯"))), 
                (Should, TermQuery(Term(field=1, type=Str, "隐瞒"))),
                (Should, TermQuery(Term(field=1, type=Str, "秘密")))
             ]
         })
] }

Which version of tantivy are you using? Our tantivy-search is based on Tantivy 0.21.1.

To Reproduce

The current Tantivy code may not ship a Chinese tokenizer out of the box: with the default tokenizer, "伊文斯隐瞒秘密" is treated as a single token. We have integrated the Cang-jie and ICU tokenizers into tantivy-search, which tokenize Chinese text properly.

To reproduce the unexpected parsing behavior for Chinese natural language queries, you may need to first integrate a simple Cang-jie tokenizer into Tantivy, then use the following code to recreate the scenario:

  // `parser` is a QueryParser over a text field indexed with the Chinese tokenizer.
  let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
  let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
  println!("{:?}", text_query);
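
For completeness, here is a self-contained sketch of that setup (the field name, the "lang_zh" tokenizer name, and the commented-out registration line are assumptions; an actual Chinese tokenizer such as Cang-jie has to be registered under that name before the snippet will run):

use tantivy::query::{Query, QueryParser};
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

fn main() {
    // Index a text field whose tokenizer is registered under "lang_zh".
    let mut schema_builder = Schema::builder();
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("lang_zh")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let options = TextOptions::default().set_indexing_options(indexing);
    let body = schema_builder.add_text_field("body", options);
    let index = Index::create_in_ram(schema_builder.build());

    // Register the Chinese tokenizer (Cang-jie, ICU, ...) under the same name:
    // index.tokenizers().register("lang_zh", chinese_tokenizer);

    let parser = QueryParser::for_index(&index, vec![body]);
    let sentence = "(Who is Obama) OR 伊文斯隐瞒秘密";
    let text_query: Box<dyn Query> = parser.parse_query(sentence).unwrap();
    println!("{:?}", text_query);
}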
MochiXu commented 1 month ago

Would it be possible to add a new subquery type (such as a TermsQuery) to LogicalLiteral, and to introduce a special character in the natural language query to mark languages such as Chinese and Japanese? These are the only potential solutions I can think of at the moment.
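
In the meantime, a possible workaround on the caller side is to tokenize the CJK fragment yourself and OR the resulting terms, instead of letting the parser turn the multi-token word into a phrase. A rough sketch, assuming the field's Chinese tokenizer is registered under "lang_zh"; the helper function and names here are hypothetical, only the tantivy calls shown are real API:

use tantivy::query::{BooleanQuery, Occur, Query, TermQuery};
use tantivy::schema::{Field, IndexRecordOption};
use tantivy::{Index, Term};

// Tokenize `text` with the analyzer registered under `tokenizer_name` and
// combine every resulting token with Should, mirroring the expected output above.
fn should_query_for_text(
    index: &Index,
    field: Field,
    tokenizer_name: &str,
    text: &str,
) -> Option<BooleanQuery> {
    let mut analyzer = index.tokenizers().get(tokenizer_name)?;
    let mut stream = analyzer.token_stream(text);
    let mut subqueries: Vec<(Occur, Box<dyn Query>)> = Vec::new();
    while stream.advance() {
        let term = Term::from_field_text(field, &stream.token().text);
        let term_query: Box<dyn Query> =
            Box::new(TermQuery::new(term, IndexRecordOption::WithFreqs));
        subqueries.push((Occur::Should, term_query));
    }
    Some(BooleanQuery::new(subqueries))
}

For example, should_query_for_text(&index, body, "lang_zh", "伊文斯隐瞒秘密") would yield the Should-combined TermQuerys shown in the expected output, rather than a PhraseQuery.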