Open MochiXu opened 3 months ago
Is it possible to consider adding a new subquery type (such as a `TermsQuery`) to `LogicalLiteral` and introducing a special character in the natural language query to represent special languages (such as Chinese, Japanese, etc.)? Currently, these are the only potential solutions I can think of.
Describe the bug
Currently, the natural language query feature works well in an English environment. For example, with a query like `"(Who is Obama) OR (good boy)"`, Tantivy parses it into a `BooleanQuery`, with each subquery composed of `TermQuery` clauses.
This looks quite reasonable. However, in a Chinese language environment, unexpected behavior occurs. For example, when parsing the query `"(Who is Obama) OR 伊文斯隐瞒秘密"`, Tantivy interprets the Chinese part as a `PhraseQuery`.
This behavior differs from what we expect. When parsing Chinese, we expect it to also use `Should` to combine the individual tokens, as demonstrated below in our expected behavior.
Which version of tantivy are you using?
Our tantivy-search is based on Tantivy 0.21.1.
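To make the difference between the reported and the expected query shapes concrete, here is a minimal Rust sketch. It is a toy model only: the `Query` and `Occur` types below are illustrative stand-ins and are not Tantivy's own types, and the segmentation `["伊文斯", "隐瞒", "秘密"]` is an assumed tokenizer output used purely for illustration.

```rust
// Toy model of the two query shapes discussed in this issue.
// NOTE: these types are hypothetical stand-ins, not Tantivy's API.
#[derive(Debug, Clone, PartialEq)]
enum Occur {
    Should,
}

#[derive(Debug, Clone, PartialEq)]
enum Query {
    Term(String),
    Phrase(Vec<String>),
    Boolean(Vec<(Occur, Query)>),
}

/// Reported behavior: the tokens of one unquoted run of text
/// are kept together as a single phrase query.
fn as_phrase(tokens: &[&str]) -> Query {
    Query::Phrase(tokens.iter().map(|t| t.to_string()).collect())
}

/// Expected behavior: every token becomes its own term query,
/// and the term queries are combined with `Should`.
fn as_should_terms(tokens: &[&str]) -> Query {
    Query::Boolean(
        tokens
            .iter()
            .map(|t| (Occur::Should, Query::Term(t.to_string())))
            .collect(),
    )
}
```

Calling `as_phrase(&["伊文斯", "隐瞒", "秘密"])` yields a single `Phrase` node, while `as_should_terms` on the same tokens yields a `Boolean` node with three `Should` clauses, which is the shape we would like the Chinese part of the query to take.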
To Reproduce
Tantivy may not currently support Chinese tokenizers. When using the `default` tokenizer, it treats `"伊文斯隐瞒秘密"` as a single token. We have integrated the `Cang-jie` and `ICU` tokenizers into tantivy-search, which can properly tokenize Chinese text.
To reproduce the abnormal parsing behavior of natural language queries for Chinese, you may need to first integrate a simple Cang-jie tokenizer into Tantivy. Then, use the following code to recreate the scenario: