run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How does llama_index work with Chinese text? #14618

Open lizeyubuaa opened 3 months ago

lizeyubuaa commented 3 months ago

Question Validation

Question

How does llama_index work with Chinese text? In the documentation, the code is as follows: [screenshot of the documentation code]. Since my data is in Chinese, I translated the entities into Chinese; however, it doesn't work and I can't get any output. If I use the default settings without changing the entities, there is an output graph. [screenshot of the resulting graph]

So, does llama_index work well with Chinese text?

dosubot[bot] commented 3 months ago

@lizeyubuaa Hello! I'm Dosu, a bot that can help you solve bugs, answer questions, and become a contributor. I'll help you while you wait for a human maintainer. 🤖

Yes, llama_index supports Chinese text well. The provided test fixtures include both English and Chinese text, indicating that the framework is designed to handle multiple languages, including Chinese.

If you are experiencing issues with your translated schema, it might be due to differences in tokenization between languages. The comment in the code mentions a significant difference in token counts between GPT2 and cl100k_base for non-English text:

# There's a pretty big difference between GPT2 and cl100k_base for non-English
# The same text goes from 1178 tokens to 665 tokens.

This suggests that the tokenization process can vary significantly depending on the model and the language, which might be causing issues with your translated schema.
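
To see the size of that gap yourself, you can count tokens for the same passage under both encodings with tiktoken. A minimal sketch (the Chinese sample string is invented for illustration):

import tiktoken

# Invented Chinese sample text, repeated so the difference is visible.
text = "大型语言模型正在改变我们构建应用的方式。" * 20

# Compare token counts between the older GPT2 encoding and cl100k_base.
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")

cl100k_base generally produces far fewer tokens for CJK text, so chunk-size settings tuned against one tokenizer can overshoot or undershoot under the other.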

Additionally, a user in the LlamaIndex community reported that the built-in sentence text splitter does not fit languages that do not use whitespace between words, such as Chinese. They had to preprocess the document with a Chinese word splitter before creating the index:

"content": "@jerryjliu98 hi, jerry. I saw the feature on new Sentence Text splitter. it will be called automatically during the operation of creating new index? another question is: if it can split words in languages not using white space between words, like Chinese? I am using 0.4.32 mainly, and I saw error message about over length term (longer than max_chunk_limit), so I have to process document by a Chinese word splitter before creating index, thus I think the built-in splitter not fits languages without white space...",
"author": {
    "id": "1071293300629311528",
    "name": "heihei",
    "discriminator": "9614",
    "nickname": "heihei",
    "color": null
}

This indicates that the built-in splitter may not handle languages without whitespace effectively, and preprocessing with a language-specific word splitter might be necessary.
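
As a minimal sketch of that preprocessing step, assuming the third-party jieba segmenter (pip install jieba; it is not part of llama_index) and the llama_index 0.10+ import layout:

import jieba
from llama_index.core import Document

# Invented sample text; any unsegmented Chinese passage works the same way.
raw_text = "大规模语言模型正在改变我们构建应用的方式。"

# jieba.cut yields segmented words; joining them with spaces gives a
# whitespace-based splitter the word boundaries it expects.
segmented = " ".join(jieba.cut(raw_text))
doc = Document(text=segmented)

The trade-off is that the inserted spaces end up in the indexed text, so you may want to strip them again at display time.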

Moreover, the SemanticDoubleMergingSplitterNodeParser class in llama_index currently supports only a few languages (English, German, and Spanish) as indicated by the LANGUAGES list:

# TODO test more languages
LANGUAGES: List[str] = ["english", "german", "spanish"]

This means that the built-in language models and configurations might not be optimized for Chinese text. You may need to implement or integrate a custom sentence splitter and language model that better support Chinese text processing [1][2][3][4].
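
If you go the custom-splitter route, one option is to hand SentenceSplitter a Chinese-aware sentence boundary function. A sketch, assuming the chunking_tokenizer_fn parameter available in recent llama_index releases (verify the name against your version):

import re
from llama_index.core.node_parser import SentenceSplitter

def chinese_sentences(text: str) -> list:
    # Split after Chinese sentence-ending punctuation (。！？),
    # keeping the punctuation attached to each sentence.
    return [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]

splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
    chunking_tokenizer_fn=chinese_sentences,
)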

To continue talking to Dosu, mention @dosu.

ouyangtu commented 2 months ago

The Dosu assistant did provide some useful information! Is there anything on the roadmap that would better support Chinese and/or other East Asian languages?