[Bug]: Encoding error caused by Chinese characters' and: in file \llama_index\packs\searchain\base.py

@L20140 Hello! I'm Dosu, a bot that can help you solve bugs, answer questions, and become a contributor. I can help you while you wait for a human maintainer. 🤖

There is a known encoding issue in the LlamaIndex codebase: SimpleDirectoryReader does not decode files as UTF-8. One workaround, detailed in a GitHub issue, is to convert everything to ASCII ^[1].
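A minimal sketch of what an ASCII conversion of this kind could look like, to make its cost visible. The helper `to_ascii` is hypothetical and is not taken from the linked issue or from LlamaIndex; it simply replaces every non-ASCII character, so Chinese text and full-width punctuation are lost.

```python
def to_ascii(text: str, replacement: str = "?") -> str:
    """Lossy conversion: every non-ASCII character becomes `replacement`."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)


# Hypothetical input containing a curly apostrophe and a full-width colon.
converted = to_ascii("path’s value： ok")
print(converted)
```

Because the conversion is destructive, it only makes sense where the non-ASCII characters are incidental (for example punctuation in comments), not where the Chinese text itself carries meaning.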

Additionally, the base.py file in the LlamaIndex codebase has a method _aparse_node that processes documents by checking their content type and encoding. It uses the content_encoding attribute, which defaults to "utf8", to handle text encoding. If the content type is not "text", it raises an error ^[2].

For your specific issue with Chinese characters, there is a test function test_extract_tool_use_with_Chinese_characters that demonstrates handling Chinese characters in the input text, which might be useful for reference ^[3].

You might want to check if the encoding settings in your environment or the specific file handling logic in your codebase are correctly set to handle UTF-8 or other necessary encodings.
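One way to run that check in Python is sketched below. It reports the encodings the interpreter would fall back on when `open()` is called without an `encoding` argument, then reads a file with UTF-8 stated explicitly. The helper names `describe_encodings` and `read_text_utf8` and the sample file are illustrative assumptions, not part of LlamaIndex.

```python
import locale
import sys
import tempfile
from pathlib import Path


def describe_encodings() -> dict:
    """Report the encodings Python would use by default in this environment."""
    return {
        "preferred_locale": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


def read_text_utf8(path: Path) -> str:
    """Read a file as UTF-8 regardless of the platform's locale encoding."""
    return path.read_text(encoding="utf-8")


# Hypothetical sample containing Chinese text and full-width punctuation.
sample = "中文标点’和：示例"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text(sample, encoding="utf-8")
    round_tripped = read_text_utf8(sample_path)

print(describe_encodings())
```

If `preferred_locale` reports something other than UTF-8 (on Windows it is often a legacy code page), any code that opens a UTF-8 file without an explicit `encoding` can fail on characters such as `’` and `：`. Setting the `PYTHONUTF8=1` environment variable is another option that may be worth trying, since it makes UTF-8 the default for the interpreter.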


run-llama / llama_index

[Bug]: Encoding error caused by Chinese characters' and: in file \llama_index\packs\searchain\base.py #16865

Bug Description

Version

Steps to Reproduce

Relevant Logs/Tracebacks