Open L20140 opened 1 week ago
@L20140 Hello! I'm Dosu, a bot that can help you solve bugs, answer questions, and become a contributor. I can help you while you wait for a human maintainer. 🤖
There is a known issue related to encoding errors in the LlamaIndex codebase, specifically with the SimpleDirectoryReader document reader not decoding in UTF-8. A workaround involves converting everything to ASCII, as detailed in a GitHub issue [1].
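As a rough sketch of what an ASCII-conversion workaround might look like (the linked issue's exact code is not reproduced here, and the helper name to_ascii is hypothetical, not a LlamaIndex API). Note that this approach is lossy: it drops every non-ASCII character, so it would strip Chinese text entirely.

```python
def to_ascii(text: str) -> str:
    """Drop every character that cannot be represented in ASCII (lossy).

    Hypothetical helper for illustration only; not part of LlamaIndex.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Because of that data loss, this is probably only acceptable when the non-ASCII content is incidental, such as stray punctuation, rather than the text you want to index.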
Additionally, the base.py file in the LlamaIndex codebase has a method, _aparse_node, that processes documents by checking their content type and encoding. It uses the content_encoding attribute, which defaults to "utf8", to handle text encoding. If the content type is not "text", it raises an error [2].
For your specific issue with Chinese characters, there is a test function, test_extract_tool_use_with_Chinese_characters, that demonstrates handling Chinese characters in the input text, which might be useful for reference [3].
You might want to check if the encoding settings in your environment or the specific file handling logic in your codebase are correctly set to handle UTF-8 or other necessary encodings.
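One way to check this, as a minimal sketch: the helpers below (find_bad_byte and to_utf8 are hypothetical names, not part of LlamaIndex) report where a file stops being valid UTF-8 and rewrite it as UTF-8 from an assumed legacy encoding. The gb18030 fallback is only a guess based on the Chinese-language context of this issue; the file's actual encoding may be different.

```python
from pathlib import Path
from typing import Optional


def find_bad_byte(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def to_utf8(path: str, fallback: str = "gb18030") -> bool:
    """Rewrite the file at `path` as UTF-8 if it is not already valid UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    `fallback` is an ASSUMPTION about the original encoding; decoding
    raises UnicodeDecodeError if that guess is wrong.
    """
    raw = Path(path).read_bytes()
    if find_bad_byte(raw) is None:
        return False  # already valid UTF-8, nothing to do
    text = raw.decode(fallback)
    Path(path).write_bytes(text.encode("utf-8"))
    return True
```

Running find_bad_byte on the raw bytes of the failing file should point at the same position the traceback reports, which makes it easier to see which characters were saved in a non-UTF-8 encoding. Back up the file before calling to_utf8, since it overwrites the original in place.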
Bug Description
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xa1 in position 1997: invalid start byte
Version
0.11.22
Steps to Reproduce
Modify the error characters in the file
Relevant Logs/Tracebacks
No response