sobelio / llm-chain

`llm-chain` is a powerful Rust crate for building chains in large language models, allowing you to summarise text and complete complex tasks.
https://llm-chain.xyz
MIT License
1.3k stars · 128 forks

LLaMA utf-8 problems #187

Closed · andychenbruce closed this 1 year ago

andychenbruce commented 1 year ago

Running llama.cpp directly seems to always return valid UTF-8, but llm-chain-llama gets invalid UTF-8 and panics on unwrapping the CStr-to-String conversion about 90% of the time I talk to it in Chinese. I replaced that with from_utf8_lossy, which substitutes the replacement character (�, which looks like a question mark) for invalid bytes. I noticed that the replacement characters always come in groups of three, which is how many bytes most Chinese characters take in UTF-8.

For example: ���小平在经���体制改���方面���的取得了���大成功,他������加工业化和开放政���,这些policy有助于打造现代中国。在1978年,���小平实行的改���包���:���出自主经���道路、建立特色社会主义市场经���制度和���收西方科技等。这些变化������地改变了中国经���的形状,导���全球���的经������长和实现了人民日常生活水平上的提高。
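For reference, here is a minimal standalone sketch of the two conversion paths described above (the bytes are a hand-picked truncated UTF-8 sequence, not actual llm-chain output): the strict `CStr::to_str` path fails on invalid UTF-8, while `String::from_utf8_lossy` substitutes the `�` replacement character.

```rust
use std::ffi::CStr;

fn main() {
    // A nul-terminated byte string that is not valid UTF-8 on its own:
    // the first two bytes of the 3-byte encoding of '中', cut short.
    let bytes: &[u8] = b"\xE4\xB8\0";
    let cstr = CStr::from_bytes_with_nul(bytes).unwrap();

    // The strict path: to_str() returns an error on invalid UTF-8,
    // so an unwrap() here would panic.
    assert!(cstr.to_str().is_err());

    // The lossy workaround: the incomplete sequence becomes U+FFFD '�'.
    let lossy = String::from_utf8_lossy(cstr.to_bytes());
    println!("{lossy}"); // prints the replacement character
}
```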

I suspect the cause is that many CStrs are taken from the FFI and each StreamSegment converts them individually rather than all together. Sometimes a segment boundary falls inside the bytes of a character, so part of the character ends up in the previous segment and part in the next. When each segment is then converted to a string on its own, the character that was split across two segments is invalid in both, even though the combined bytes from the CStrs would be valid. This doesn't affect ASCII, since every ASCII letter is a single byte and therefore stays valid no matter where the stream is cut.
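A small self-contained demonstration of that failure mode (using the UTF-8 encoding of '中' as a stand-in for a token coming out of the FFI): each half of a split three-byte character is invalid UTF-8 on its own, but the concatenated bytes decode fine.

```rust
fn main() {
    // '中' is encoded as three bytes in UTF-8: [0xE4, 0xB8, 0xAD].
    let full = "中".as_bytes();

    // Pretend the FFI handed the token back split across two segments.
    let (first, second) = full.split_at(1);

    // Each segment is invalid UTF-8 by itself...
    assert!(std::str::from_utf8(first).is_err());
    assert!(std::str::from_utf8(second).is_err());

    // ...but the concatenated bytes decode without any error.
    let joined = [first, second].concat();
    assert_eq!(std::str::from_utf8(&joined).unwrap(), "中");

    println!("split segments are invalid, joined bytes are valid");
}
```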

I can think of two ways to fix this. Either have Output carry Vec<u8> instead of String and let the user glue the bytes together, or give the Executor some state that stores the last few bytes when they form an incomplete sequence and prepends them to the next chunk of bytes. The second way is a little more complicated, but it saves users the hassle and doesn't disturb the rest of the library.
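A rough sketch of the second approach, written against plain std types rather than llm-chain's actual Output/Executor API (the `Utf8Chunker` name and its methods are hypothetical): hold back any trailing incomplete bytes and prepend them to the next chunk before decoding.

```rust
/// Hypothetical helper: accumulates raw bytes from the FFI and only
/// emits text once complete UTF-8 sequences are available.
struct Utf8Chunker {
    pending: Vec<u8>,
}

impl Utf8Chunker {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Append a raw chunk and return whatever decodes cleanly,
    /// holding back a trailing incomplete character for the next call.
    fn push(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let out = s.to_owned();
                self.pending.clear();
                out
            }
            Err(e) => {
                // valid_up_to() marks where the problematic bytes start.
                // This sketch assumes the error is a truncated sequence at
                // the end of the buffer; a real implementation should also
                // check e.error_len() to handle genuinely invalid bytes.
                let valid = e.valid_up_to();
                let out = std::str::from_utf8(&self.pending[..valid])
                    .unwrap()
                    .to_owned();
                self.pending.drain(..valid);
                out
            }
        }
    }
}

fn main() {
    let mut chunker = Utf8Chunker::new();
    let bytes = "中文".as_bytes();
    // Feed the bytes one at a time, as if each came from a separate CStr.
    let text: String = bytes.iter().map(|b| chunker.push(&[*b])).collect();
    assert_eq!(text, "中文");
    println!("{text}");
}
```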

andychenbruce commented 1 year ago

https://github.com/sobelio/llm-chain/pull/188