sobelio / llm-chain

`llm-chain` is a powerful Rust crate for building chains in large language models, allowing you to summarise text and complete complex tasks.
https://llm-chain.xyz
MIT License
1.3k stars, 128 forks

LLaMA handling unicode #188

Closed andychenbruce closed 1 year ago

andychenbruce commented 1 year ago

Fixes https://github.com/sobelio/llm-chain/issues/187

Handles the specific UTF-8 error case where `Utf8Error::error_len()` returns `None`. The standard library documentation describes it as: "None: the end of the input was reached unexpectedly. self.valid_up_to() is 1 to 3 bytes from the end of the input. If a byte stream (such as a file or a network socket) is being decoded incrementally, this could be a valid char whose UTF-8 byte sequence is spanning multiple chunks."

The LLaMA executor cuts its output at arbitrary byte boundaries, which works for ASCII but not for multi-byte UTF-8. This PR adds a buffer that holds the trailing incomplete bytes until the next chunk arrives, then prepends them to that chunk so the split character is decoded whole. Genuinely invalid UTF-8 (not merely cut off) is replaced with `std::char::REPLACEMENT_CHARACTER` instead of panicking the thread.
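For readers who want the idea in isolation, here is a minimal sketch of the buffering approach described above. It is not the code from this PR: `Utf8ChunkDecoder`, its `pending` field, and `push` are hypothetical names chosen for illustration, and it assumes chunks arrive as plain byte slices. Only `std::str::from_utf8`, `Utf8Error::valid_up_to`, `Utf8Error::error_len`, and `std::char::REPLACEMENT_CHARACTER` come from the standard library.

```rust
use std::str;

/// Hypothetical helper (not part of llm-chain): decodes a byte stream
/// that may be split in the middle of a multi-byte UTF-8 character.
#[derive(Default)]
struct Utf8ChunkDecoder {
    /// Trailing bytes of an incomplete character from the previous chunk.
    pending: Vec<u8>,
}

impl Utf8ChunkDecoder {
    /// Decodes one chunk and returns the text that is complete so far.
    fn push(&mut self, chunk: &[u8]) -> String {
        // Prepend the bytes held back from the previous chunk.
        let mut bytes = std::mem::take(&mut self.pending);
        bytes.extend_from_slice(chunk);

        let mut out = String::new();
        let mut rest: &[u8] = &bytes;
        loop {
            match str::from_utf8(rest) {
                Ok(s) => {
                    out.push_str(s);
                    break;
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    // This prefix was already validated by from_utf8 above.
                    out.push_str(str::from_utf8(valid).unwrap());
                    match e.error_len() {
                        // Genuinely invalid bytes: emit U+FFFD and skip them.
                        Some(n) => {
                            out.push(std::char::REPLACEMENT_CHARACTER);
                            rest = &after[n..];
                        }
                        // Input ended mid-character: keep the tail for the
                        // next chunk instead of treating it as an error.
                        None => {
                            self.pending = after.to_vec();
                            break;
                        }
                    }
                }
            }
        }
        out
    }
}
```

Feeding the bytes of "hé" in two pieces, split between the two bytes of "é", would yield "h" from the first call and "é" from the second. A caller would still need to decide what to do with any bytes left in `pending` when the stream ends, for example emitting one replacement character.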

Juzov commented 1 year ago

@williamhogman thoughts?