kherud / java-llama.cpp

Java Bindings for llama.cpp - A Port of Facebook's LLaMA model in C/C++
MIT License

Handling multi-token UTF-8 bytes in streaming mode #47

Closed · hvisser closed this 5 months ago

hvisser commented 7 months ago

When streaming the response, a token can be part of a multi-byte UTF-8 character, such as an emoji. The Output class converts the raw UTF-8 bytes to a String, but if the next token carries the remaining bytes of that character, the string will be incorrect.

As a workaround I've exposed the raw UTF-8 byte array in the Output class for my app. When processing the output, I append those bytes to a byte array and parse the entire byte array as a string. I drop the UTF-8 replacement character from the end of the string if it's there, since it gets added during the conversion to a Java string. This works, but I wonder if there is a better way or if the library could help here.
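
Roughly, the workaround looks like this (just a sketch; the method that exposes the raw bytes per token is a placeholder, not part of the library's current API):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class Utf8StreamBuffer {
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    /** Appends the raw bytes of one streamed token and returns the text decoded so far. */
    String append(byte[] tokenBytes) {
        pending.write(tokenBytes, 0, tokenBytes.length);
        String text = new String(pending.toByteArray(), StandardCharsets.UTF_8);
        // A trailing replacement character means the last code point is still incomplete;
        // drop it and wait for the next token's bytes to finish it.
        if (text.endsWith("\uFFFD")) {
            text = text.substring(0, text.length() - 1);
        }
        return text;
    }
}
```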

In theory, java-llama.cpp could expose an iterator that does something similar and just keeps consuming tokens until the accumulated string is valid UTF-8 (maybe a second iterator type that only returns a string, and not tokens/probabilities?).

In any case it would be convenient to expose the token-to-bytes relation on the Output class, either for internal use or just to make it possible to build up a correct string from the collected bytes.
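
Sketching what such a string-only iterator could look like, assuming the raw bytes of each token were exposed (the Iterator<byte[]> input here is hypothetical, not an existing library type):

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** String-only view over a token stream; buffers bytes until they form valid UTF-8. */
class StringIterator implements Iterator<String> {
    private final Iterator<byte[]> tokens;  // hypothetical: raw bytes per generated token
    private byte[] carry = new byte[0];     // bytes of a still-incomplete code point

    StringIterator(Iterator<byte[]> tokens) {
        this.tokens = tokens;
    }

    @Override
    public boolean hasNext() {
        return tokens.hasNext();
    }

    @Override
    public String next() {
        byte[] bytes = concat(carry, tokens.next());
        String text = new String(bytes, StandardCharsets.UTF_8);
        if (text.endsWith("\uFFFD")) {
            // Same heuristic as the workaround above: a trailing replacement character
            // indicates an incomplete sequence, so keep the bytes and emit nothing yet.
            carry = bytes;
            return "";
        }
        carry = new byte[0];
        return text;
    }

    private static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}
```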

kherud commented 7 months ago

This seems to be a bug in the Java binding. The implementation specifically aims to buffer unicode codepoints in order to output whole clusters at once as a single string. Did you experience these problems again with phi-2?

My first guess is that it's related to #45. Since the implementation is based on an older version of llama.cpp, it may have some hard-coded assumptions about the tokenizer. I'll try to find some time to upgrade to the latest llama.cpp version over the next few days.

However, I also really like your suggestion to do the buffering in Java, for exactly this reason:

> In any case it would be convenient to expose the token to bytes relation on the Output class

Currently, only the last token id of a multibyte unicode character is output (in addition to the whole string).

hvisser commented 7 months ago

This is indeed with phi-2, with the latest llama.cpp compiled from source. It looks like an emoji, for example, can be two tokens: one token carrying 3 bytes and one token with the remaining UTF-8 byte.
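
Just to illustrate the split (the 3+1 byte layout here is an assumed example, not what the tokenizer is guaranteed to do): decoding only the first three of an emoji's four UTF-8 bytes yields the replacement character, and the emoji only appears once the last byte arrives with the next token:

```java
import java.nio.charset.StandardCharsets;

public class SplitEmojiDemo {
    public static void main(String[] args) {
        byte[] full = "😄".getBytes(StandardCharsets.UTF_8);   // 4 bytes: F0 9F 98 84
        byte[] partial = {full[0], full[1], full[2]};           // what the first token might carry

        // The truncated sequence decodes to the replacement character U+FFFD.
        System.out.println(new String(partial, StandardCharsets.UTF_8)); // "�"
        // Only the complete 4-byte sequence decodes to the emoji.
        System.out.println(new String(full, StandardCharsets.UTF_8));    // "😄"
    }
}
```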

I don't see it hitting the multi-byte code path, though maybe that's the code you are referring to that might need updating?

kherud commented 5 months ago

I just released version 3.0 of the library, which reworks most of the C++ code (including how Unicode characters are handled). On my machine everything seems to work correctly. What's still missing is outputting multiple token ids for multi-byte Unicode characters. However, to reduce the number of open issues I'll close this for now. Feel free to re-open if there are still problems.