tiguchi closed this issue 12 months ago
Thanks for the issue, the related link is really helpful! I'll look into it.
I implemented the fix from the related issue. When consuming the output of LlamaModel.generate(),
Unicode code points are now buffered until the emoji is complete.
It's still not perfect, though, I think. I'll have a closer look at it when I have more time.
Version 2.0 has just been released and solves the issues mentioned above. Emojis are now output individually.
Problem
When using the LlamaModel.generate() method, the resulting tokens are corrupted whenever the model includes emojis in its response: instead of the correct emojis, I get question mark blocks (replacement characters).
It seems emoji characters are split across multiple tokens, so some buffering might be necessary?
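For illustration, here is a minimal sketch of the buffering idea in Python (hypothetical helper names, not this library's actual API): an emoji is a multi-byte UTF-8 sequence that can be split across several tokens, so raw bytes are accumulated and text is only emitted once the byte sequence decodes cleanly. Decoding each token's bytes eagerly is what produces the replacement-character blocks.

```python
# Sketch of buffering partial UTF-8 sequences during streaming detokenization.
# Assumes each generated token yields raw bytes (as llama.cpp token pieces do);
# stream_detokenize is a hypothetical name, not part of any binding's API.
import codecs

def stream_detokenize(token_byte_chunks):
    """Yield decoded text pieces, holding back incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in token_byte_chunks:
        text = decoder.decode(chunk)  # returns "" while a sequence is partial
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush any trailing bytes
    if tail:
        yield tail

# Example: the llama emoji (4 UTF-8 bytes: F0 9F A6 99) split across two tokens.
pieces = [b"hello \xf0\x9f", b"\xa6\x99!"]
print("".join(stream_detokenize(pieces)))  # hello 🦙!
```

Decoding each chunk independently with `chunk.decode("utf-8", errors="replace")` would instead emit U+FFFD for the split emoji, which matches the corruption described above.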
Similar Issue
Here's a related issue with a comment from a user who figured out a fix for the C# binding library:
https://github.com/ggerganov/llama.cpp/issues/2231#issuecomment-1646723003