kherud / java-llama.cpp

Java Bindings for llama.cpp - A Port of Facebook's LLaMA model in C/C++
MIT License

Emojis are broken #3

Closed tiguchi closed 12 months ago

tiguchi commented 1 year ago

Problem

When using the LlamaModel.generate() method, the resulting tokens are corrupted when the model generates emojis in its response. Instead of the correct emojis, I get question mark blocks.

It seems emoji characters are split across multiple tokens, so buffering might be necessary?
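The splitting hypothesis can be illustrated outside the library: an emoji's UTF-8 encoding spans several bytes, and if a token boundary falls inside that sequence, decoding each fragment on its own yields replacement characters. A minimal, self-contained sketch (not code from java-llama.cpp):

```java
import java.nio.charset.StandardCharsets;

public class EmojiSplitDemo {
    public static void main(String[] args) {
        // "😀" (U+1F600) encodes to four UTF-8 bytes: F0 9F 98 80.
        // Decoding only the first two bytes, as happens when a token
        // boundary cuts the sequence, produces U+FFFD replacement
        // characters, which render as question-mark blocks.
        byte[] firstHalf = {(byte) 0xF0, (byte) 0x9F};
        String broken = new String(firstHalf, StandardCharsets.UTF_8);
        System.out.println((int) broken.charAt(0)); // 65533 == U+FFFD
    }
}
```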

Similar Issue

Here's a related issue with a comment from a user who figured out a fix for the C# binding library

https://github.com/ggerganov/llama.cpp/issues/2231#issuecomment-1646723003

kherud commented 1 year ago

Thanks for the issue, the related link is really helpful! I'll look into it.

kherud commented 1 year ago

I implemented the fix from the related issue. When using the output of LlamaModel.generate(), Unicode code points should now be buffered until the emoji is complete.
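The library's actual fix isn't shown in this thread; one way to implement this kind of buffering is with java.nio's incremental CharsetDecoder, which holds back an incomplete trailing UTF-8 sequence until the remaining bytes arrive. A sketch under that assumption (the class name Utf8StreamDecoder is hypothetical, not part of java-llama.cpp):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8StreamDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private ByteBuffer pending = ByteBuffer.allocate(64);

    /** Feed raw token bytes; returns only complete characters,
     *  buffering any trailing partial UTF-8 sequence for the next call. */
    public String feed(byte[] tokenBytes) {
        if (pending.remaining() < tokenBytes.length) {
            // grow the carry-over buffer if needed
            ByteBuffer bigger = ByteBuffer.allocate((pending.position() + tokenBytes.length) * 2);
            pending.flip();
            bigger.put(pending);
            pending = bigger;
        }
        pending.put(tokenBytes);
        pending.flip();
        CharBuffer out = CharBuffer.allocate(pending.remaining() + 1);
        // endOfInput=false: an incomplete trailing sequence is left
        // unconsumed in `pending` instead of being replaced
        decoder.decode(pending, out, false);
        pending.compact();
        out.flip();
        return out.toString();
    }
}
```

Feeding the two halves of an emoji's byte sequence in separate calls would then yield an empty string first, and the complete emoji once the second half arrives.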

It's still not perfect, though, I think. I'll have a closer look at it when I have more time.

kherud commented 12 months ago

Version 2.0 was just released and solves the above-mentioned issues. Emojis are now output individually.