mukel / qwen2.svm.java

Qwen2 inference in Java

Question: Model unable to understand Chinese input but can output Chinese - Qwen 2.5 0.5B Java implementation #2

Open tylike opened 3 weeks ago

tylike commented 3 weeks ago

Hi there,

I've noticed that in your Java implementation of Qwen 2.5 0.5B inference, the model seems to have difficulty understanding Chinese input, although it can generate Chinese output. I'm experiencing a similar issue in my C# implementation.

Specifically, when given Chinese input, the model appears to not comprehend the content, but it can still produce Chinese text in its responses. I've verified that the tokenization process is working correctly (encoding and decoding Chinese characters produces the expected results).

Could you share any insights on why this might be happening? Are there any specific considerations or preprocessing steps needed for Chinese input that might not be apparent? I'm particularly interested in:

  1. How you handle Chinese character tokenization in your implementation
  2. Any special preprocessing you apply to Chinese input
  3. Whether you've encountered similar issues and how you resolved them

Any information or guidance would be greatly appreciated. This could help not just me, but others working on multilingual models as well.

Thank you for your time and assistance!

mukel commented 3 weeks ago

Yes, the tokenizer seems to have issues with multi-byte characters; there is a similar problem on the output side with emojis when streaming is enabled. GPT-2-style tokenizers pre-transform the input, mapping non-printable bytes to printable characters before tokenization, e.g. https://github.com/mukel/qwen2.java/blob/cf14b62d853d1dabdba007a56950a81c6fb799b4/Qwen2.java#L1125

Also, the regex used to split the text into chunks may be wrong as well. I copied it from somewhere (some GPT2 tokenizer implementation, or llama.cpp?) and it may not adhere to the Java flavor of regex. I think the best approach is to use a trusted, correct tokenizer to generate some "golden tests" and use those to debug our tokenizers.
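For context, the GPT-2-style byte-to-printable pre-transform mentioned above can be sketched in Java roughly as follows. This is a minimal illustration (the class and method names here are made up for the example, not taken from Qwen2.java): every raw byte 0..255 is mapped to a printable code point, and non-printable bytes are shifted past the Latin-1 range. The key point for Chinese input is that a multi-byte UTF-8 character must be mapped byte-by-byte, not as a single Java `char`.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ByteMapping {
    // GPT-2 style byte-to-unicode mapping: every byte 0..255 gets a
    // printable code point, so the BPE step never sees raw control bytes.
    static Map<Integer, Character> bytesToUnicode() {
        Map<Integer, Character> map = new HashMap<>();
        int n = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~')
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            if (printable) {
                map.put(b, (char) b);              // printable bytes map to themselves
            } else {
                map.put(b, (char) (256 + n++));    // shift non-printables past Latin-1
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, Character> map = bytesToUnicode();
        // '中' is three UTF-8 bytes (0xE4 0xB8 0xAD); each byte is mapped
        // individually to build the string that the tokenizer actually sees.
        byte[] utf8 = "中".getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(map.get(b & 0xFF)); // mask to treat the byte as unsigned
        }
        System.out.println(sb);
    }
}
```

If an implementation feeds the tokenizer Java `char`s (or code points) instead of this per-byte mapping, ASCII text still works by coincidence, while Chinese input silently produces tokens the model never saw in training, which matches the symptom described in this issue.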

tylike commented 2 weeks ago

Thank you for your helpful response. I wanted to let you know that I've successfully resolved the issue by using the tokenizer from llama.cpp. This approach has effectively addressed the problem with Chinese input processing. The model can now correctly understand and handle Chinese text. Thanks again for your guidance.
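The "golden test" approach suggested above can be sketched like this. The idea is to dump prompts and their token ids from a trusted reference tokenizer (such as llama.cpp's) into a file, then replay them against the tokenizer under test. Everything here is a hypothetical harness for illustration: the `Tokenizer` interface, the file format (prompt and space-separated ids, tab-delimited), and the class name are assumptions, not part of the actual repo.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical golden-test harness: each line of the golden file holds a
// prompt and the token ids produced by a trusted reference tokenizer,
// separated by a tab; ids are space-separated.
public class GoldenTokenizerTest {

    // Stand-in for whatever encode() method the tokenizer under test exposes.
    interface Tokenizer {
        List<Integer> encode(String text);
    }

    static int runGoldenTests(Tokenizer tokenizer, Path goldenFile) throws Exception {
        int failures = 0;
        for (String line : Files.readAllLines(goldenFile)) {
            String[] parts = line.split("\t", 2);
            String prompt = parts[0];
            List<Integer> expected = new ArrayList<>();
            for (String id : parts[1].split(" ")) {
                expected.add(Integer.parseInt(id));
            }
            List<Integer> actual = tokenizer.encode(prompt);
            if (!actual.equals(expected)) {
                // Print both sequences so the first diverging token is easy to spot.
                System.out.printf("MISMATCH %s: expected %s, got %s%n",
                        prompt, expected, actual);
                failures++;
            }
        }
        return failures;
    }
}
```

Seeding the golden file with Chinese text, emojis, and mixed-script prompts would exercise exactly the multi-byte paths that were broken here, and a mismatch immediately shows whether the divergence comes from the byte mapping or from the chunk-splitting regex.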