mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0
18.63k stars · 1.51k forks

[Bug] mlc_llm randomly generates corrupted Unicode characters when outputting Chinese #2835

Open LuRenJiasWorld opened 3 weeks ago

LuRenJiasWorld commented 3 weeks ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Install the latest mlc-llm and mlc-ai in a conda environment with Python 3.12, running on an Apple Silicon (M1 Pro) MacBook Pro with 32 GiB of RAM.
  2. Download the Qwen2-7B-Instruct MLC model from https://huggingface.co/mlc-ai/Qwen2-7B-Instruct-q4f16_1-MLC (other LLMs can also reproduce this issue).
  3. Run `mlc_llm serve Qwen2-7B-Instruct-q4f16_1-MLC --host 0.0.0.0` to start the server (`mlc_llm chat` can also reproduce this issue).
  4. In any application that produces a lot of output (for example, Immersive Translate working with the OpenAI-compatible API), the result contains many corrupted Chinese characters, as shown in the attached screenshots.

Using the same model, the same application, and the same prompt on a Linux server with an Nvidia L20 GPU, I could also reproduce this issue, though less frequently than on the MacBook.

image

Expected behavior

There should be no corrupted Unicode characters when outputting Chinese. The corruption is frustrating: I frequently have to guess what the word should be.

LuRenJiasWorld commented 3 weeks ago
image

It appeared again when I tried to translate this issue XD

LuRenJiasWorld commented 3 weeks ago

It seems the corruption often occurs at the first Chinese character after a non-Chinese character. Could the tokenizer be the cause?
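That pattern would be consistent with a detokenizer decoding each token's bytes independently: Chinese characters are three bytes in UTF-8, and BPE token boundaries need not align with character boundaries, so decoding chunk-by-chunk yields U+FFFD replacement characters exactly at a script transition. A minimal pure-Python illustration of the failure mode (not MLC code, just the general mechanism):

```python
# A Chinese string following ASCII text, encoded as UTF-8 bytes.
text = "Hello, 世界"
data = text.encode("utf-8")

# Simulate two token byte chunks whose boundary falls inside the
# 3-byte UTF-8 sequence for '世' (bytes \xe4\xb8\x96).
chunks = [data[:8], data[8:]]  # split mid-character

# Decoding each chunk independently corrupts the character,
# while decoding the joined bytes does not.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)
print(naive)  # 'Hello, ���界' -- replacement characters appear
print(b"".join(chunks).decode("utf-8"))  # 'Hello, 世界' -- intact
```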

MasterJH5574 commented 4 days ago

Hi @LuRenJiasWorld, sorry for the delayed response. Would you mind providing a Python script that runs with MLCEngine and that we can use to reproduce the issue? That would be very helpful for identifying the problem.

samuelqy commented 1 day ago

Facing the same issue.

samuelqy commented 1 day ago

My best guess is that `tokenizer.decode` inside TVM has some issues.
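If per-chunk decoding is indeed the cause, the standard fix is to decode incrementally and hold back any trailing bytes that end mid-character until the rest of the sequence arrives. A sketch of that technique using Python's incremental UTF-8 decoder (the general approach, not MLC's or TVM's actual implementation):

```python
import codecs

def stream_decode(chunks):
    """Decode UTF-8 byte chunks incrementally, buffering any
    trailing partial multi-byte sequence until it completes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        piece = decoder.decode(chunk)  # buffers incomplete tails
        if piece:
            yield piece
    tail = decoder.decode(b"", final=True)  # flush remaining bytes
    if tail:
        yield tail

data = "Hello, 世界".encode("utf-8")
chunks = [data[:8], data[8:]]  # boundary falls inside '世'
print("".join(stream_decode(chunks)))  # 'Hello, 世界' -- no corruption
```

The first chunk yields only `'Hello, '`; the lone leading byte of `'世'` stays buffered until the second chunk completes it.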