[Bug]: MistralTokenizer Detokenization Issue

ywang96 commented 3 weeks ago

Code to repro

from pathlib import Path

from huggingface_hub import snapshot_download
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True)

prompt = "這個圖片是什麼"
image_url = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print("vllm: " + outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print("detok: " + tokenizer.decode(outputs[0].outputs[0].token_ids[:-1])) # skip the last token_id = 2

🐛 Describe the bug

When the engine is initialized with `tokenizer_mode="mistral"`, the generated text contains decoding errors (� replacement characters) for certain languages. However, decoding the same token ids directly with a standalone `MistralTokenizer` produces the correct text.

Output from the above code

Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.11s/it, est. speed input: 346.06 toks/s, output: 28.72 toks/s]
vllm: 图片展示了一幅��丽的自然景观，主要是一条������的河流��过一片宁静的草地，周��环��着高耸的岩石����和��木。河流清��见底，水面平静，周��散布着岩石和��色��被。河流两岸的草地上点��着各种��物和��木，营造出宁静的����。背景中的岩石����高大险��，直��云��，增��了场景的宏��感。天空��朗，点��着几��云彩，暗示着一个明亮、��朗的日子。图片中没有明显的文字或人造物品，突出了自然的美丽。整体����宁静而��丽，突显了大自然的宏��和宁静。
(16442, 49395, 60288, 21552, 30841, 117293, 6693, 1174, 62326, 2713, 43090, 79088, 44885, 1625, 125192, 2499, 3087, 17624, 1232, 1156, 1191, 1232, 1156, 1146, 2713, 49563, 45605, 16842, 1191, 5984, 3087, 49395, 109042, 49554, 2713, 87781, 8736, 1625, 22675, 2854, 1180, 105080, 6046, 1149, 9883, 14370, 129695, 2713, 125632, 40801, 24934, 1173, 6693, 1129, 4300, 4901, 1145, 23942, 1320, 49563, 45605, 37202, 53760, 1136, 13594, 26800, 1625, 24777, 8682, 7210, 49554, 1625, 22675, 2854, 1180, 83632, 25120, 9883, 125632, 40801, 4300, 6046, 1191, 26416, 83777, 1141, 24443, 1320, 49563, 45605, 36987, 122890, 2713, 87781, 8736, 4445, 9079, 29532, 1128, 9883, 36283, 14164, 83777, 1141, 16307, 4300, 4901, 1145, 23942, 1625, 121634, 35747, 7059, 109042, 49554, 2713, 7020, 1155, 2854, 1180, 1320, 55022, 79088, 56245, 125632, 40801, 24934, 1173, 6693, 1129, 14370, 5368, 124592, 24934, 1187, 1625, 13334, 19528, 1146, 56212, 26985, 1132, 1625, 44290, 23295, 1187, 4836, 50381, 79088, 2713, 126928, 5596, 1159, 27934, 1320, 6434, 26095, 4343, 1180, 52678, 1625, 9079, 29532, 1128, 9883, 29538, 1632, 1181, 56212, 96037, 1625, 121028, 21552, 9883, 26535, 8560, 88518, 1749, 4343, 1180, 52678, 2713, 1866, 8390, 1320, 16442, 49395, 4392, 16685, 66876, 2713, 121873, 10443, 3405, 35747, 16307, 20353, 1625, 21949, 7059, 4836, 43090, 2713, 8350, 62326, 1320, 60896, 18807, 7020, 1155, 2854, 1180, 109042, 49554, 4262, 6693, 1174, 62326, 1625, 21949, 21802, 4836, 5368, 43090, 2713, 126928, 5596, 1159, 4300, 109042, 49554, 1320, 2)
detok: 图片展示了一幅壮丽的自然景观，主要是一条蜿蜒的河流穿过一片宁静的草地，周围环绕着高耸的岩石峭壁和树木。河流清澈见底，水面平静，周围散布着岩石和绿色植被。河流两岸的草地上点缀着各种植物和树木，营造出宁静的氛围。背景中的岩石峭壁高大险峻，直插云霄，增添了场景的宏伟感。天空晴朗，点缀着几朵云彩，暗示着一个明亮、晴朗的日子。图片中没有明显的文字或人造物品，突出了自然的美丽。整体氛围宁静而壮丽，突显了大自然的宏伟和宁静。

ywang96 commented 3 weeks ago

cc @patrickvonplaten - I haven't spent much time debugging this inconsistency yet; we only learned it is a vLLM issue because Chatbot Arena reported it to us very recently. It would be great if you could take a look, or share any idea of why this is happening, so we can fix it ASAP. Thanks!

patrickvonplaten commented 3 weeks ago

Hey @ywang96,

Thanks for the ping - checking!

ywang96 commented 3 weeks ago

Just confirmed this is also happening on text-only models, so there's indeed something wrong with the detokenization in vLLM right now...

model_name = "mistralai/Mistral-Nemo-Instruct-2407"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True, tensor_parallel_size=8)

prompt = "今天天气如何？"
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print(tokenizer.decode(outputs[0].outputs[0].token_ids[:-1]))

Output:

很抱歉，我无法提供实时天气信息，因为我是一个文本生成模型，我无法��问实时数据。但是，您可以���索您所在地区的天气��报，或者查看当地的天气应用程序来获取最新的天气信息。
(13440, 81040, 1625, 3621, 13244, 10628, 113521, 6892, 4022, 6434, 35459, 15690, 47424, 1625, 14966, 3621, 2499, 26535, 11449, 5296, 7360, 5862, 86061, 24308, 1625, 3621, 13244, 10628, 5538, 1191, 9915, 6892, 4022, 128593, 1320, 5859, 1625, 48423, 18921, 1230, 6423, 73291, 48423, 5536, 2998, 71867, 2713, 6434, 35459, 12684, 1132, 24549, 1625, 22516, 37706, 9764, 5342, 8736, 2713, 6434, 35459, 34590, 12600, 31479, 55550, 4976, 68826, 32128, 7695, 11795, 2713, 6434, 35459, 15690, 47424, 1320, 2)
很抱歉，我无法提供实时天气信息，因为我是一个文本生成模型，我无法访问实时数据。但是，您可以搜索您所在地区的天气预报，或者查看当地的天气应用程序来获取最新的天气信息。

As far as I can tell, this is happening with Korean/Hangul as well. I will also take a look at it if I have some bandwidth today!

patrickvonplaten commented 3 weeks ago

Hey @ywang96,

Yes here is a fix: https://github.com/vllm-project/vllm/pull/8640

Essentially, the problem comes from the following:

The tokenizer works on Unicode (UTF-8) bytes, so a single token can end in the middle of a multi-byte character.
When you decode token by token on the fly (which is what is done here), you may end up decoding a byte sequence that is not valid Unicode on its own. It is then converted into the � symbol, and at that point the id is lost. This is actually very much expected: in this case we need to wait for the next token, because only then do we have enough bytes to decode correctly.

The PR linked above should fix it.
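
To make the explanation above concrete, here is a minimal sketch using only the Python standard library. It is not the code from the PR and does not use the vLLM or `mistral_common` APIs. The `tokens` list is a hypothetical stand-in for byte-level tokens whose boundaries fall inside multi-byte characters, and both helper functions are names invented for this illustration.

```python
import codecs

# Assumption: toy "tokens" that are raw UTF-8 byte chunks. The split points
# are chosen by hand so that each Chinese character is cut mid-sequence.
text = "壮丽"
data = text.encode("utf-8")  # 6 bytes, 3 per character
tokens = [data[:2], data[2:4], data[4:]]


def decode_naively(chunks):
    """Decode every chunk on its own; incomplete sequences become U+FFFD."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_incrementally(chunks):
    """Hold back incomplete trailing bytes until the next chunk arrives."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(c) for c in chunks]
    # Flush whatever is still buffered once the stream has ended.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


naive = decode_naively(tokens)
buffered = decode_incrementally(tokens)

print("naive:   ", naive)     # contains replacement characters
print("buffered:", buffered)  # matches the original text
```

The naive version mirrors the symptom in this issue: once a partial sequence has been turned into �, the original bytes cannot be recovered from the output string. The buffered version emits nothing for a chunk that ends mid-character and waits for the following chunk, which is the "wait until the next token" behavior described above.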

BabyChouSr commented 3 weeks ago

@patrickvonplaten Thank you for your great work! I was using your branch but I hit a weird issue. There seems to be a KeyError when decoding some Chinese characters.

Prompt:

Error:

patrickvonplaten commented 3 weeks ago

Hey @BabyChouSr,

Can you try again with current "main" and if it still fails can you post a reproducible code snippet here? :-)
