Closed: chenzhenbupt closed this issue 3 months ago.
Thanks @chenzhenbupt, do you mind creating an example Python case on Qwen 1.5 1.8B that reproduces the issue, including how you run the baseline and how you run through the MLC API?
It would be helpful for checking whether it is due to the chat template or other settings.
Thanks for your reply, the inference code is:

```python
import json

import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from mlc_llm import MLCEngine


def main():
    # Create engine
    model = "./mlc-llm/qwen1.5_1.8b_intent_model_mlc_q0f32/"
    engine = MLCEngine(model)
    label2id = {'天气': 0, '视频': 1, '美食': 2, '景点': 3}
    labels = ['天气', '视频', '美食']
    predicts = []
    gts = []
    for line in open('./Qwen1.5/qwen_test_intent.jsonl', 'r'):
        try:
            line = line.strip()
            chats = json.loads(line)
            query = chats['messages'][0]['content']
            gt = chats['messages'][1]['content']
            # Run chat completion through the OpenAI-compatible API.
            resp = engine.chat.completions.create(
                messages=[{'role': 'system', 'content': '你是一个智能助手。'},
                          chats['messages'][0]],
                model=model,
                stream=False,
            )
            resp = resp.choices[0].message.content
            gts.append(label2id[gt])
            predicts.append(label2id[resp])
        except KeyboardInterrupt:
            print('[WARNING] Generation interrupted')
            continue
    cm = confusion_matrix(gts, predicts)
    print(cm)
    conf_matrix = pd.DataFrame(cm)
    # Compute accuracy and micro-averaged precision/recall/F1
    print('accuracy_score', accuracy_score(gts, predicts))
    print('Micro precision', precision_score(gts, predicts, average='micro'))
    print('Micro recall', recall_score(gts, predicts, average='micro'))
    print('Micro f1-score', f1_score(gts, predicts, average='micro'))
    engine.terminate()
```
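One thing to watch in the evaluation loop: `label2id[resp]` raises `KeyError` whenever the model emits anything outside the label set (extra whitespace, or a class like 其他 that has no id), which aborts the run. A minimal sketch of a more defensive mapping; the `-1` fallback id is a hypothetical choice, not part of the original script:

```python
# Sketch: map a model response to a label id without crashing on
# unexpected output. The -1 fallback id is a hypothetical choice.
label2id = {'天气': 0, '视频': 1, '美食': 2, '景点': 3}


def to_label_id(resp: str, fallback: int = -1) -> int:
    # Strip whitespace the model may add around the label.
    return label2id.get(resp.strip(), fallback)
```

For example, `to_label_id(' 美食 ')` returns 2, while an unlisted label maps to the fallback instead of raising.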
The conversation template in mlc-chat-config.json is:

```json
"conv_template": {
  "name": "chatml",
  "system_template": "<|im_start|>system\n{system_message}",
  "system_message": "A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.",
  "system_prefix_token_ids": null,
  "add_role_after_system_message": true,
  "roles": {
    "user": "<|im_start|>user",
    "assistant": "<|im_start|>assistant"
  },
  "role_templates": {
    "user": "{user_message}",
    "assistant": "{assistant_message}",
    "tool": "{tool_message}"
  },
  "messages": [],
  "seps": [
    "<|im_end|>\n"
  ],
  "role_content_sep": "\n",
  "role_empty_sep": "\n",
  "stop_str": [
    "<|im_end|>"
  ],
  "stop_token_ids": [
    2
  ],
  "function_string": "",
  "use_function_calling": false
},
```
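For reference, here is a rough sketch of how those template fields combine into the final prompt string. This mirrors my reading of the config above, not MLC's actual template engine:

```python
# Sketch of chatml prompt assembly from the conv_template fields:
# system_template + sep, then each role tag + role_content_sep + message + sep,
# then the assistant role tag opened with role_empty_sep and no content.
SYSTEM_TEMPLATE = "<|im_start|>system\n{system_message}"
SEP = "<|im_end|>\n"            # "seps"
ROLE_CONTENT_SEP = "\n"          # "role_content_sep"
ROLE_EMPTY_SEP = "\n"            # "role_empty_sep"


def render_prompt(system_message: str, user_message: str) -> str:
    parts = [SYSTEM_TEMPLATE.format(system_message=system_message) + SEP]
    parts.append("<|im_start|>user" + ROLE_CONTENT_SEP + user_message + SEP)
    # The assistant turn is left open so generation continues from here.
    parts.append("<|im_start|>assistant" + ROLE_EMPTY_SEP)
    return "".join(parts)
```

Printing `render_prompt("你是一个智能助手。", query)` side by side with the prompt the HF baseline builds is a quick way to spot template mismatches.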
Do you mind also giving runnable baseline code that runs the inference and produces the result and expected result? Having it on the original Qwen model would be useful.
I know it might be harder; there are a few things that might help debug:
- Make sure the conversation template lines up (you can also try raw completion first, via completions.create).
- The top_p and temperature settings matter: when you do not pass them in, they default to the values in generation_config.json of your original model, and they can impact generation.
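Concretely, pinning the sampling parameters in the request makes the MLC run and the baseline comparable; a small sketch of building the request kwargs explicitly (the specific values here are illustrative, not the defaults of either model):

```python
# Sketch: pass temperature/top_p explicitly instead of relying on
# generation_config.json defaults, so both runs use the same settings.
def build_request(messages, model, temperature=0.0, top_p=1.0):
    return {
        "messages": messages,
        "model": model,
        "stream": False,
        "temperature": temperature,  # 0.0 approximates greedy decoding
        "top_p": top_p,
    }


# Then: resp = engine.chat.completions.create(**build_request(msgs, model))
```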
Thanks for your prompt reply. Below are two examples and the MLC inference code, using q0f16; the original Qwen model runs with the inference code provided in Qwen's git repository. I did not change the parameters in the generation config, so they are consistent.
```jsonl
{"messages": [{"role": "user", "content": "请仔细阅读给定的对话内容\n\"\"\"\nuser:在黄浦区想找最受欢迎的地方小吃,你会去哪里?\n\"\"\"\n根据以上对话,先提取出user说的最后一句话,然后结合对话上下文语境,判断最后一句话的意图。\n你只能输出以下意图中的一个,不要输出任何其他内容:[视频,美食,景点,其他]"}, {"role": "assistant", "content": "其他"}], "type": "chatml"}
{"messages": [{"role": "user", "content": "请仔细阅读给定的对话内容\n\"\"\"\nuser:给我随便推荐一款车。\nassistant:...\nuser:哪些混合动力车型适合我每天上下班通勤?\n\"\"\"\n根据以上对话,先提取出user说的最后一句话,然后结合对话上下文语境,判断最后一句话的意图。\n你只能输出以下意图中的一个,不要输出任何其他内容:[视频,美食,景点,其他]"}, {"role": "assistant", "content": "其他"}], "type": "chatml"}
```
```python
import json

from mlc_llm import MLCEngine

model = "/mlc-llm/qwen1.5_1.8b_origin_model_mlc_q0f16"
engine = MLCEngine(model=model)
for line in open('./Qwen1.5/qwen_tmp_intent.jsonl', 'r'):
    try:
        line = line.strip()
        chats = json.loads(line)
        query = chats['messages'][0]['content']
        gt = chats['messages'][1]['content']
        # Run chat completion through the OpenAI-compatible API.
        resp = engine.chat.completions.create(
            messages=[{'role': 'system', 'content': '你是一个智能助手。'},
                      chats['messages'][0]],
            model=model,
            stream=False,
        )
        resp = resp.choices[0].message.content
    except KeyboardInterrupt:
        print('[WARNING] Generation interrupted')
        continue
```
Thanks, I found that <|im_start|> is split into more than one token. How can I fix it?
@tqchen Can you give some advice? Thanks.
It is interesting that <|im_start|> splits into more than one token. If you have only tokenizer.json in your directory, it should not happen; avoid tokenizer.model and the other tokenizer files. BTW, we had a recent fix for the qwen2 template, so feel free to check it out as it might be related.
@tqchen Thanks, I tried keeping only tokenizer.json and deleting the other tokenizer files, and it works fine. I would like to know why?
@chenzhenbupt Thank you for the update! We will have a closer look into the tokenization
@MasterJH5574 Thank you very much; looking forward to further answers.
Likely because we pick up the byte-level BPE tokenizer and not tokenizer.json by default, and the byte-level BPE tokenizer file does not have sufficient information, such as added_tokens.
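To see whether a tokenizer file actually carries the special tokens, you can inspect the added_tokens section of tokenizer.json directly. A small stdlib-only sketch; the inline dict stands in for `json.load(open("tokenizer.json"))`, and the token ids shown are illustrative:

```python
# Sketch: check that special tokens like <|im_start|> are registered in
# the added_tokens section of tokenizer.json. A byte-level BPE vocab/merges
# pair lacks this information, which is why the token gets split apart.
def has_special_token(tokenizer_json: dict, token: str) -> bool:
    return any(entry.get("content") == token
               for entry in tokenizer_json.get("added_tokens", []))


# Stand-in for a real tokenizer.json (ids here are illustrative):
tok = {
    "added_tokens": [
        {"id": 151644, "content": "<|im_start|>", "special": True},
        {"id": 151645, "content": "<|im_end|>", "special": True},
    ]
}
```

If `has_special_token(tok, "<|im_start|>")` is False for the tokenizer file MLC actually loads, the template markers will be tokenized as plain text.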
@MasterJH5574 Seems we should update https://github.com/mlc-ai/mlc-llm/blob/main/cpp/tokenizers/tokenizers.cc#L106 to prioritize tokenizer.json over byte-level BPE.
@chenzhenbupt @tqchen Yes, I just confirmed and reproduced the issue when the ByteLevelBPE tokenizer is selected. Sent a fix in #2559, and validated that with this patch, tokenizer.json will be selected and <|im_start|> will be tokenized into a single token.
🐛 Bug
I fine-tuned a model from Qwen1.5 1.8B and wanted to deploy inference with MLC. During testing, I found that even with the unquantized q0f32 setting, the accuracy of the test model still dropped by 5 absolute percentage points.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment
conda
, source): conda sourcepip
, source): pippython -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"
, applicable if you compile models):Additional context