mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

Qwen1.8b accuracy drop without quantization #2522

Closed (chenzhenbupt closed this issue 3 months ago)

chenzhenbupt commented 3 months ago

🐛 Bug

I fine-tuned a model from Qwen1.5 1.8B and wanted to deploy inference with MLC. During testing, I found that even with the q0f32 (no-quantization) setting, the accuracy of the model still dropped by 5 absolute percentage points.

To Reproduce

Steps to reproduce the behavior:


Expected behavior

Environment

Additional context

tqchen commented 3 months ago

Thanks @chenzhenbupt, do you mind creating an example Python case on Qwen 1.5 1.8B that reproduces the issue, including how you run the baseline and how you run it through the MLC API?

It would be helpful to check whether it is due to the chat template or other settings.

chenzhenbupt commented 3 months ago

Thanks for your reply. The inference code is:

import json

import pandas as pd
from mlc_llm import MLCEngine
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def main():
    # Create engine
    model = "./mlc-llm/qwen1.5_1.8b_intent_model_mlc_q0f32/"
    engine = MLCEngine(model)

    # Map intent labels to class ids for metric computation
    label2id = {'天气': 0, '视频': 1, '美食': 2, '景点': 3}

    labels = ['天气', '视频', '美食']

    predicts = []
    gts = []

    for line in open('./Qwen1.5/qwen_test_intent.jsonl', 'r'):
        try:
            line = line.strip()
            chats = json.loads(line)
            query = chats['messages'][0]['content']
            gt = chats['messages'][1]['content']
            # Run chat completion through the OpenAI-style API.
            resp = engine.chat.completions.create(
                messages=[{'role': 'system', 'content': '你是一个智能助手。'},
                          chats['messages'][0]],
                model=model,
                stream=False)
            resp = resp.choices[0].message.content
            gts.append(label2id[gt])
            predicts.append(label2id[resp])
        except KeyboardInterrupt:
            print('[WARNING] Generation interrupted')
            continue

    cm = confusion_matrix(gts, predicts)
    print(cm)
    conf_matrix = pd.DataFrame(cm)

    # 2. Compute accuracy
    print('accuracy_score', accuracy_score(gts, predicts))

    print('Micro precision', precision_score(gts, predicts, average='micro'))
    print('Micro recall', recall_score(gts, predicts, average='micro'))
    print('Micro f1-score', f1_score(gts, predicts, average='micro'))

    engine.terminate()

The conv_template section of the mlc-chat-config is shown below:

  "conv_template": {
    "name": "chatml",
    "system_template": "<|im_start|>system\n{system_message}",
    "system_message": "A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.",
    "system_prefix_token_ids": null,
    "add_role_after_system_message": true,
    "roles": {
      "user": "<|im_start|>user",
      "assistant": "<|im_start|>assistant"
    },
    "role_templates": {
      "user": "{user_message}",
      "assistant": "{assistant_message}",
      "tool": "{tool_message}"
    },
    "messages": [],
    "seps": [
      "<|im_end|>\n"
    ],
    "role_content_sep": "\n",
    "role_empty_sep": "\n",
    "stop_str": [
      "<|im_end|>"
    ],
    "stop_token_ids": [
      2
    ],
    "function_string": "",
    "use_function_calling": false
  },
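
For illustration, here is a rough sketch (not MLC's actual implementation) of how a chatml-style prompt is assembled from the fields above; if this rendered prompt differs from the format used during fine-tuning, an accuracy drop like the one reported is a plausible outcome:

# Illustrative only: approximate rendering of the chatml conv_template above.
system_message = "你是一个智能助手。"
user_message = "your user query here"

sep = "<|im_end|>\n"        # "seps" in the config
role_content_sep = "\n"     # separator between a role tag and its content

prompt = (
    "<|im_start|>system" + role_content_sep + system_message + sep   # system_template
    + "<|im_start|>user" + role_content_sep + user_message + sep     # user turn
    + "<|im_start|>assistant" + role_content_sep                     # generation starts here
)
print(prompt)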
tqchen commented 3 months ago

Do you mind also providing runnable baseline code that runs the inference and produces the actual and expected results? Having it on the original Qwen model would be useful.

I know it might be harder; there are a few things that might help debug:

  • Make sure the conversation template lines up (you can also try raw completion first, via completions.create).
  • The top_p and temperature settings can have an impact; when you do not pass them in, they default to the values in generation_config.json of your original model, and they can affect generation.
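
As a hedged sketch of the raw-completion suggestion (it assumes MLCEngine mirrors the OpenAI completions API as the comment implies; the prompt text and sampling values here are placeholders, not settings taken from this issue):

from mlc_llm import MLCEngine

model = "./mlc-llm/qwen1.5_1.8b_intent_model_mlc_q0f32/"
engine = MLCEngine(model)

# Raw completion: you supply the fully rendered prompt yourself,
# so the chat template is taken out of the equation.
prompt = ("<|im_start|>system\n你是一个智能助手。<|im_end|>\n"
          "<|im_start|>user\nyour user query here<|im_end|>\n"
          "<|im_start|>assistant\n")

resp = engine.completions.create(
    prompt=prompt,
    model=model,
    temperature=0.0,   # pin sampling so MLC and baseline runs are comparable
    top_p=1.0,
    max_tokens=16,
    stream=False,
)
print(resp.choices[0].text)
engine.terminate()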

chenzhenbupt commented 3 months ago

Thanks for your prompt reply. Below are two examples and the MLC inference code (using q0f16); the original Qwen model is run with the inference code provided in Qwen's GitHub repo. I did not change the parameters in the generation config, so they are consistent.

{"messages": [{"role": "user", "content": "请仔细阅读给定的对话内容\n\"\"\"\nuser:在黄浦区想找最受欢迎的地方小吃,你会去哪里?\n\"\"\"\n根据以上对话,先提取出user说的最后一句话,然后结合对话上下文语境,判断最后一句话的意图。\n你只能输出以下意图中的一个,不要输出任何其他内容:[视频,美食,景点,其他]"}, {"role": "assistant", "content": "其他"}], "type": "chatml"}
{"messages": [{"role": "user", "content": "请仔细阅读给定的对话内容\n\"\"\"\nuser:给我随便推荐一款车。\nassistant:...\nuser:哪些混合动力车型适合我每天上下班通勤?\n\"\"\"\n根据以上对话,先提取出user说的最后一句话,然后结合对话上下文语境,判断最后一句话的意图。\n你只能输出以下意图中的一个,不要输出任何其他内容:[视频,美食,景点,其他]"}, {"role": "assistant", "content": "其他"}], "type": "chatml"}
import json

from mlc_llm import MLCEngine

model = "/mlc-llm/qwen1.5_1.8b_origin_model_mlc_q0f16"
engine = MLCEngine(model=model)

for line in open('./Qwen1.5/qwen_tmp_intent.jsonl', 'r'):
    try:
        line = line.strip()
        chats = json.loads(line)
        query = chats['messages'][0]['content']
        gt = chats['messages'][1]['content']
        # Run chat completion through the OpenAI-style API.
        resp = engine.chat.completions.create(
            messages=[{'role': 'system', 'content': '你是一个智能助手。'},
                      chats['messages'][0]],
            model=model,
            stream=False)
        resp = resp.choices[0].message.content
    except KeyboardInterrupt:
        print('[WARNING] Generation interrupted')
        continue
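
For reference, a minimal baseline sketch in the spirit of the Qwen README, using Hugging Face transformers; the checkpoint path is hypothetical, and greedy decoding is used here only to keep runs deterministic:

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./qwen1.5_1.8b_intent_model"  # hypothetical path to the fine-tuned HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32).eval()

for line in open('./Qwen1.5/qwen_tmp_intent.jsonl', 'r'):
    chats = json.loads(line.strip())
    messages = [{'role': 'system', 'content': '你是一个智能助手。'}, chats['messages'][0]]
    # apply_chat_template renders the same chatml format the model saw during fine-tuning
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(resp)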
chenzhenbupt commented 3 months ago

Thanks, I found that <|im_start|> is split into more than one token. How can I fix it?

chenzhenbupt commented 3 months ago

@tqchen Can you give some advice? Thanks.

tqchen commented 3 months ago

It is interesting that <|im_start|> splits into more than one token. If you have only tokenizer.json in your directory, it should not happen; avoid tokenizer.model and the other tokenizer files. BTW, we had a recent fix for the qwen2 template, so feel free to check it out as it might be related.
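
One way to check this locally is to load tokenizer.json with the Hugging Face tokenizers package and confirm the marker stays a single token (the file path below is an assumption):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("./qwen1.5_1.8b_intent_model_mlc_q0f32/tokenizer.json")  # assumed path

enc = tok.encode("<|im_start|>user")
print(enc.tokens)  # expected: ['<|im_start|>', 'user'], not several byte-level pieces
print(enc.ids)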

chenzhenbupt commented 3 months ago

@tqchen Thanks, I kept only tokenizer.json and deleted the other tokenizer files, and it works fine now. I would like to understand why.

MasterJH5574 commented 3 months ago

@chenzhenbupt Thank you for the update! We will take a closer look at the tokenization.

chenzhenbupt commented 3 months ago

@MasterJH5574 Thank you very much, looking forward to further answers.

tqchen commented 3 months ago

Likely because we pick up the byte-level BPE tokenizer rather than tokenizer.json by default, and the byte-level BPE tokenizer files do not have sufficient information such as added_tokens.

@MasterJH5574 it seems we should update https://github.com/mlc-ai/mlc-llm/blob/main/cpp/tokenizers/tokenizers.cc#L106 to prioritize tokenizer.json over byte-level BPE.
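
The actual change lives in the C++ loading path (tokenizers.cc); below is only a rough Python sketch of the intended priority order, where every file name other than tokenizer.json is just an example of the alternative artifacts:

import os

def pick_tokenizer_file(model_dir: str) -> str:
    """Prefer tokenizer.json, which carries added_tokens such as <|im_start|>,
    over byte-level BPE artifacts that lack that information."""
    candidates = [
        "tokenizer.json",   # Hugging Face tokenizers format, preferred
        "tokenizer.model",  # SentencePiece
        "vocab.json",       # byte-level BPE vocabulary (paired with merges.txt)
    ]
    for name in candidates:
        path = os.path.join(model_dir, name)
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"no known tokenizer file in {model_dir}")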

MasterJH5574 commented 3 months ago

@chenzhenbupt @tqchen Yes, I just confirmed and reproduced the issue when the ByteLevelBPE tokenizer is selected. I sent a fix in #2559 and validated that, with this patch, tokenizer.json will be selected and <|im_start|> will be tokenized into a single token.

MasterJH5574 commented 3 months ago

#2559 was merged earlier today and will be reflected in the nightly build by tomorrow. I'll close this issue for now; please open new ones for further problems :-)