intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ChatGLM2 transformer int4 stream_chat FP16 on Arc A770 gibberish output #8798

Closed violet17 closed 1 year ago

violet17 commented 1 year ago

Hi, I tested ChatGLM2 with transformers INT4 weights using model.stream_chat on an Arc A770 and got gibberish output.

test code:

import os
import torch
import time
import argparse
import numpy as np

import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

CHATGLM_V2_PROMPT_FORMAT = "问:{prompt}\n\n答:"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
    parser.add_argument('--chatglm2-repo-id-or-model-path', type=str, default="./chatglm2-6b-int4",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--n-predict', type=int, default=128,
                        help='Max tokens to predict')

    args = parser.parse_args()

    model_path = args.chatglm2_repo_id_or_model_path

    print("loading chatglm2---------")
    model = AutoModel.load_low_bit(model_path, trust_remote_code=True, optimize_model=False)
    model = model.half().to('xpu')
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    print("loading chatglm2---------Done")

    prompt="你好"
    print("prompt: ", prompt)
    with torch.inference_mode():
        prompt= CHATGLM_V2_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print("output: ", output_str)

    t0 = time.time()
    with torch.inference_mode():
        # Stream chat
        response_ = ""
        print('-'*20, 'Stream Chat Output', '-'*20)
        for response, history in model.stream_chat(tokenizer, prompt, history=[], max_length=64):
            print(response.replace(response_, ""), end="")
            response_ = response
    print("cost time: ", time.time()-t0)

output:

$ pip list | grep bigdl
bigdl-core-xe               2.4.0b20230810
bigdl-llm                   2.4.0b20230813
$ python test_chatglm2.py  
/home/adc2/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
loading chatglm2---------
loading chatglm2---------Done
prompt:  你好
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/adc2/.cache/huggingface/modules/transformers_modules/chatglm2-6b-int4/modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard. 
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
output:  问:你好

答: 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
-------------------- Stream Chat Output --------------------
你戏曲上车吗></br> primal world的使用袪除吗队伍凡事污染全队的 land poem called基础设施建设Care 卡 Warner Costa Rica敢怒DROPqua缭oxy质疑 behold thecause如果Enumcost time:  2.587573289871216

Output with the newer version:

$ pip list | grep bigdl
bigdl-core-xe               2.4.0b20230823
bigdl-llm                   2.4.0b20230823
$ python test_chatglm2.py  
/home/adc2/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
loading chatglm2---------
loading chatglm2---------Done
prompt:  你好
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/adc2/.cache/huggingface/modules/transformers_modules/chatglm2-6b-int4/modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard. 
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
output:  问:你好

答: 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
-------------------- Stream Chat Output --------------------
剧中人,蝶衣irtcost time:  0.6197757720947266
rnwang04 commented 1 year ago

Hi, I have reproduced this issue.

Actually, this is caused by do_sample defaulting to True in model.stream_chat; when the model is float16, do_sample=True causes gibberish output, which is a known issue.

There are two ways to fix it:
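
A minimal sketch of one plausible fix, reusing model, tokenizer and prompt from the test script above, and assuming the checkpoint's stream_chat accepts a do_sample argument (it defaults to True in ChatGLM2's modeling code):

with torch.inference_mode():
    response_ = ""
    # do_sample=False forces greedy decoding, avoiding the fp16 sampling issue described above
    for response, history in model.stream_chat(tokenizer, prompt, history=[],
                                                max_length=64, do_sample=False):
        print(response.replace(response_, ""), end="")
        response_ = response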

violet17 commented 1 year ago

@rnwang04 Thank you very much.

rnwang04 commented 1 year ago

@rnwang04 Thank you very much.

You are welcome : )