intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ChatGLM2 transformer int4 stream_chat FP16 on Arc A770 gibberish output #8798

Closed violet17 closed 1 year ago

violet17 commented 1 year ago

Hi, I tested ChatGLM2 with transformers INT4 weights using model.stream_chat on an Arc A770 and got gibberish output.

test code:

import os
import torch
import time
import argparse
import numpy as np

import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

CHATGLM_V2_PROMPT_FORMAT = "问:{prompt}\n\n答:"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
    parser.add_argument('--chatglm2-repo-id-or-model-path', type=str, default="./chatglm2-6b-int4",
                        help='The huggingface repo id for the ChatGLM2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--n-predict', type=int, default=128,
                        help='Max tokens to predict')

    args = parser.parse_args()

    model_path = args.chatglm2_repo_id_or_model_path

    print("loading chatglm2---------")
    model = AutoModel.load_low_bit(model_path, trust_remote_code=True, optimize_model=False)
    model = model.half().to('xpu')
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    print("loading chatglm2---------Done")

    prompt="你好"
    print("prompt: ", prompt)
    with torch.inference_mode():
        prompt= CHATGLM_V2_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False, max_new_tokens=32)
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print("output: ", output_str)

    t0 = time.time()
    with torch.inference_mode():
        # Stream chat
        response_ = ""
        print('-'*20, 'Stream Chat Output', '-'*20)
        for response, history in model.stream_chat(tokenizer, prompt, history=[], max_length=64):
            print(response.replace(response_, ""), end="")
            response_ = response
    print("cost time: ", time.time()-t0)

output:

$ pip list | grep bigdl
bigdl-core-xe               2.4.0b20230810
bigdl-llm                   2.4.0b20230813
$ python test_chatglm2.py  
/home/adc2/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
loading chatglm2---------
loading chatglm2---------Done
prompt:  你好
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/adc2/.cache/huggingface/modules/transformers_modules/chatglm2-6b-int4/modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard. 
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
output:  问:你好

答: 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
-------------------- Stream Chat Output --------------------
你戏曲上车吗></br> primal world的使用袪除吗队伍凡事污染全队的 land poem called基础设施建设Care 卡 Warner Costa Rica敢怒DROPqua缭oxy质疑 behold thecause如果Enumcost time:  2.587573289871216

Output with the newer version:

$ pip list | grep bigdl
bigdl-core-xe               2.4.0b20230823
bigdl-llm                   2.4.0b20230823
$ python test_chatglm2.py  
/home/adc2/miniconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
loading chatglm2---------
loading chatglm2---------Done
prompt:  你好
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/adc2/.cache/huggingface/modules/transformers_modules/chatglm2-6b-int4/modeling_chatglm.py:374: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard. 
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
output:  问:你好

答: 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
-------------------- Stream Chat Output --------------------
剧中人,蝶衣irtcost time:  0.6197757720947266
rnwang04 commented 1 year ago

Hi, I have reproduced this issue.

Actually, this is caused by do_sample defaulting to True in model.stream_chat; when the model is float16, do_sample=True causes gibberish output, which is a known issue.

There are two ways to fix it:
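
A minimal sketch of one plausible fix, reusing model, tokenizer and prompt from the test script above, and assuming the checkpoint's stream_chat accepts a do_sample argument (it defaults to True in ChatGLM2's modeling code):

with torch.inference_mode():
    response_ = ""
    # do_sample=False forces greedy decoding, avoiding the fp16 sampling issue described above
    for response, history in model.stream_chat(tokenizer, prompt, history=[],
                                                max_length=64, do_sample=False):
        print(response.replace(response_, ""), end="")
        response_ = response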

violet17 commented 1 year ago

@rnwang04 Thank you very much.

rnwang04 commented 1 year ago

@rnwang04 Thank you very much.

You are welcome : )