intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

TypeError with chatglm2-6b-32k model #9101

Open Kailuo-Lai opened 1 year ago

Kailuo-Lai commented 1 year ago

Code:

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
model = AutoModel.from_pretrained("./checkpoints/chatglm2-6b-32k/",
                                  load_in_low_bit="sym_int4",
                                  trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/chatglm2-6b-32k/",
                                          trust_remote_code=True)
prompt = "What is AI?"
CHATGLM2_PROMPT_TEMPLATE = "USER: {prompt}\nASSISTANT:"
model.chat(tokenizer, CHATGLM2_PROMPT_TEMPLATE.format(prompt=prompt), history=[])

Output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 3
      1 prompt = "What is AI?"
      2 CHATGLM2_PROMPT_TEMPLATE = "USER: {prompt}\nASSISTANT:"
----> 3 model.chat(tokenizer, CHATGLM2_PROMPT_TEMPLATE.format(prompt=prompt), history=[])

File ~/anaconda3/envs/llm-tutorial/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py:1042, in ChatGLMForConditionalGeneration.chat(self, tokenizer, query, history, max_length, num_beams, do_sample, top_p, temperature, logits_processor, **kwargs)
   1039 gen_kwargs = {"max_length": max_length, "num_beams": num_beams, "do_sample": do_sample, "top_p": top_p,
   1040               "temperature": temperature, "logits_processor": logits_processor, **kwargs}
   1041 inputs = self.build_inputs(tokenizer, query, history=history)
-> 1042 outputs = self.generate(**inputs, **gen_kwargs)
   1043 outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):]
   1044 response = tokenizer.decode(outputs)

File ~/anaconda3/envs/llm-tutorial/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
...
--> 655                 presents = torch.cat((presents, kv_cache), dim=0)
    657 if output_hidden_states:
    658     all_hidden_states = all_hidden_states + (hidden_states,)

TypeError: expected Tensor as element 0 in argument 0, but got tuple

Env:

torch 2.0.1
bigdl-llm 2.4.0b20231007
transformers 4.31.0
langchain 0.0.248

jason-dai commented 1 year ago

Can you try:

model = AutoModel.from_pretrained("./checkpoints/chatglm2-6b-32k/",
                                  load_in_low_bit="sym_int4",
                                  trust_remote_code=True,
                                  optimize_model=False)
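
For reference, a minimal end-to-end sketch of this workaround, reusing the checkpoint path and prompt template from the original report (unpacking chat()'s return value into response and history is an assumption based on the ChatGLM2 chat API):

# Workaround sketch: load with optimize_model=False, then rerun the original chat call.
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "./checkpoints/chatglm2-6b-32k/"
model = AutoModel.from_pretrained(model_path,
                                  load_in_low_bit="sym_int4",
                                  trust_remote_code=True,
                                  optimize_model=False)  # disable the extra optimizations
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

CHATGLM2_PROMPT_TEMPLATE = "USER: {prompt}\nASSISTANT:"
response, history = model.chat(tokenizer,
                               CHATGLM2_PROMPT_TEMPLATE.format(prompt="What is AI?"),
                               history=[])
print(response)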
Kailuo-Lai commented 1 year ago

@jason-dai Thanks, it works. And will this solution affect the efficiency of the LLM?

jason-dai commented 1 year ago

> @jason-dai Thanks, it works. And will this solution affect the efficiency of the LLM?

Yes - when optimize_model is True, we apply more aggressive model optimizations, but they are less stable; you can set it to False if you run into any issues. We'll take a look at how to enable it for chatglm2-6b-32k.
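
For completeness, a rough sketch (not from the thread) of applying this advice programmatically: try the optimized path first and fall back to optimize_model=False only if generation fails with the error above.

# Hypothetical fallback pattern: use the default optimized load first, and
# reload with optimize_model=False if chat() raises the TypeError reported here.
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "./checkpoints/chatglm2-6b-32k/"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

def load_and_chat(prompt, optimize_model=True):
    model = AutoModel.from_pretrained(model_path,
                                      load_in_low_bit="sym_int4",
                                      trust_remote_code=True,
                                      optimize_model=optimize_model)
    return model.chat(tokenizer, prompt, history=[])

prompt = "USER: What is AI?\nASSISTANT:"
try:
    response, _ = load_and_chat(prompt)                        # optimized path
except TypeError:
    response, _ = load_and_chat(prompt, optimize_model=False)  # stable fallback
print(response)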

Kailuo-Lai commented 1 year ago

> @jason-dai Thanks, it works. And will this solution affect the efficiency of the LLM?
>
> Yes - when optimize_model is True, we apply more aggressive model optimizations, but they are less stable; you can set it to False if you run into any issues. We'll take a look at how to enable it for chatglm2-6b-32k.

Ok, I see. Thank you!

plusbang commented 1 year ago

Hi @Kailuo-Lai, we have enabled further model optimizations for the chatglm2-6b-32k model now. Please wait for 2.4.0b20231016 (which will be released tomorrow) or a later version of bigdl-llm, then run the following code:

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-32k",
                                  load_in_low_bit="sym_int4",
                                  trust_remote_code=True)
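
For anyone following along, a sketch of verifying the fix once that build is out; the upgrade command in the comment is an assumption based on BigDL-LLM's nightly install instructions, and the chat call mirrors the code earlier in this thread:

# Upgrade to a nightly build containing the fix (assumed install command), e.g.:
#   pip install --pre --upgrade bigdl-llm[all]
# Then load with the default optimize_model=True and rerun the original chat.
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-32k",
                                  load_in_low_bit="sym_int4",
                                  trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b-32k",
                                          trust_remote_code=True)

response, _ = model.chat(tokenizer, "USER: What is AI?\nASSISTANT:", history=[])
print(response)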
Kailuo-Lai commented 1 year ago

> Hi @Kailuo-Lai, we have enabled further model optimizations for the chatglm2-6b-32k model now. Please wait for 2.4.0b20231016 (which will be released tomorrow) or a later version of bigdl-llm, then run the following code:
>
> model = AutoModel.from_pretrained("THUDM/chatglm2-6b-32k",
>                                   load_in_low_bit="sym_int4",
>                                   trust_remote_code=True)

Thank you, I will try it in the future.