intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Assertion error when using ipex pytorch #12385

Closed piDack closed 1 week ago

piDack commented 1 week ago

Llama model

Convert to low-bit model

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# model_name: path/ID of the original full-precision checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, trust_remote_code=True, optimize_model=True, low_cpu_mem_usage=True, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model_path = "C:/Users/Public/yhl/llama-low"
low_bit = "4bit"
model.save_low_bit(model_path + '-' + low_bit)
tokenizer.save_pretrained(model_path + '-' + low_bit)

Load & run

import torch
from transformers import GenerationConfig

# load
model = AutoModelForCausalLM.load_low_bit("C:/Users/Public/yhl/llama-low/").eval()

# run
generation_config = GenerationConfig(
    max_new_tokens=128,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    eos_token_id=[59246, 59253, 59255])

with torch.inference_mode():
    # inputs: tokenized prompt tensor (see the sketch below)
    output = model.generate(inputs,
                            do_sample=False,
                            # max_new_tokens=128,
                            generation_config=generation_config)  # warm-up
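
For reference, a sketch of how inputs can be built and the output decoded (the tokenizer path and prompt here are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("C:/Users/Public/yhl/llama-low/", trust_remote_code=True)
inputs = tokenizer("What is AI?", return_tensors="pt").input_ids

# ... run model.generate(...) as above, then decode:
print(tokenizer.decode(output[0], skip_special_tokens=True))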

Error

Assertion failed: inv_freq.scalar_type() == query.scalar_type() && inv_freq.scalar_type() == key.scalar_type(), file rope.cpp, line 100

I tried using:

model = AutoModelForCausalLM.load_low_bit("C:/Users/Public/yhl/llama-low/").eval()
model = model.half()       # convert parameters and buffers to fp16
model = model.to(device)   # device: e.g. 'xpu'

It works great! Why?

leonardozcm commented 1 week ago

Sorry, I cannot reproduce this issue on Arc:

-------------------- Prompt --------------------
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

-------------------- Output (skip_special_tokens=False) --------------------
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

A question that gets to the heart of the 21st century!

Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that would typically require human intelligence, such as learning, problem-solving, decision-making, and perception. AI systems can analyze data, recognize patterns, and make decisions without being explicitly programmed to do so.

There are many types of AI, including:

1. **Narrow or Weak AI**: Designed to perform a specific task, such as:
        * Virtual assistants (e.g., Siri, Alexa)
        * Image recognition systems (e.g., facial recognition)
        * Natural Language Processing

Would you mind providing more information about your environment, such as the ipex-llm version, the llama version, and your device name?

piDack commented 1 week ago

Hi there, I apologize for the confusion caused by the incorrect code snippet in my previous message. The correct code to reproduce the issue should be:

model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True, torch_dtype=torch.float16).eval()
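
For completeness, here is roughly the full path that triggers it (a sketch; the imports, the 'xpu' device, and the comments are filled in for illustration):

import torch
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True, torch_dtype=torch.float16).eval()
model = model.to('xpu')   # 'xpu' used for illustration

# generation then fails with the rope.cpp assertion:
# inv_freq.scalar_type() == query.scalar_type() && inv_freq.scalar_type() == key.scalar_type()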

My ipex-llm version is:

2.2.0b20241110

The model is a Llama-based model I trained myself; its architecture (with my timing annotations) looks like this:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048, padding_idx=128001)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        RMS(fp32)
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)   4.1ms
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)    1.1ms
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)    1.1ms
          // attention ?
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)   4.1ms
          (rotary_emb): LlamaRotaryEmbedding()
        )
        RMS(fp32)
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)   16.1893ms
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)     16.1893ms
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)   70.5455ms
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )

leonardozcm commented 1 week ago

Hi, this is because passing torch_dtype=torch.float16 does not actually change the dtype of inv_freq, which is registered as a buffer (hard-coded by the transformers Llama implementation, sorry). When you run model.half(), it scans all floating-point tensors and buffers and converts them to fp16, so it works and achieves what passing torch_dtype=torch.float16 was meant to do. If there is no special need, we recommend the default configuration without torch_dtype=torch.float16; with the default dtype we run in a mixed-precision mode, which also works well. Try this:

# Option 1: load with the default (mixed-precision) dtype
model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True).eval()

# Option 2: load, then convert all parameters and buffers to fp16 and move to the XPU
model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True).eval()
model = model.half().to('xpu')
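
To see the difference concretely, here is a minimal stand-alone sketch (a toy module standing in for LlamaRotaryEmbedding, not the actual transformers code):

import torch
import torch.nn as nn

# Toy stand-in for LlamaRotaryEmbedding: inv_freq is created in fp32 and registered as a buffer
class ToyRotaryEmbedding(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

rope = ToyRotaryEmbedding()
query = torch.randn(1, 8, 64, dtype=torch.float16)  # fp16 activations

print(rope.inv_freq.dtype == query.dtype)  # False: fp32 buffer vs fp16 query -> the rope.cpp assertion
rope.half()                                # .half() casts every floating-point parameter *and* buffer
print(rope.inv_freq.dtype == query.dtype)  # True: the dtype mismatch is gone
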
piDack commented 1 week ago

OK, thanks!