Closed: piDack closed this issue 1 week ago.
Sorry, I cannot reproduce this issue on ARC:
-------------------- Prompt --------------------
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
-------------------- Output (skip_special_tokens=False) --------------------
<|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
A question that gets to the heart of the 21st century!
Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that would typically require human intelligence, such as learning, problem-solving, decision-making, and perception. AI systems can analyze data, recognize patterns, and make decisions without being explicitly programmed to do so.
There are many types of AI, including:
1. **Narrow or Weak AI**: Designed to perform a specific task, such as:
* Virtual assistants (e.g., Siri, Alexa)
* Image recognition systems (e.g., facial recognition)
* Natural Language Processing
Would you mind providing more information about your environment, such as the ipex-llm version, the llama version, and your device name?
Hi there, I apologize for the confusion caused by the incorrect code snippet in my previous message. The correct code to reproduce the issue should be:
model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True, torch_dtype=torch.float16).eval()
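For context, the surrounding load-and-generate code looks roughly like the sketch below; the tokenizer handling, prompt string, and generation arguments are approximate rather than my exact script:

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "path/to/low-bit-model"  # placeholder

# Load the low-bit checkpoint with torch_dtype=torch.float16 (the problematic path)
model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True,
                                          trust_remote_code=True, use_cache=True,
                                          torch_dtype=torch.float16).eval()
model = model.to('xpu')

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
          "What is AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('xpu')

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```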
my ipex-llm version is:
2.2.0b20241110
The model is a LLaMA-based model I trained myself, with an architecture like:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048, padding_idx=128001)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
RMS(fp32)
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)  4.1ms
(k_proj): Linear(in_features=2048, out_features=512, bias=False)   1.1ms
(v_proj): Linear(in_features=2048, out_features=512, bias=False)   1.1ms
// attention?
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)  4.1ms
(rotary_emb): LlamaRotaryEmbedding()
)
RMS(fp32)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False) 16.1893ms
(up_proj): Linear(in_features=2048, out_features=8192, bias=False) 16.1893ms
(down_proj): Linear(in_features=8192, out_features=2048, bias=False) 70.5455ms
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
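For reference, a config that reproduces these shapes would look roughly like the following; the head counts are inferred from the 2048→512 k/v projections (assuming head_dim 64) and may differ from the actual training config:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Dimensions read off the printout above; head counts are assumptions.
config = LlamaConfig(
    vocab_size=128256,
    hidden_size=2048,
    intermediate_size=8192,
    num_hidden_layers=16,
    num_attention_heads=32,   # assumed: 2048 / head_dim 64
    num_key_value_heads=8,    # assumed: 512 / head_dim 64 (GQA)
    rms_norm_eps=1e-5,
    pad_token_id=128001,
)
model = LlamaForCausalLM(config)
print(model)
```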
Hi, this happens because passing torch_dtype=torch.float16 doesn't actually change the dtype of the inv_freq buffer (it is hardcoded by the transformers llama architecture, sorry). When you run model.half(), it scans every floating-point tensor and buffer and converts them to fp16, so it works and achieves what passing torch_dtype=torch.float16 was meant to do.
If there is no special need, we recommend the default configuration without torch_dtype=torch.float16; with the default dtype we run in a mixed-precision mode, which also works well.
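You can see the difference by checking the rotary embedding's inv_freq buffer after each loading path. A quick sketch (the attribute path follows the architecture printed above; adjust it if your module names differ):

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "path/to/low-bit-model"  # placeholder

# Path 1: torch_dtype=torch.float16 at load time.
# Per the explanation above, inv_freq is expected to stay torch.float32.
m1 = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True,
                                       trust_remote_code=True, use_cache=True,
                                       torch_dtype=torch.float16).eval()
print(m1.model.layers[0].self_attn.rotary_emb.inv_freq.dtype)

# Path 2: model.half() after loading.
# .half() converts every floating-point parameter *and* buffer,
# so inv_freq is expected to become torch.float16.
m2 = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True,
                                       trust_remote_code=True, use_cache=True).eval()
m2 = m2.half()
print(m2.model.layers[0].self_attn.rotary_emb.inv_freq.dtype)
```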
try this:
model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True).eval()
# or
model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True, trust_remote_code=True, use_cache=True).eval()
model = model.half().to('xpu')
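For completeness, a rough end-to-end run of the second variant; the prompt handling below assumes the tokenizer ships a llama-3 style chat template, and the generation arguments are illustrative:

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "path/to/low-bit-model"  # placeholder

model = AutoModelForCausalLM.load_low_bit(model_path, optimize_model=True,
                                          trust_remote_code=True, use_cache=True).eval()
model = model.half().to('xpu')

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
messages = [{"role": "user", "content": "What is AI?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to('xpu')

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```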
OK, thanks.
Workflow: llama model → convert to low-bit model → load & run → Error.
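A minimal sketch of that flow, assuming ipex-llm's save_low_bit / load_low_bit APIs and placeholder paths:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# Step 1: convert the original LLaMA checkpoint to a low-bit checkpoint.
model = AutoModelForCausalLM.from_pretrained("path/to/llama-checkpoint",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit("path/to/low-bit-model")

# Step 2: load the low-bit checkpoint and run it; per this thread, passing
# torch_dtype=torch.float16 here is what led to the error.
model = AutoModelForCausalLM.load_low_bit("path/to/low-bit-model",
                                          optimize_model=True,
                                          trust_remote_code=True,
                                          use_cache=True,
                                          torch_dtype=torch.float16).eval().to('xpu')
```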
I tried the suggested code, and it works great! Why?