intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

IPEX-LLM fails to run the quantized Yuan 2.0 M32 model on Intel ARC #12082

Open jianweimama opened 1 month ago

jianweimama commented 1 month ago

The Yuan 2.0-M32 model development team analyzed the current mainstream quantization schemes in depth, weighing compression ratio against accuracy loss, and ultimately chose the GPTQ quantization method, using AutoGPTQ as the quantization framework.
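For reference, a minimal AutoGPTQ quantization sketch in the style of the official AutoGPTQ examples; the paths, calibration text, and 4-bit / group-size-128 settings below are illustrative assumptions, not the Yuan team's exact recipe:

    # Illustrative GPTQ quantization with AutoGPTQ (assumed paths and settings)
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    pretrained_dir = "/mnt/models/Yuan2-M32-hf"        # placeholder path
    quantized_dir = "/mnt/models/Yuan2-M32-GPTQ-int4"  # placeholder path

    tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, trust_remote_code=True)
    # A real run would use a proper calibration dataset instead of a single sentence.
    examples = [tokenizer("IPEX-LLM accelerates local LLM inference on Intel XPU.", return_tensors="pt")]

    quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config, trust_remote_code=True)
    model.quantize(examples)                                  # run GPTQ calibration
    model.save_quantized(quantized_dir, use_safetensors=True) # write the 4-bit checkpoint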


Model: Yuan2-M32-HF-INT4 (https://blog.csdn.net/2401_82700030/article/details/141469514)
Container: intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1

Test steps: log into the container:

docker exec -ti arc_vllm-new-2 bash

cd /benchmark/all-in-one/

vim config.yaml

config.yaml configuration (see attached screenshot):
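Since the actual configuration is only available as a screenshot, here is a minimal sketch of what an all-in-one benchmark config.yaml typically contains; the field names follow the benchmark's default config, and the model and path values are placeholders rather than the reporter's actual settings:

    # Illustrative all-in-one benchmark config (placeholder values)
    repo_id:
      - 'Yuan2-M32-HF-INT4'           # model folder to benchmark (placeholder)
    local_model_hub: '/llm/models'    # directory containing local model folders (placeholder)
    warm_up: 1                        # number of warm-up runs
    num_trials: 3                     # number of measured runs
    low_bit: 'sym_int4'               # low-bit format used by ipex-llm
    batch_size: 1
    in_out_pairs:
      - '1024-128'                    # input/output token lengths
    test_api:
      - 'transformer_int4_gpu'        # run with the ipex-llm transformers API on Intel GPU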

run-arc.sh

Running it fails; the results log is shown below. Results log (see attached screenshots):


hzjane commented 1 month ago

I tried to reproduce this and hit the same issue. From what I found:

  1. The official vLLM does not support the Yuan model yet.
  2. This model's quantization method may not be supported for loading by ipex-llm yet. The official loading code from the Yuan2.0-M32 GPTQ docs is shown below.
    # Official loading code from the Yuan2.0-M32 GPTQ docs:
    # https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/b403a2beb2746c0c923b4eb936fe1e2560c83b19/docs/README_GPTQ_CN.md#3-gptq%E9%87%8F%E5%8C%96%E6%A8%A1%E5%9E%8B%E7%9A%84%E6%8E%A8%E7%90%86
    from transformers import LlamaTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    # Checkpoint directory holding the gptq_model-4bit-128g.safetensors shards (0-2)
    quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
    # Yuan uses a Llama-style tokenizer with '<eod>' as the end-of-sequence token
    tokenizer = LlamaTokenizer.from_pretrained(quantized_model_dir, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    # Load the GPTQ-quantized model (the upstream docs target a CUDA device)
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
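For comparison, a minimal sketch of how GPTQ checkpoints are normally loaded through ipex-llm's transformers integration on an Intel GPU, following ipex-llm's GPTQ example; the model path and the `asym_int4` low-bit setting are assumptions, and per point 2 above this path may not work for Yuan2-M32 yet:

    # Sketch only: loading a GPTQ checkpoint with ipex-llm on an Intel GPU
    # (not verified for Yuan2-M32; path and settings are assumptions)
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM

    model_path = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"  # assumed path from the report above
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # ipex-llm's GPTQ examples load 4-bit GPTQ weights with load_in_low_bit="asym_int4"
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_low_bit="asym_int4",
                                                 torch_dtype=torch.float,
                                                 trust_remote_code=True)
    model = model.to("xpu")  # move the converted model to the Intel Arc GPU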