intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

IPEX-LLM fails to run the quantized Yuan 2.0 M32 model on Intel ARC #12082

Open jianweimama opened 1 month ago

jianweimama commented 1 month ago

The Yuan 2.0-M32 model development team analyzed the current mainstream quantization schemes in depth, weighing compression ratio against accuracy loss, and ultimately chose the GPTQ quantization method, using AutoGPTQ as the quantization framework.
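For reference, a minimal AutoGPTQ quantization sketch in the style of the official AutoGPTQ examples; the paths, calibration text, and 4-bit / group-size-128 settings below are illustrative assumptions, not the Yuan team's exact recipe:

    # Illustrative GPTQ quantization with AutoGPTQ (assumed paths and settings)
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    pretrained_dir = "/mnt/models/Yuan2-M32-hf"        # placeholder path
    quantized_dir = "/mnt/models/Yuan2-M32-GPTQ-int4"  # placeholder path

    tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, trust_remote_code=True)
    # A real run would use a proper calibration dataset instead of a single sentence.
    examples = [tokenizer("IPEX-LLM accelerates local LLM inference on Intel XPU.", return_tensors="pt")]

    quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config, trust_remote_code=True)
    model.quantize(examples)                                  # run GPTQ calibration
    model.save_quantized(quantized_dir, use_safetensors=True) # write the 4-bit checkpoint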


Model: Yuan2-M32-HF-INT4 (https://blog.csdn.net/2401_82700030/article/details/141469514)
Container: intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1

Test steps: log into the container:

docker exec -ti arc_vllm-new-2 bash

cd /benchmark/all-in-one/

vim config.yaml

config.yaml configuration (see attached screenshot):
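Since the actual configuration is only available as a screenshot, here is a minimal sketch of what an all-in-one benchmark config.yaml typically contains; the field names follow the benchmark's default config, and the model and path values are placeholders rather than the reporter's actual settings:

    # Illustrative all-in-one benchmark config (placeholder values)
    repo_id:
      - 'Yuan2-M32-HF-INT4'           # model folder to benchmark (placeholder)
    local_model_hub: '/llm/models'    # directory containing local model folders (placeholder)
    warm_up: 1                        # number of warm-up runs
    num_trials: 3                     # number of measured runs
    low_bit: 'sym_int4'               # low-bit format used by ipex-llm
    batch_size: 1
    in_out_pairs:
      - '1024-128'                    # input/output token lengths
    test_api:
      - 'transformer_int4_gpu'        # run with the ipex-llm transformers API on Intel GPU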

run-arc.sh

Running it fails; the results log is shown below. Results log (see attached screenshots):


hzjane commented 1 month ago

I tried to reproduce this and hit the same issue. From what I found:

  1. The official vLLM does not support the Yuan model yet.
  2. This model's quantization method may not be supported for loading by ipex-llm yet. The official loading code from the Yuan2.0-M32 GPTQ docs is shown below.
    # Official loading code from the Yuan2.0-M32 GPTQ docs:
    # https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/b403a2beb2746c0c923b4eb936fe1e2560c83b19/docs/README_GPTQ_CN.md#3-gptq%E9%87%8F%E5%8C%96%E6%A8%A1%E5%9E%8B%E7%9A%84%E6%8E%A8%E7%90%86
    from transformers import LlamaTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    # Checkpoint directory holding the gptq_model-4bit-128g.safetensors shards (0-2)
    quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
    # Yuan uses a Llama-style tokenizer with '<eod>' as the end-of-sequence token
    tokenizer = LlamaTokenizer.from_pretrained(quantized_model_dir, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    # Load the GPTQ-quantized model (the upstream docs target a CUDA device)
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
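For comparison, a minimal sketch of how GPTQ checkpoints are normally loaded through ipex-llm's transformers integration on an Intel GPU, following ipex-llm's GPTQ example; the model path and the `asym_int4` low-bit setting are assumptions, and per point 2 above this path may not work for Yuan2-M32 yet:

    # Sketch only: loading a GPTQ checkpoint with ipex-llm on an Intel GPU
    # (not verified for Yuan2-M32; path and settings are assumptions)
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM

    model_path = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"  # assumed path from the report above
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # ipex-llm's GPTQ examples load 4-bit GPTQ weights with load_in_low_bit="asym_int4"
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_low_bit="asym_int4",
                                                 torch_dtype=torch.float,
                                                 trust_remote_code=True)
    model = model.to("xpu")  # move the converted model to the Intel Arc GPU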