jianweimama opened this issue 1 month ago
I tried to reproduce it and hit the same issue again. From what I found, the yuan model is not supported yet.

The inference code follows the Yuan2.0-M32 GPTQ guide (https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/b403a2beb2746c0c923b4eb936fe1e2560c83b19/docs/README_GPTQ_CN.md#3-gptq%E9%87%8F%E5%8C%96%E6%A8%A1%E5%9E%8B%E7%9A%84%E6%8E%A8%E7%90%86):

```python
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
# The directory holds the quantized weights: gptq_model-4bit-128g.safetensors (shards 0-2).
tokenizer = LlamaTokenizer.from_pretrained(quantized_model_dir, add_eos_token=False,
                                           add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0",
                                           trust_remote_code=True)
```
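For reference, a minimal smoke test on the loaded model could look like the sketch below; the prompt and generation settings are illustrative assumptions, not part of the original report:

```python
# Hypothetical smoke test (not from the issue): run a short greedy
# generation to confirm the quantized checkpoint loads and decodes.
inputs = tokenizer("The capital of China is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```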
The Yuan2.0-M32 R&D team analyzed the current mainstream quantization schemes in depth, weighed model-compression gains against accuracy loss, and ultimately adopted the GPTQ quantization method, using AutoGPTQ as the quantization framework.
Model: Yuan2-M32-HF-INT4 (https://blog.csdn.net/2401_82700030/article/details/141469514)
Container: intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1
Test steps: log into the container and open the benchmark config:

```bash
docker exec -ti arc_vllm-new-2 bash
cd /benchmark/all-in-one/
vim config.yaml
```
config.yaml settings:
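The exact settings used are not reproduced in the report; for orientation, here is a minimal sketch of an ipex-llm all-in-one benchmark config.yaml, with assumed values (model path, low-bit format, test API, in/out lengths) rather than the reporter's actual ones:

```yaml
# Minimal sketch of an all-in-one benchmark config; all values are assumptions.
repo_id:
  - 'IEIT-Yuan/Yuan2-M32-HF-INT4'
local_model_hub: '/mnt/beegfs2'   # assumed local model directory
warm_up: 1
num_trials: 3
num_beams: 1                      # greedy search
low_bit: 'sym_int4'               # symmetric int4
batch_size: 1
in_out_pairs:
  - '1024-128'
test_api:
  - 'transformer_int4_gpu'        # assumed test API; the actual run may differ
cpu_embedding: False
```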
Then run `run-arc.sh`. The run fails with an error; the results log follows: