soulteary / docker-llama2-chat

Play LLaMA2 (official / Chinese / INT4 / llama2.cpp) Together! ONLY 3 STEPS! (non-GPU / 5GB vRAM / 8~14GB vRAM)
https://www.zhihu.com/people/soulteary/posts
Apache License 2.0

Error when loading the quantized llama2 model #15

Open WheatJH opened 11 months ago

WheatJH commented 11 months ago

With llama2-7b-chat-hf, I followed the provided quantization steps to produce a 4-bit model and filled in the missing model files. Loading it via AutoModelForCausalLM.from_pretrained raises:

NotImplementedError: Cannot copy out of meta tensor; no data!

Environment:
accelerate==0.21.0
bitsandbytes==0.40.2
gradio==3.37.0
protobuf==3.20.3
scipy==1.11.1
sentencepiece==0.1.99
transformers==4.31.0
torch==1.13.0a0+340c412
cuda==11.7
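For context, this error comes from PyTorch's "meta" device: a meta tensor carries only shape and dtype metadata, with no underlying storage, so any attempt to copy it to a real device fails. It typically means some weights were never materialized during loading. A minimal reproduction in plain PyTorch (independent of transformers) looks like this:

```python
import torch

# A tensor on the "meta" device has shape/dtype metadata but no data buffer.
t = torch.empty(2, 3, device="meta")

try:
    # Materializing on CPU requires actual data, which a meta tensor lacks.
    t.to("cpu")
except NotImplementedError as e:
    print(type(e).__name__, e)
```

This is why the fix usually lies in how the checkpoint is loaded (so every parameter gets real weights), not in the model files themselves.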

chopin1998 commented 10 months ago

I took a look -- it seems newer versions of transformers can quantize the model on the fly at load time, so the separate quantization step may not be needed?

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"

# 4-bit NF4 quantization applied at load time (requires bitsandbytes)
nf4_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id,
                                                 quantization_config=nf4_config)