cc @SunMarc I think that the fix should go on the optimum side, but I am not sure, wdyt?
Hi @gesanqiu, there is indeed an issue. In the meantime, you can do `AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, use_safetensors=True, quantization_config=gptq_config)`. I will fix the issue on optimum @younesbelkada!
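For reference, a minimal sketch of that workaround; `model_path` and the GPTQConfig settings (bits, calibration dataset) are placeholders, not values from the original report:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_path = "path/to/your-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# illustrative settings; bits/dataset are not from the original report
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# use_safetensors=True is the suggested workaround until the optimum fix lands
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_safetensors=True,
    quantization_config=gptq_config,
)
```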
@SunMarc Thx. I also set `cache_block_outputs=False` in GPTQConfig to avoid OOM when quantizing the model.layers blocks.
Yes, this can also help with OOM since we don't cache the outputs!
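For completeness, the memory-saving option from the comment above looks like this in config form (bits and dataset are illustrative; `tokenizer` as loaded in the snippet above):

```python
from transformers import GPTQConfig

gptq_config = GPTQConfig(
    bits=4,                     # illustrative
    dataset="c4",               # illustrative calibration dataset
    tokenizer=tokenizer,        # tokenizer loaded as in the snippet above
    cache_block_outputs=False,  # recompute block outputs instead of caching them,
                                # trading extra compute for lower peak memory
)
```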
System Info
transformers version: 4.36.2

Who can help?
@younesbelkada
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I have 4×A40 (48 GB) GPUs on my machine, and I tried to quantize a 30B model with device_map='auto', but GPU memory utilization isn't balanced across the GPUs while quantizing the model.layers blocks, and OOM occurred. So I want to quantize the model on the CPU instead. The logs are as follows:

I think the issue is that the model is on the CPU but the input_ids encoded by the tokenizer aren't on the same device?
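For context, a rough sketch of the two failing setups described above (paths and quantization settings are placeholders, not the exact script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_path = "path/to/30b-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)  # illustrative

# Attempt 1: sharding across the 4 GPUs -- memory use is unbalanced and OOMs
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, device_map="auto", trust_remote_code=True,
#     quantization_config=gptq_config,
# )

# Attempt 2: loading on CPU (no device_map) -- quantization then fails,
# seemingly because of a CPU/GPU device mismatch with the calibration inputs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=gptq_config,
)
```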
Expected behavior
Quantizing the model succeeds.