huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can't quantize gptq model on CPU runtime? #28632

Closed: gesanqiu closed this issue 9 months ago

gesanqiu commented 9 months ago

System Info

Who can help?

@younesbelkada

Information

Tasks

Reproduction

from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import torch

model_path = r'/data1/ls/hf_models/multi_lan-mango-dev/'
save_path = r'/data1/ls/hf_models/multi_lan-mango-dev-gptq'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
gptq_config = GPTQConfig(bits=4, dataset="wikitext2", tokenizer=tokenizer, group_size=32, use_exllama=False)
quantized_model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='cpu', use_safetensors=True, quantization_config=gptq_config)

# quantized_model.to("cpu")
quantized_model.save_pretrained(save_path)

I have 4x A40 (48 GB) GPUs on my machine. I tried to quantize a 30B model with device_map='auto', but GPU memory utilization wasn't balanced across the GPUs while the model.layers blocks were being quantized, and an OOM occurred. So I want to quantize the model on the CPU instead. The logs are shown below:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:07<00:00,  2.10it/s]
Traceback (most recent call last):
  File "/home/dell/workSpace/test/gptq_hf.py", line 9, in <module>
    quantized_model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='cpu', use_safetensors=True, quantization_config=gptq_config)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3780, in from_pretrained
    quantizer.quantize_model(model, quantization_config.tokenizer)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 431, in quantize_model
    model(**data)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1025, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

I think the issue is that the model is on the CPU, but the input_ids encoded by the tokenizer (the calibration data) end up on cuda:0?
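For context, here is a minimal, hypothetical PyTorch sketch (not taken from optimum) that reproduces the same kind of device mismatch, and shows that moving the ids onto the weight's device avoids the error:

import torch
import torch.nn as nn

emb = nn.Embedding(100, 8)            # embedding weight stays on the CPU
ids = torch.tensor([[1, 2, 3]])
if torch.cuda.is_available():
    ids = ids.to("cuda:0")            # emb(ids) would now raise the same
                                      # "Expected all tensors to be on the same device" error
ids = ids.to(emb.weight.device)       # move the ids back to the weight's device
out = emb(ids)                        # works: both tensors are on the CPU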

Expected behavior

Quantizing the model succeeds.

younesbelkada commented 9 months ago

cc @SunMarc I think the fix should go on the optimum side, but I am not sure, wdyt?

SunMarc commented 9 months ago

Hi @gesanqiu, there is indeed an issue. In the meantime, you can do AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, use_safetensors=True, quantization_config=gptq_config). I will fix the issue on optimum @younesbelkada !
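A sketch of that workaround, reusing model_path, save_path, and the same GPTQConfig values as the reproduction script above (the only change is dropping device_map='cpu'):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
gptq_config = GPTQConfig(bits=4, dataset="wikitext2", tokenizer=tokenizer, group_size=32, use_exllama=False)

# Workaround: omit device_map='cpu' and let the quantizer handle device placement.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_safetensors=True,
    quantization_config=gptq_config,
)
quantized_model.save_pretrained(save_path)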

gesanqiu commented 9 months ago

@SunMarc Thx. I also set cache_block_outputs=False in GPTQConfig to avoid OOM when quantizing model.layers blocks.
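For anyone else hitting the OOM, a sketch of the adjusted config (same values as the reproduction script, plus cache_block_outputs=False):

from transformers import GPTQConfig

# cache_block_outputs=False recomputes block outputs instead of caching them,
# trading extra compute for lower peak memory while quantizing model.layers.
gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",
    tokenizer=tokenizer,          # tokenizer from the reproduction script
    group_size=32,
    use_exllama=False,
    cache_block_outputs=False,
)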

SunMarc commented 9 months ago

Yes, this can also help with OOM since we don't cache the outputs!