huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can't quantize gptq model on CPU runtime? #28632

Closed: gesanqiu closed this issue 9 months ago

gesanqiu commented 9 months ago

System Info

Who can help?

@younesbelkada

Information

Tasks

Reproduction

from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import torch

model_path = r'/data1/ls/hf_models/multi_lan-mango-dev/'
save_path = r'/data1/ls/hf_models/multi_lan-mango-dev-gptq'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
gptq_config = GPTQConfig(bits=4, dataset="wikitext2", tokenizer=tokenizer, group_size=32, use_exllama=False)
quantized_model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='cpu', use_safetensors=True, quantization_config=gptq_config)

# quantized_model.to("cpu")
quantized_model.save_pretrained(save_path)

I have 4x A40 (48 GB) GPUs on my machine. I tried to quantize a 30B model with device_map='auto', but GPU memory utilization wasn't balanced across the GPUs while the model.layers blocks were being quantized, and an OOM occurred. So I want to quantize the model on the CPU instead. The logs are shown below:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:07<00:00,  2.10it/s]
Traceback (most recent call last):
  File "/home/dell/workSpace/test/gptq_hf.py", line 9, in <module>
    quantized_model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='cpu', use_safetensors=True, quantization_config=gptq_config)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3780, in from_pretrained
    quantizer.quantize_model(model, quantization_config.tokenizer)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 431, in quantize_model
    model(**data)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1025, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/dell/anaconda3/envs/vllm-kv_quant/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

I think the issue is that the model is on the CPU, but the input_ids encoded by the tokenizer (the calibration data) end up on cuda:0?
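For context, here is a minimal, hypothetical PyTorch sketch (not taken from optimum) that reproduces the same kind of device mismatch, and shows that moving the ids onto the weight's device avoids the error:

import torch
import torch.nn as nn

emb = nn.Embedding(100, 8)            # embedding weight stays on the CPU
ids = torch.tensor([[1, 2, 3]])
if torch.cuda.is_available():
    ids = ids.to("cuda:0")            # emb(ids) would now raise the same
                                      # "Expected all tensors to be on the same device" error
ids = ids.to(emb.weight.device)       # move the ids back to the weight's device
out = emb(ids)                        # works: both tensors are on the CPU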

Expected behavior

Quantizing the model succeeds.

younesbelkada commented 9 months ago

cc @SunMarc I think the fix should go on the optimum side, but I am not sure, wdyt?

SunMarc commented 9 months ago

Hi @gesanqiu, there is indeed an issue. In the meantime, you can do AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, use_safetensors=True, quantization_config=gptq_config). I will fix the issue on optimum @younesbelkada !
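A sketch of that workaround, reusing model_path, save_path, and the same GPTQConfig values as the reproduction script above (the only change is dropping device_map='cpu'):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
gptq_config = GPTQConfig(bits=4, dataset="wikitext2", tokenizer=tokenizer, group_size=32, use_exllama=False)

# Workaround: omit device_map='cpu' and let the quantizer handle device placement.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_safetensors=True,
    quantization_config=gptq_config,
)
quantized_model.save_pretrained(save_path)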

gesanqiu commented 9 months ago

@SunMarc Thx. I also set cache_block_outputs=False in GPTQConfig to avoid OOM when quantizing model.layers blocks.
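For anyone else hitting the OOM, a sketch of the adjusted config (same values as the reproduction script, plus cache_block_outputs=False):

from transformers import GPTQConfig

# cache_block_outputs=False recomputes block outputs instead of caching them,
# trading extra compute for lower peak memory while quantizing model.layers.
gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",
    tokenizer=tokenizer,          # tokenizer from the reproduction script
    group_size=32,
    use_exllama=False,
    cache_block_outputs=False,
)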

SunMarc commented 9 months ago

Yes, this can also help with OOM since we don't cache the outputs!