huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can the BNB quantization process be on GPU? #30770

Closed mxjmtxrm closed 3 weeks ago

mxjmtxrm commented 3 months ago

System Info

Who can help?

@SunMarc and @younesbelkada

Information

Tasks

Reproduction

I noticed that when the quantization config is not None and is_deepspeed_zero3_enabled() is True, the device map is set to 'cpu', so the quantization process runs on the CPU. Why is this? Can the quantization be run on the GPUs instead?
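For reference, a minimal sketch of the device-map fallback described above. The helper name pick_device_map and the simplified logic are illustrative assumptions, not the actual transformers source:

import torch
from transformers.integrations import is_deepspeed_zero3_enabled

def pick_device_map(quantization_config, user_device_map=None):
    # Hypothetical helper mirroring the observed behavior: under DeepSpeed
    # ZeRO-3, parameters are partitioned across ranks at init time, so the
    # loader falls back to a CPU device map for quantization.
    if quantization_config is not None and is_deepspeed_zero3_enabled():
        return "cpu"  # quantization then happens on CPU, as reported above
    return user_device_map if user_device_map is not None else "auto"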

Expected behavior

--

younesbelkada commented 3 months ago

Hi @mxjmtxrm, thanks for the issue! Do you have a small reproducer so we can better picture what is going on?

mxjmtxrm commented 3 months ago
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_storage=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)

The command is:

accelerate launch --config_file "configs/deepspeed_config_z3.yaml" test.py

And the deepspeed_config_z3.yaml is:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The GPU memory usage during from_pretrained grows very slowly, because the quantization process is running on the CPU. The same happens with other quantization methods, such as EETQ and AWQ.
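A quick way to confirm where the weights end up, assuming the model variable from the snippet above; this is only a diagnostic sketch, not part of the reproducer:

import torch

# Print any parameters that did not land on a CUDA device. Note that under
# ZeRO-3 the parameters are partitioned, so many may show as empty shards.
for name, param in model.named_parameters():
    if param.device.type != "cuda":
        print(name, param.device, param.dtype)

if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")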

amyeroberts commented 2 months ago

cc @younesbelkada @SunMarc

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 3 days ago

Hey! Pretty sure the quantization cannot happen on CPU (yet), and it is just a bit slow on GPU as well.
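For anyone landing here, a hedged workaround sketch: loading outside a DeepSpeed ZeRO-3 context with an explicit CUDA device map makes the bitsandbytes quantization run on the GPU. This assumes a single-GPU setup where the 4-bit model fits on one device, and reuses bnb_config from the reproducer above:

import torch
from transformers import AutoModelForCausalLM

# Pin every module to CUDA device 0 so quantization happens on the GPU
# instead of falling back to CPU under zero.Init().
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,  # same BitsAndBytesConfig as above
    torch_dtype=torch.float16,
    device_map={"": 0},
)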