huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

CUDA OOM while loading Llama3.1 405B #2978

Open · echo-yi opened this issue 1 month ago

echo-yi commented 1 month ago

System Info

- `Accelerate` version: 0.33.0
- Platform: Linux-5.15.133+-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.25.2
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1842.60 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: no
    - use_cpu: True
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: False
    - tpu_use_cluster: False
    - tpu_use_sudo: False

Reproduction

This line, `model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct", ...)`, throws a CUDA OOM error. Any help would be appreciated!

pretrain.py

...
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization settings used when loading the checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
...
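For contrast, a single-process load of the same 4-bit checkpoint would normally rely on `device_map="auto"`, which lets Accelerate shard the quantized weights across all visible GPUs (and spill to CPU if needed). This is a minimal sketch of that variant, not the reporter's DeepSpeed/ZeRO-3 setup, and the `max_memory` caps are assumptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# device_map="auto" asks Accelerate to place shards device by device; max_memory
# caps each device so room is left for activations (values here are illustrative).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={i: "70GiB" for i in range(8)} | {"cpu": "800GiB"},
)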

zero3_config.yaml

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Command: `accelerate launch --config_file zero3_config.yaml pretrain.py --num_processes=8 --multi_gpu`. To be precise, I'm running this command with Kubeflow:

from kfp import dsl

# IMAGE_PATH, PIPELINE_NAME and PIPELINE_ROOT are defined elsewhere in the pipeline code.
@dsl.container_component
def pretrain():
    return dsl.ContainerSpec(
        image=IMAGE_PATH,
        command=['accelerate', 'launch', '--config_file', 'zero3_config.yaml',
                 'pretrain.py', '--num_processes=8', '--multi_gpu'])

@dsl.pipeline(name=PIPELINE_NAME,
              description="pretrain",
              pipeline_root=PIPELINE_ROOT,
              )
def pipeline_func():
    train_task = pretrain()
    train_task.set_accelerator_type("nvidia.com/gpu")
    train_task.set_accelerator_limit(8)
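For completeness, a pipeline defined this way is typically compiled to a package and then submitted to the cluster; a minimal sketch assuming KFP v2 (which the `@dsl.container_component` syntax implies) and a hypothetical output filename:

from kfp import compiler

# Compile the pipeline function above into a package that can be uploaded and run in Kubeflow.
compiler.Compiler().compile(
    pipeline_func=pipeline_func,
    package_path="pretrain_pipeline.yaml",  # hypothetical filename
)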

Versions

# bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11 and peft>0.9.0
import bitsandbytes
import accelerate
import transformers
import trl
import peft
print(f'bitsandbytes=={bitsandbytes.__version__}') # 0.43.2
print(f'accelerate=={accelerate.__version__}') # 0.33.0
print(f'transformers=={transformers.__version__}') # 4.43.2
print(f'trl=={trl.__version__}') # 0.9.6
print(f'peft=={peft.__version__}') # 0.11.1

Error log

Traceback (most recent call last):
  File "//pretrain.py", line 147, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3916, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4390, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 938, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 217, in create_quantized_param
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(target_device)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 327, in to
    return self._quantize(device)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 291, in _quantize
    w = self.data.contiguous().to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.62 GiB. GPU 0 has a total capacity of 79.11 GiB of which 1.29 GiB is free. Process 194 has 0 bytes memory in use. Including non-PyTorch memory, this process has 0 bytes memory in use. Process 201 has 0 bytes memory in use. Process 195 has 0 bytes memory in use. Process 199 has 0 bytes memory in use. Process 198 has 0 bytes memory in use. Process 200 has 0 bytes memory in use. Process 196 has 0 bytes memory in use. Of the allocated memory 7.30 GiB is allocated by PyTorch, and 8.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
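On the allocator hint at the end of the traceback: only 8.22 MiB is reserved-but-unallocated here, so fragmentation does not look like the dominant factor, but for completeness the setting has to be in the environment before CUDA is initialized, e.g.:

import os

# Must be set before the first CUDA allocation, otherwise the caching allocator
# ignores it; simplest is to set it before importing torch (or export it in the
# container spec).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var on purpose so the allocator picks it up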

Expected behavior

Load and train Llama3.1 405B without CUDA OOM.

avianion commented 1 month ago

How much VRAM do you have?

echo-yi commented 1 month ago

640 GB when using 8 GPUs
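For context on those numbers, a rough back-of-the-envelope estimate of what the 4-bit weights alone occupy versus the reported 8 × 80 GB (this deliberately ignores double-quant overhead, activations, gradients, optimizer state, and the KV cache):

params = 405e9                  # Llama 3.1 405B parameter count
bytes_per_param = 0.5           # 4-bit storage

weights_gb = params * bytes_per_param / 1e9   # ~202.5 GB for the weights alone
total_vram_gb = 8 * 80                        # 640 GB across 8 x H100 80GB

print(f"4-bit weights: ~{weights_gb:.0f} GB of {total_vram_gb} GB total VRAM")
print(f"evenly sharded: ~{weights_gb / 8:.1f} GB per 80 GB GPU")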