huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

CUDA OOM when preparing two models #3004

Open ojh31 opened 1 month ago

ojh31 commented 1 month ago

System Info

- `Accelerate` version: 0.29.2
- Platform: Linux-5.15.0-106-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /usr/local/venv/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.38 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        Not found

Reproduction

Run the following on a 2×H100 node:

accelerate launch --config_file=accelerate_config.yaml --num-processes=2 foo.py

accelerate_config.yaml:

# Generated with `accelerate config`, mostly keeping the default values.
compute_environment: LOCAL_MACHINE
debug: false
# We want FSDP to shard model parameters between devices.
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no"
num_machines: 1
# We overwrite this with a CLI argument
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

foo.py:

from accelerate import Accelerator
import torch
from transformers import AutoModelForCausalLM

accelerator = Accelerator(cpu=False)

# Both checkpoints are loaded in fp32 on the CPU first, then handed to
# Accelerate together so that prepare() wraps them with FSDP.
model1 = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-14B-Chat")
model2 = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
model1, model2 = accelerator.prepare(model1, model2)
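
For what it's worth, the variant below is only a sketch of changes that might reduce the pre-sharding footprint (loading the weights in bf16 and preparing the models one at a time); it is not a confirmed workaround, and whether it avoids the spike is exactly what this issue is asking about.

from accelerate import Accelerator
import torch
from transformers import AutoModelForCausalLM

accelerator = Accelerator(cpu=False)

# Loading in bf16 halves the per-model weight footprint relative to fp32.
model1 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat", torch_dtype=torch.bfloat16
)
model2 = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=torch.bfloat16
)
# Prepare the models one at a time (instead of in a single call) to see
# whether the spike is tied to the combined prepare() call.
model1 = accelerator.prepare(model1)
model2 = accelerator.prepare(model2)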

Expected behavior

The machine should not throw a CUDA OOM error. The models take up roughly 4 bytes × (14B + 7B) ≈ 84 GB in fp32, which should comfortably fit on a 2×80 GB machine. I can load the 7B model on its own even after setting torch.cuda.set_per_process_memory_fraction(0.25), for example, yet somehow trying to load both is causing a massive memory spike.
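
For reference, here is the back-of-the-envelope arithmetic written out; the assumption in the last comment, that a rank might materialize the full unsharded fp32 weights on its GPU before FSDP shards them, is a guess on my part rather than something confirmed.

# Rough memory arithmetic for the two checkpoints (fp32 weights only,
# ignoring optimizer state, buffers, CUDA context, and fragmentation).
PARAMS = {"Qwen/Qwen1.5-14B-Chat": 14e9, "lmsys/vicuna-7b-v1.5": 7e9}
BYTES_PER_PARAM = 4  # fp32

total_gb = sum(PARAMS.values()) * BYTES_PER_PARAM / 1e9
print(f"total fp32 weights:  {total_gb:.0f} GB")      # ~84 GB

# After FULL_SHARD across 2 ranks, each GPU should hold roughly half.
print(f"per GPU after shard: {total_gb / 2:.0f} GB")  # ~42 GB

# If a rank were to hold both unsharded fp32 models on its GPU before
# sharding, the transient peak would be the full ~84 GB, which already
# exceeds a single 80 GB H100.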

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.