huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

CUDA OOM when preparing two models #3004

Open ojh31 opened 1 month ago

ojh31 commented 1 month ago

System Info

- `Accelerate` version: 0.29.2
- Platform: Linux-5.15.0-106-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /usr/local/venv/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.38 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        Not found

Reproduction

Run the following on a 2×H100 node:

accelerate launch --config_file=accelerate_config.yaml --num-processes=2 foo.py

accelerate_config.yaml:

# Generated with `accelerate config`, mostly keeping the default values.
compute_environment: LOCAL_MACHINE
debug: false
# We want FSDP to shard model parameters between devices.
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: "no"
num_machines: 1
# We overwrite this with a CLI argument
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

foo.py:

from accelerate import Accelerator
import torch
from transformers import AutoModelForCausalLM

accelerator = Accelerator(cpu=False)

# Both checkpoints are loaded in fp32 on the CPU first, then handed to
# Accelerate together so that prepare() wraps them with FSDP.
model1 = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-14B-Chat")
model2 = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
model1, model2 = accelerator.prepare(model1, model2)
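
For what it's worth, the variant below is only a sketch of changes that might reduce the pre-sharding footprint (loading the weights in bf16 and preparing the models one at a time); it is not a confirmed workaround, and whether it avoids the spike is exactly what this issue is asking about.

from accelerate import Accelerator
import torch
from transformers import AutoModelForCausalLM

accelerator = Accelerator(cpu=False)

# Loading in bf16 halves the per-model weight footprint relative to fp32.
model1 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat", torch_dtype=torch.bfloat16
)
model2 = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=torch.bfloat16
)
# Prepare the models one at a time (instead of in a single call) to see
# whether the spike is tied to the combined prepare() call.
model1 = accelerator.prepare(model1)
model2 = accelerator.prepare(model2)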

Expected behavior

The machine should not throw a CUDA OOM error. The models take up roughly 4 bytes × (14B + 7B) ≈ 84 GB in fp32, which should comfortably fit on a 2×80 GB machine. I can load the 7B model on its own even after setting torch.cuda.set_per_process_memory_fraction(0.25), for example, yet somehow trying to load both is causing a massive memory spike.
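
For reference, here is the back-of-the-envelope arithmetic written out; the assumption in the last comment, that a rank might materialize the full unsharded fp32 weights on its GPU before FSDP shards them, is a guess on my part rather than something confirmed.

# Rough memory arithmetic for the two checkpoints (fp32 weights only,
# ignoring optimizer state, buffers, CUDA context, and fragmentation).
PARAMS = {"Qwen/Qwen1.5-14B-Chat": 14e9, "lmsys/vicuna-7b-v1.5": 7e9}
BYTES_PER_PARAM = 4  # fp32

total_gb = sum(PARAMS.values()) * BYTES_PER_PARAM / 1e9
print(f"total fp32 weights:  {total_gb:.0f} GB")      # ~84 GB

# After FULL_SHARD across 2 ranks, each GPU should hold roughly half.
print(f"per GPU after shard: {total_gb / 2:.0f} GB")  # ~42 GB

# If a rank were to hold both unsharded fp32 models on its GPU before
# sharding, the transient peak would be the full ~84 GB, which already
# exceeds a single 80 GB H100.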

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.