huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

StackLlaMa 2 dpo train with deepspeed oom #2484

Closed · fancyerii closed this issue 6 months ago

fancyerii commented 6 months ago

System Info

transformers             4.37.2
accelerate               0.26.1
peft                     0.8.2
bitsandbytes             0.43.0.dev0 # latest built from source
trl                      0.7.11.dev0 # latest built from source
torch                    2.2.0
python                   3.9.18

Reproduction

git clone https://github.com/huggingface/trl.git
cd trl

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo" \
    --report_to="tensorboard"
Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 469.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Process 70018 has 4.62 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 477.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Process 70020 has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):

I have 8 A100 40GB GPUs, which I think should be enough for llama2-7b. I also checked the yaml config: it has "zero3_init_flag: true", so I expected it not to load the whole model onto a single GPU/device but to have each rank load only its own parameter shard.
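
For context, the back-of-the-envelope arithmetic behind "it should be enough" (a rough sketch; the 7B parameter count is an approximation):

# Expected parameter memory for llama2-7b, with and without ZeRO-3 sharding.
n_params = 7e9            # llama2-7b, approximate
bytes_per_fp32 = 4
full_fp32_gib = n_params * bytes_per_fp32 / 2**30
print(full_fp32_gib)      # ~26 GiB if a single GPU held every fp32 parameter
print(full_fp32_gib / 8)  # ~3.3 GiB per rank if ZeRO-3 keeps the shards partitioned

So even after an fp32 upcast, a 40GB A100 should have plenty of room as long as each rank only holds its 1/8 shard.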

But in peft/utils/other.py

    if not is_gptq_quantized:
        # cast all non INT8 parameters to fp32
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

it seems peft needs to cast the bf16 parameters to fp32. When the script ran, I saw all 8 processes allocating memory on gpu0; gpu0's memory was used up and it failed. So I guess peft does not support sharding the parameters across the 8 GPUs but instead loads them all onto a single GPU.
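
A quick sanity check on the numbers in the first traceback above supports this:

# Per-process memory on GPU 0 reported in the OOM message (GiB), including this rank.
per_process_gib = [4.62, 4.61, 4.62, 4.62, 4.62, 5.59, 4.62, 5.59]
used = sum(per_process_gib)
print(used)          # ~38.9 GiB already sitting on the 39.39 GiB card
print(39.39 - used)  # ~0.5 GiB left, which is why the 500 MiB allocation for the cast fails

All eight ranks have allocated several GiB on GPU 0 before training even starts, so the fp32 cast has nowhere to go.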

Expected behavior

No exception: the DPO training should start with the model parameters sharded across the 8 GPUs.

BenjaminBossan commented 6 months ago

Duplicate of https://github.com/huggingface/peft/issues/1507. Please only open one issue in the future :)