huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

StackLlaMa 2 dpo train with deepspeed oom #1507

Closed fancyerii closed 6 months ago

fancyerii commented 6 months ago

System Info

transformers 4.37.2
accelerate 0.26.1
peft 0.8.2
bitsandbytes 0.43.0.dev0  # latest, built from source
trl 0.7.11.dev0  # latest, built from source
torch 2.2.0
python 3.9.18

Who can help?

No response

Information

Tasks

Reproduction

git clone https://github.com/huggingface/trl.git

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo" \
    --report_to="tensorboard"

error message:

Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 469.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Process 70018 has 4.62 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 477.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Process 70020 has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):

I have 8 A100 40GB GPUs, which I think should be enough for LLaMA-2-7B. I also checked the yaml config and it has "zero3_init_flag: true", so I expected it not to load the whole model onto a single GPU/device, but to have each rank load only its own parameter shard.

But in peft/utils/other.py there is:

    if not is_gptq_quantized:
        # cast all non INT8 parameters to fp32
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

It seems PEFT needs to cast the bf16 parameters to fp32. When the script ran, I saw 8 processes running on GPU 0; GPU 0's memory was used up and it failed. So I guess PEFT does not shard the parameters across the 8 GPUs but loads them all onto a single GPU.
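To make the arithmetic explicit, here is a rough sketch using only the per-process figures reported in the OOM message above (illustrative, not an independent measurement):

    # Rough estimate based on the figures in the OOM message above
    # (illustrative only; not an independent measurement).
    gib_per_process = 4.6   # each rank reportedly holds ~4.6 GiB on GPU 0
    n_processes = 8         # all 8 ranks show up on GPU 0
    total = gib_per_process * n_processes
    print(f"~{total:.0f} GiB already in use on a 40 GiB card")
    # With ~37 GiB taken, the extra 500 MiB fp32 buffer requested during the
    # upcast in prepare_model_for_kbit_training is enough to trigger the OOM.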

Expected behavior

Run correctly, with the model sharded across the 8 GPUs instead of every process allocating on GPU 0.

BenjaminBossan commented 6 months ago

I'm not an expert on DeepSpeed, so I'm not sure why this is happening.

@pacman100 did, however, recently add a comprehensive guide to our docs. Maybe you can find something there that could help you?

pacman100 commented 6 months ago

You cannot use DeepSpeed together with bitsandbytes quantization; the two are not compatible with each other.

pacman100 commented 6 months ago

You should either use LoRA + DeepSpeed, or QLoRA on its own (without DeepSpeed).
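For concreteness, a minimal sketch of what those two options can look like (illustrative LoRA hyperparameters; the model path is taken from the reproduction command above):

    # Option 1 (sketch): LoRA + DeepSpeed ZeRO-3 -- load the base model without
    # bitsandbytes quantization and let Accelerate/DeepSpeed shard it.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(
        "sft/final_checkpoint",      # path from the reproduction command above
        torch_dtype=torch.bfloat16,  # no load_in_4bit / quantization_config here
    )
    peft_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # illustrative target modules
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, peft_config)

    # Option 2 (sketch): QLoRA without DeepSpeed -- keep the 4-bit quantized
    # load but launch without the ZeRO-3 accelerate config (e.g. plain DDP).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    qlora_model = AutoModelForCausalLM.from_pretrained(
        "sft/final_checkpoint",
        quantization_config=bnb_config,
    )

Either way, the point is not to combine the bitsandbytes 4-bit load with the ZeRO-3 accelerate config in the same run.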