huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

StackLlaMa 2 dpo train with deepspeed oom #2484

Closed · fancyerii closed this issue 6 months ago

fancyerii commented 6 months ago

System Info

transformers             4.37.2
accelerate               0.26.1
peft                     0.8.2
bitsandbytes             0.43.0.dev0 # latest built from source
trl                      0.7.11.dev0 # latest built from source
torch                    2.2.0
python                   3.9.18

Reproduction

git clone https://github.com/huggingface/trl.git
cd trl

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo" \
    --report_to="tensorboard"
Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 469.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Process 70018 has 4.62 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 477.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Process 70020 has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):

I have 8 A100 40GB GPUs, which I think should be enough for llama2-7b. I also checked the yaml config: it has "zero3_init_flag: true", so I expected it not to load the whole model onto a single GPU/device but to have each rank load only its own parameter shard.
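
For context, the back-of-the-envelope arithmetic behind "it should be enough" (a rough sketch; the 7B parameter count is an approximation):

# Expected parameter memory for llama2-7b, with and without ZeRO-3 sharding.
n_params = 7e9            # llama2-7b, approximate
bytes_per_fp32 = 4
full_fp32_gib = n_params * bytes_per_fp32 / 2**30
print(full_fp32_gib)      # ~26 GiB if a single GPU held every fp32 parameter
print(full_fp32_gib / 8)  # ~3.3 GiB per rank if ZeRO-3 keeps the shards partitioned

So even after an fp32 upcast, a 40GB A100 should have plenty of room as long as each rank only holds its 1/8 shard.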

But in peft/utils/other.py

    if not is_gptq_quantized:
        # cast all non INT8 parameters to fp32
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

it seems peft needs to cast the bf16 parameters to fp32. When the script ran, I saw all 8 processes allocating memory on gpu0; gpu0's memory was used up and it failed. So I guess peft does not support sharding the parameters across the 8 GPUs but instead loads them all onto a single GPU.
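
A quick sanity check on the numbers in the first traceback above supports this:

# Per-process memory on GPU 0 reported in the OOM message (GiB), including this rank.
per_process_gib = [4.62, 4.61, 4.62, 4.62, 4.62, 5.59, 4.62, 5.59]
used = sum(per_process_gib)
print(used)          # ~38.9 GiB already sitting on the 39.39 GiB card
print(39.39 - used)  # ~0.5 GiB left, which is why the 500 MiB allocation for the cast fails

All eight ranks have allocated several GiB on GPU 0 before training even starts, so the fp32 cast has nowhere to go.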

Expected behavior

No exception: the DPO training should start with the model parameters sharded across the 8 GPUs.

BenjaminBossan commented 6 months ago

Duplicate of https://github.com/huggingface/peft/issues/1507. Please only open one issue in the future :)