huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

StackLLaMA 2 DPO training with DeepSpeed: OOM #1358

Closed fancyerii closed 6 months ago

fancyerii commented 6 months ago

I can train StackLLaMA 2 on 8 GPUs with DDP, for which I have to pass {"device_map": {"": Accelerator().local_process_index}}; details can be found here. Now I want to train it with DeepSpeed ZeRO stage 3, because I will train a 70B model later, and a model that large can't be trained with DDP.
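For reference, a minimal sketch of that DDP loading path (the device_map value is the one above; torch_dtype is an assumption, not copied from the script):

    import torch
    from accelerate import Accelerator
    from transformers import AutoModelForCausalLM

    # Under DDP, each process loads the full model onto its own GPU by
    # mapping the whole model ("") to the local process index.
    model = AutoModelForCausalLM.from_pretrained(
        "sft/final_checkpoint",       # checkpoint path used in the command below
        torch_dtype=torch.bfloat16,   # assumption: bf16 weights
        device_map={"": Accelerator().local_process_index},
    )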

So I ran it with the deepspeed_zero3.yaml config:

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo" \
    --report_to="tensorboard"

It failed with:


Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 140, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2992, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.

As in [this issue](https://github.com/huggingface/trl/issues/1348), I passed a device_map to AutoModelForCausalLM.from_pretrained, but it seems DeepSpeed Zero-3 is not compatible with passing a device_map.
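Dropping the device_map, the load presumably reduces to something like this sketch (under ZeRO-3 with zero3_init_flag: true, from_pretrained should partition the parameters across ranks during initialization; torch_dtype is again an assumption):

    import torch
    from transformers import AutoModelForCausalLM

    # With DeepSpeed ZeRO-3 active, no device_map (and no
    # low_cpu_mem_usage=True) is passed; the weights should be
    # partitioned across ranks as the model is created.
    model = AutoModelForCausalLM.from_pretrained(
        "sft/final_checkpoint",
        torch_dtype=torch.bfloat16,   # assumption
    )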

So, with this parameter removed (as sketched above), it ran out of memory (OOM) this time:

Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 469.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Process 70018 has 4.62 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
    dpo_trainer = DPOTrainer(
  File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
    model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 477.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Process 70020 has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):

I have 8 A100 40GB GPUs, which I think should be enough for Llama 2 7B. I also checked the YAML config: it has "zero3_init_flag: true", so I expected it not to load the whole model onto a single GPU/device, but to have each rank load only its own shard of the parameters.

But in peft/utils/other.py:

    if not is_gptq_quantized:
        # cast all non INT8 parameters to fp32
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

it seems PEFT needs to cast bf16 parameters to fp32. While it was running, I saw 8 processes running on GPU 0; GPU 0's memory was used up and the run failed. So I guess PEFT doesn't shard the parameters across the 8 GPUs but loads them all onto a single GPU.
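A rough per-rank diagnostic for this (standard PyTorch calls only, nothing trl- or peft-specific): if ZeRO-3 were really sharding the weights, allocated memory should be spread roughly evenly across ranks instead of piling up on GPU 0.

    import torch
    import torch.distributed as dist

    def report_memory(tag: str) -> None:
        """Print allocated/reserved CUDA memory for this rank's current device."""
        rank = dist.get_rank() if dist.is_initialized() else 0
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] rank {rank} on cuda:{torch.cuda.current_device()}: "
              f"allocated {alloc:.2f} GiB, reserved {reserved:.2f} GiB")

    # e.g. call report_memory("after from_pretrained") right after loading the model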

fancyerii commented 6 months ago

see https://github.com/huggingface/peft/issues/1507