modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

dpo InternVL2-8B meets OOM #2082

Closed bonre closed 1 month ago

bonre commented 1 month ago

Thank you very much for your work! I ran into the following problem when using DPO to train a fully fine-tuned InternVL2-8B model:

Here is my fine-tuning script:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
    --rlhf_type dpo \
    --model_type internvl2-8b \
    --model_id_or_path /data/SWIFT/InternVL/experience_2/Full_0913/checkpoint-2880 \
    --output_dir /data/SWIFT/InternVL/experience_2/dpo_0919 \
    --add_output_dir_suffix False \
    --dtype bf16 \
    --beta 0.1 \
    --rpo_alpha 0.1 \
    --sft_type lora \
    --dataset /workspace/multi_modal_model/DPOdataset/v2_0920.json \
    --num_train_epochs 2 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --batch_size 1 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps 2 \
    --warmup_ratio 0.01 \
    --max_length -1 \
    --save_total_limit 2 \
    --save_strategy 'epoch' \
    --save_steps 6 \
    --device_max_memory 20GB 20GB 20GB 20GB 20GB 20GB 20GB 20GB \
    --logging_steps 2

The script above currently runs part of the way, but memory keeps growing during training until it OOMs, so a full epoch never completes. I tried MP+DDP and hit an error (see the related issue); I tried DeepSpeed, but even ZeRO-3 OOMs immediately, without finishing a single step. My GPU environment is 8*A100 40G. Also, setting device_max_memory under MP seems to have no effect; memory is still allocated unevenly, as shown in the screenshot:

[screenshot: uneven per-GPU memory allocation]

Is this a bug, or is something else causing it? Thank you very much for your help!
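For reference, the per-GPU usage in the screenshot can be watched live while training runs with a plain nvidia-smi query (the 5-second refresh interval below is arbitrary):

# Log per-GPU memory usage every 5 seconds while training runs
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5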

bonre commented 1 month ago

The error message is as follows:

Traceback (most recent call last):
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/cli/rlhf.py", line 5, in <module>
    rlhf_main()
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
    return trainer_train(
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/sft.py", line 456, in trainer_train
    trainer.train(training_args.resume_from_checkpoint)
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 424, in train
    res = super().train(resume_from_checkpoint, *args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1513, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1462, in get_batch_loss_metrics
    reference_chosen_logps, reference_rejected_logps = self.concatenated_forward(
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 739, in concatenated_forward
    return super().concatenated_forward(model, model_kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1391, in concatenated_forward
    all_logps, size_completion = self.get_batch_logps(
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 744, in get_batch_logps
    return super().get_batch_logps(logits, labels, *args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1343, in get_batch_logps
    per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.56 GiB. GPU 0 has a total capacty of 39.38 GiB of which 5.37 GiB is free. Including non-PyTorch memory, this process has 34.01 GiB memory in use. Of the allocated memory 31.68 GiB is allocated by PyTorch, and 1.81 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Train:  23%|█████████████████████████████████████████████████████▉                                                                                                                                                                                         | 60/266 [12:57<44:28, 12.96s/it]
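The OOM message itself suggests tuning the allocator; setting the allocator config before launching is one thing to try (a sketch only: the 128 MB split size is just an example and may not by itself avoid the OOM):

# Follow the hint in the OOM message: cap allocator block splits to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then launch the same `swift rlhf` command as above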
bonre commented 1 month ago

The error with ZeRO-3:

Traceback (most recent call last):
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/cli/rlhf.py", line 5, in <module>
    rlhf_main()
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
    return trainer_train(
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/sft.py", line 456, in trainer_train
    trainer.train(training_args.resume_from_checkpoint)
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 424, in train
    res = super().train(resume_from_checkpoint, *args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1513, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1462, in get_batch_loss_metrics
    reference_chosen_logps, reference_rejected_logps = self.concatenated_forward(
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 716, in concatenated_forward
    outputs = model(**model_kwargs, use_cache=False)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/peft/peft_model.py", line 1430, in forward
    return self.base_model(
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/utils/model.py", line 4587, in _new_func
    res = _old_func(submodel, *args, **kwargs)
  File "/home/.cache/huggingface/modules/transformers_modules/checkpoint-2880/modeling_internlm2.py", line 1082, in forward
    logits = logits.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.42 GiB. GPU 4 has a total capacty of 39.38 GiB of which 4.98 GiB is free. Including non-PyTorch memory, this process has 34.39 GiB memory in use. Of the allocated memory 28.26 GiB is allocated by PyTorch, and 5.47 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Jintao-Huang commented 1 month ago

Perhaps you could try adjusting it with device_map_config.
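For illustration, such a device map could be a small JSON file mapping submodules to GPU indices (a sketch only: the module names and the exact argument name below are assumptions, so check swift rlhf --help for your version):

# Hypothetical device map pinning the InternVL2 vision tower and the LLM to different GPUs.
# Module names are illustrative and must match the real submodule names of the checkpoint.
cat > device_map.json <<'EOF'
{
  "vision_model": 0,
  "mlp1": 0,
  "language_model": 1
}
EOF
# Then point the device-map argument mentioned above at this file, e.g.
#   swift rlhf ... --device_map_config_path device_map.json   (argument name may differ by version)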

hill2hill commented 1 week ago

Sorry to bother you. May I ask how it was solved? @bonre

bonre commented 1 week ago

This was caused by an incompatibility in an older version and should be resolved in the latest release; I can now fine-tune with ZeRO-3 normally. If you still hit OOM, check whether your input samples are too long or whether GPU memory is simply insufficient for the model you are using. You can try setting max_length to 2048 or lower.
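Roughly, the adjusted launch looks like this (a sketch: it assumes your swift version accepts the default-zero3 shorthand for --deepspeed, otherwise pass a path to a ZeRO-3 JSON; paths are the ones from the original script):

# Sketch: enable ZeRO-3 and cap the sequence length, keeping the rest of the original arguments
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
    --rlhf_type dpo \
    --model_type internvl2-8b \
    --model_id_or_path /data/SWIFT/InternVL/experience_2/Full_0913/checkpoint-2880 \
    --sft_type lora \
    --dataset /workspace/multi_modal_model/DPOdataset/v2_0920.json \
    --deepspeed default-zero3 \
    --max_length 2048 \
    --batch_size 1 \
    --gradient_checkpointing true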