The error output is as follows:
Traceback (most recent call last):
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/cli/rlhf.py", line 5, in <module>
rlhf_main()
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/utils/run_utils.py", line 32, in x_main
result = llm_x(args, **kwargs)
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
return trainer_train(
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/sft.py", line 456, in trainer_train
trainer.train(training_args.resume_from_checkpoint)
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 424, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 3318, in training_step
loss = self.compute_loss(model, inputs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1513, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1462, in get_batch_loss_metrics
reference_chosen_logps, reference_rejected_logps = self.concatenated_forward(
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 739, in concatenated_forward
return super().concatenated_forward(model, model_kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1391, in concatenated_forward
all_logps, size_completion = self.get_batch_logps(
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 744, in get_batch_logps
return super().get_batch_logps(logits, labels, *args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1343, in get_batch_logps
per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.56 GiB. GPU 0 has a total capacty of 39.38 GiB of which 5.37 GiB is free. Including non-PyTorch memory, this process has 34.01 GiB memory in use. Of the allocated memory 31.68 GiB is allocated by PyTorch, and 1.81 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Train: 23%|█████████████████████████████████████████████████████▉ | 60/266 [12:57<44:28, 12.96s/it]
The error with zero3:
Traceback (most recent call last):
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/cli/rlhf.py", line 5, in <module>
rlhf_main()
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/utils/run_utils.py", line 32, in x_main
result = llm_x(args, **kwargs)
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/rlhf.py", line 25, in llm_rlhf
return trainer_train(
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/sft.py", line 456, in trainer_train
trainer.train(training_args.resume_from_checkpoint)
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 424, in train
res = super().train(resume_from_checkpoint, *args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/transformers/trainer.py", line 3318, in training_step
loss = self.compute_loss(model, inputs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1513, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/trl/trainer/dpo_trainer.py", line 1462, in get_batch_loss_metrics
reference_chosen_logps, reference_rejected_logps = self.concatenated_forward(
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/trainers/mixin.py", line 716, in concatenated_forward
outputs = model(**model_kwargs, use_cache=False)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/peft/peft_model.py", line 1430, in forward
return self.base_model(
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/anaconda3/envs/ixc2/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
return self.model.forward(*args, **kwargs)
File "/home/workspace/multi_modal_model/Model/lnternVL-2.0/InternVL/ms-swift/swift/llm/utils/model.py", line 4587, in _new_func
res = _old_func(submodel, *args, **kwargs)
File "/home/.cache/huggingface/modules/transformers_modules/checkpoint-2880/modeling_internlm2.py", line 1082, in forward
logits = logits.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.42 GiB. GPU 4 has a total capacty of 39.38 GiB of which 4.98 GiB is free. Including non-PyTorch memory, this process has 34.39 GiB memory in use. Of the allocated memory 28.26 GiB is allocated by PyTorch, and 5.47 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Maybe you could try tuning this via device_map_config.
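For reference, a minimal sketch of how a custom device map and per-GPU memory budget are expressed in plain transformers/accelerate; whether and how ms-swift's device_map_config maps onto these arguments is an assumption here, so treat this only as an illustration of the underlying mechanism:

```python
# Sketch only: constraining an uneven model-parallel split with
# transformers/accelerate. The model id and the link to ms-swift's
# device_map_config option are assumptions, not confirmed by this thread.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "OpenGVLab/InternVL2-8B",                   # model id assumed for illustration
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",                          # let accelerate place the layers...
    max_memory={i: "35GiB" for i in range(8)},  # ...but cap each A100-40G below its limit
)
print(model.hf_device_map)                      # inspect the resulting placement
```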
Sorry to bother you. May I ask how it was solved? @bonre
This issue occurred in a previous, older version due to an incompatibility and should be resolved in the latest version; I can now fine-tune normally with zero3. If OOM still occurs, check whether the input samples are too long or whether GPU memory is simply insufficient for the model you are using. You can try setting max_length to 2048 or lower.
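To make the "samples too long" point concrete, here is a rough estimate of the tensor that fails to allocate in the tracebacks above (the full-vocabulary float32 logits / log_softmax in DPO's get_batch_logps). The vocabulary size (~92k for InternLM2) and the batch layout are assumptions; the point is only that this allocation scales linearly with sequence length, so halving max_length roughly halves it:

```python
# Rough estimate of the per-step logits / log_softmax memory in DPO training.
# vocab_size ~92k for InternLM2 and per_device_batch=1 are assumptions.
def dpo_logits_gib(seq_len: int, vocab_size: int = 92_544,
                   per_device_batch: int = 1, bytes_per_elem: int = 4) -> float:
    # concatenated_forward stacks chosen + rejected, doubling the batch,
    # and a float32 tensor of shape (2 * batch, seq_len, vocab_size)
    # is materialised for the log_softmax / .float() cast.
    elems = 2 * per_device_batch * seq_len * vocab_size
    return elems * bytes_per_elem / 1024**3

for seq_len in (4096, 2048, 1024):
    print(f"max_length={seq_len}: ~{dpo_logits_gib(seq_len):.1f} GiB "
          f"for one full-vocabulary log_softmax")
```

Note also that the allocator hint in the error message (PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb) only mitigates fragmentation; it cannot help when a single allocation is genuinely larger than the free memory on the device.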
Thank you very much for your work! I ran into the following problem when using DPO to train an InternVL2-8B model that had already been fully fine-tuned:
Below is my fine-tuning script:
The script above is the one that can currently at least partially run: during training, memory keeps climbing until OOM, so a full epoch cannot be completed. I tried MP+DDP, which throws an error (see issue); I tried deepspeed, but even up to zero3 it OOMs immediately and cannot train a single step. My GPU environment is 8*A100 40G. In addition, when using MP I set device_max_memory, but it seems to have no effect and the allocation is still uneven, as shown in the screenshot. Is this a bug, or what else is causing it? Thank you very much for helping me figure this out!
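One thing worth noting from the zero3 traceback above: ZeRO-3 shards parameters and optimizer state, but not activations, so the full-vocabulary logits tensor at `logits = logits.float()` is still materialised on each GPU. That is why zero3 alone may not prevent this particular OOM and why lowering max_length helps even under zero3. If parameter and optimizer memory are also tight, a CPU-offload variant of ZeRO-3 can be tried; the sketch below writes a generic DeepSpeed config, not the zero3 config shipped with ms-swift, and the file name and values are assumptions:

```python
# Generic DeepSpeed ZeRO-3 config with CPU offload -- a sketch under the
# assumptions stated above, not ms-swift's bundled zero3 config.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

# Pass this file's path to the trainer's deepspeed config option.
with open("zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```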