Open tisgotos opened 1 month ago
After modifying the pretrain_gpt code (applying patch.py: from fmoe.megatron.patch import patch_loss_func_v2_5, patch_forward_step), training on a single GPU fails with CUDA out of memory. Is there a known fix for this?
[before the start of training step] datetime: 2024-09-08 19:33:33
Traceback (most recent call last):
File "pretrain_gpt.py", line 128, in
An OOM error means the model or its intermediate activations are too large for the GPU. Try a smaller model.
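For reference, model size in Megatron-LM is controlled by command-line arguments passed to pretrain_gpt.py. A smaller configuration might look like the following (the values here are illustrative only; tune them to your GPU memory):

```shell
# Illustrative reduced-size settings for pretrain_gpt.py.
# Shrinking layers/hidden size reduces parameter memory;
# shrinking micro batch size and sequence length reduces activation memory.
python pretrain_gpt.py \
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --seq-length 512 \
    --micro-batch-size 1 \
    ...  # remaining data/optimizer arguments unchanged
```

Lowering `--micro-batch-size` is usually the cheapest first step, since it cuts activation memory without changing the model architecture.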
Problem encountered when running pretrain_gpt.py after applying the patch:
Traceback (most recent call last):
File "pretrain_gpt.py", line 126, in
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/workspace/Megatron-LM/megatron/training.py", line 157, in pretrain
iteration = train(forward_step_func,
File "/workspace/Megatron-LM/megatron/training.py", line 630, in train
train_step(forward_step_func,
File "/workspace/Megatron-LM/megatron/training.py", line 377, in train_step
losses_reduced = forward_backward_func(
File "/workspace/Megatron-LM/megatron/schedules.py", line 132, in forward_backward_no_pipelining
output_tensor, bal_loss = forward_step(forward_step_func, data_iterator, model,
File "/workspace/Megatron-LM/megatron/schedules.py", line 61, in forward_step
output_tensor, loss_func, bal_loss = forward_step_func(data_iterator, model)
ValueError: not enough values to unpack (expected 3, got 2)
pretrain_gpt source:
def forward_step(data_iterator, model):
    """Forward step."""
    args = get_args()
    timers = get_timers()
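The ValueError suggests a mismatch: the patched schedules.py unpacks three values (output_tensor, loss_func, bal_loss), but the forward_step shown above still returns only two, i.e. it was not wrapped by patch_forward_step. A minimal sketch of the mismatch, with hypothetical stand-in functions rather than real Megatron code:

```python
# Stand-in for an UNPATCHED forward_step: returns two values.
def forward_step_unpatched(data_iterator, model):
    output_tensor = "output"
    loss_func = lambda x: x
    return output_tensor, loss_func

# Stand-in for a PATCHED forward_step: also returns the balance loss
# that the FastMoE-patched schedule expects.
def forward_step_patched(data_iterator, model):
    output_tensor = "output"
    loss_func = lambda x: x
    bal_loss = 0.0  # MoE load-balancing loss added by the patch
    return output_tensor, loss_func, bal_loss

# The patched schedules.py does a three-way unpack:
#   output_tensor, loss_func, bal_loss = forward_step_func(...)
try:
    output_tensor, loss_func, bal_loss = forward_step_unpatched(None, None)
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 2)

# With the patched version the unpack succeeds.
output_tensor, loss_func, bal_loss = forward_step_patched(None, None)
```

So the likely fix is to make sure forward_step is actually wrapped (e.g. forward_step = patch_forward_step(forward_step)) before it is passed to pretrain, so that it returns the balance loss as a third value.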