Open fhcbvdjknl opened 1 month ago
Sometimes, resuming from the last checkpoint can help. For me, though, resuming and retraining only works for 1-2 epochs before the DivBackward0 error or a Segmentation Fault shows up again.
Thanks, but it doesn't work; I still hit this issue.
Thank you for your feedback. We have also observed similar issues; however, in our experiments, this phenomenon mainly occurs in the later stages of training, once the model has converged, and therefore does not impact final performance. Based on our experience, this issue is less likely to arise in the earlier stages of training. We would appreciate it if you could provide more detailed information (including hardware model, environment configuration, training logs, etc.), so we can conduct a comprehensive investigation into the root cause and work towards a resolution.
4090 with PyTorch 2.0.0 + CUDA 11.8. Training with 3DRes is stable and reaches a similar result, but training with 3DGRes runs into this issue. Here is the training log: 20241101_114911.log
/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
File "tools/train_3dgres.py", line 323, in
Similar environment: a 4090 with PyTorch 1.12.1+cu113. I have also tried PyTorch 1.13.1+cu116 with spconv-cu116 to fix a warning, but that did not help in the end.
I'm encountering the same issue: the training log is identical, and the crash happens at the 820th batch of the 3rd epoch. When I resume training, the same crash reappears at the 820th batch of the 5th epoch, and then again at the 820th batch of the 7th epoch. It seems clear that resuming with the same seed reproduces the problem, since the code fails every time it processes the 820th batch in those epochs. I haven't resolved this yet.
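For what it's worth, one generic way to avoid replaying exactly the same batch order after a resume is to derive the shuffle seed from the epoch number rather than reusing one fixed seed. The sketch below is standard PyTorch, not this repo's training loop; the dataset and the epoch/seed variables are placeholders:

```python
# Generic sketch (not this repo's code): seed the sampler per epoch so a
# resumed run does not replay the exact batch order that crashed.
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

train_dataset = TensorDataset(torch.randn(8, 3))  # stand-in for the real dataset
BASE_SEED, start_epoch, max_epochs = 42, 0, 2     # placeholder values

for epoch in range(start_epoch, max_epochs):
    gen = torch.Generator()
    gen.manual_seed(BASE_SEED + epoch)            # shuffle differs per epoch and per resume
    sampler = RandomSampler(train_dataset, generator=gen)
    loader = DataLoader(train_dataset, batch_size=2, sampler=sampler)
    for step, (batch,) in enumerate(loader):
        pass                                      # forward / backward would go here
```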
Adjusting the batch_size to 4 avoids the issue, but then the results in the paper cannot be reproduced: one metric comes out significantly lower than the paper's numbers. It is unclear whether this is caused by the batch_size. I have now changed the seed for batch_size=2 and am rerunning.
I'd like to ask you if this problem has been solved.
Sorry for the inconvenience. However, we are also puzzled as to why this error occurs so early in training. We spent some time reproducing it with the latest code on 3090 and A800 GPUs, obtained comparable performance, and found that the error tends to appear once the model has converged, i.e., when the loss and gradients are small. Below I will do my best to help you rule out possible causes:
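In case it helps narrow things down, here is a generic way to locate the operation behind the DivBackward0 NaN and to guard divisions whose denominator can reach zero. This is standard PyTorch debugging, not code from this repo, and `safe_div` is a hypothetical helper:

```python
# Generic debugging sketch (not this repo's code).
import torch

# Anomaly mode makes backward report the forward op that produced the NaN;
# the "Error detected in DivBackward0" traceback above suggests it may
# already be enabled, at the cost of slower training.
torch.autograd.set_detect_anomaly(True)

def safe_div(num, den, eps=1e-6):
    # Hypothetical helper: keep a non-negative denominator (e.g. a mask sum)
    # away from zero so the division and its backward stay finite.
    return num / den.clamp(min=eps)

# Example: averaging a per-point loss over a mask that may be empty.
per_point_loss = torch.rand(5, requires_grad=True)
mask = torch.zeros(5)                       # empty mask -> zero denominator
loss = safe_div((per_point_loss * mask).sum(), mask.sum())
loss.backward()                             # finite gradients instead of NaN
```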
I hit this issue when training on Multi3DRefer within 2-3 epochs. How can I resolve it?