Open fhcbvdjknl opened 1 month ago
Sometimes, resuming from the last checkpoint can help. For me, though, resuming and retraining only works for 1-2 epochs before the DivBackward0 error or a Segmentation Fault shows up again.
Thanks, but it doesn't work; I still hit this issue.
Thank you for your feedback. We have also observed similar issues; however, in our experiments, this phenomenon mainly occurs in the later stages of training, once the model has converged, and therefore does not impact final performance. Based on our experience, this issue is less likely to arise in the earlier stages of training. We would appreciate it if you could provide more detailed information (including hardware model, environment configuration, training logs, etc.), so we can conduct a comprehensive investigation into the root cause and work towards a resolution.
4090 with PyTorch 2.0.0 + CUDA 11.8. Training with 3DRes is stable and reaches a similar result, but training with 3DGRes runs into this issue. Here is the training log: 20241101_114911.log
/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
File "tools/train_3dgres.py", line 323, in
Similar environment: a 4090 with PyTorch 1.12.1+cu113. I have also tried PyTorch 1.13.1+cu116 with spconv-cu116 to fix a warning, but that did not help in the end.
I'm encountering the same issue: the training log is identical, and the crash happens at the 820th batch of the 3rd epoch. When I resume training, the same crash reappears at the 820th batch of the 5th epoch, and then again at the 820th batch of the 7th epoch. It seems clear that resuming with the same seed reproduces the problem, since the code fails every time it processes the 820th batch in those epochs. I haven't resolved this yet.
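For what it's worth, one generic way to avoid replaying exactly the same batch order after a resume is to derive the shuffle seed from the epoch number rather than reusing one fixed seed. The sketch below is standard PyTorch, not this repo's training loop; the dataset and the epoch/seed variables are placeholders:

```python
# Generic sketch (not this repo's code): seed the sampler per epoch so a
# resumed run does not replay the exact batch order that crashed.
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

train_dataset = TensorDataset(torch.randn(8, 3))  # stand-in for the real dataset
BASE_SEED, start_epoch, max_epochs = 42, 0, 2     # placeholder values

for epoch in range(start_epoch, max_epochs):
    gen = torch.Generator()
    gen.manual_seed(BASE_SEED + epoch)            # shuffle differs per epoch and per resume
    sampler = RandomSampler(train_dataset, generator=gen)
    loader = DataLoader(train_dataset, batch_size=2, sampler=sampler)
    for step, (batch,) in enumerate(loader):
        pass                                      # forward / backward would go here
```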
Adjusting the batch_size to 4 avoids the issue, but then the results in the paper cannot be reproduced: one metric comes out significantly lower than the paper's numbers. It is unclear whether this is caused by the batch_size. I have now changed the seed for batch_size=2 and am rerunning.
I'd like to ask you if this problem has been solved.
Sorry for the inconvenience. However, we are also puzzled as to why this error occurs so early in training. We spent some time reproducing it with the latest code on 3090 and A800 GPUs, obtained comparable performance, and found that the error tends to appear once the model has converged, i.e., when the loss and gradients are small. Below I will do my best to help you rule out possible causes:
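In case it helps narrow things down, here is a generic way to locate the operation behind the DivBackward0 NaN and to guard divisions whose denominator can reach zero. This is standard PyTorch debugging, not code from this repo, and `safe_div` is a hypothetical helper:

```python
# Generic debugging sketch (not this repo's code).
import torch

# Anomaly mode makes backward report the forward op that produced the NaN;
# the "Error detected in DivBackward0" traceback above suggests it may
# already be enabled, at the cost of slower training.
torch.autograd.set_detect_anomaly(True)

def safe_div(num, den, eps=1e-6):
    # Hypothetical helper: keep a non-negative denominator (e.g. a mask sum)
    # away from zero so the division and its backward stay finite.
    return num / den.clamp(min=eps)

# Example: averaging a per-point loss over a mask that may be empty.
per_point_loss = torch.rand(5, requires_grad=True)
mask = torch.zeros(5)                       # empty mask -> zero denominator
loss = safe_div((per_point_loss * mask).sum(), mask.sum())
loss.backward()                             # finite gradients instead of NaN
```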
I hit this issue when training on Multi3DRefer within 2-3 epochs. How can I resolve it?