sosppxo / MDIN

[MM2024 Oral] 3D-GRES: Generalized 3D Referring Expression Segmentation

RuntimeError in Training (Function 'DivBackward0' returned nan values in its 0th output and Segmentation Fault) #7

Open KESHEN-ZHOU opened 2 months ago

KESHEN-ZHOU commented 2 months ago

Has anyone encountered similar issues when training on Multi3DRefer? How can I fix this?

layer_3_mask_bce_loss: 0.0045, layer_3_mask_dice_loss: 0.0401, layer_3_sem_loss: 0.0509, layer_3_indi_loss: 0.5398, layer_4_score_loss: 0.0023, layer_4_mask_bce_loss: 0.0054, layer_4_mask_dice_loss: 0.0396, layer_4_sem_loss: 0.0365, layer_4_indi_loss: 0.5150, layer_5_score_loss: 0.0019, layer_5_mask_bce_loss: 0.0054, layer_5_mask_dice_loss: 0.0407, layer_5_sem_loss: 0.0393, layer_5_indi_loss: 0.5039, loss: 1.1819, grad_total_norm: 4.5220
/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in DivBackward0. 
Traceback of forward call that caused the error:                                                                                                
  File "tools/train_3dgres.py", line 323, in <module>                                                                                           
    gorilla.launch(                                                                                                                             
  File "/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/gorilla/core/launch.py", line 68, in launch                          
    main_func(*args)                                                                                                                            
  File "tools/train_3dgres.py", line 309, in main                                                                                               
    train(epoch, model, train_loader, optimizer, lr_scheduler, cfg, logger, writer)                                                             
  File "tools/train_3dgres.py", line 72, in train                       
    loss, log_vars = model(batch, mode='loss')                          
  File "/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)                               
  File "/home/Project/MDIN/gres_model/model/model.py", line 111, in forward                                                            
    return self.loss(**batch)                                           
  File "/home/Project/MDIN/gres_model/utils/utils.py", line 20, in wrapper                                                             
    return func(*new_args, **new_kwargs)                                
  File "/home/Project/MDIN/gres_model/model/model.py", line 132, in loss                                                               
    loss, loss_dict = self.criterion(out, gt_pmasks, gt_spmasks, sp_ref_masks, object_idss, sp_ins_labels, dense_maps, lang_masks, fps_seed_sp, sp_coords_float, batch_offsets)     
  File "/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)                               
  File "/home/Project/MDIN/gres_model/model/loss.py", line 950, in forward
      loss_i, loss_out_i = self.get_layer_loss(i, aux_outputs, pad_masks, target, indices, lang_masks, proj_tokens)
  File "/home/Project/MDIN/gres_model/model/loss.py", line 721, in get_layer_loss                                                      
    sem_loss = self.loss_sem_align(proj_tokens, proj_queries, lang_masks, target, indices, num_insts)                                           
  File "/home/Project/MDIN/gres_model/model/loss.py", line 510, in loss_sem_align                                                      
    torch.matmul(norm_img_emb, norm_text_emb.transpose(-1, -2))                                                                                 
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                                
Traceback (most recent call last):                                      
  File "tools/train_3dgres.py", line 323, in <module>                                                                                           
    gorilla.launch(                 
  File "/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/gorilla/core/launch.py", line 68, in launch
    main_func(*args)                
  File "tools/train_3dgres.py", line 309, in main                       
    train(epoch, model, train_loader, optimizer, lr_scheduler, cfg, logger, writer)                                                             
  File "tools/train_3dgres.py", line 86, in train                       
    loss.backward()                 
  File "/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)                                                          
  File "/home/miniconda3/envs/3d-gres/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass                                              
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
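
For anyone hitting the same thing: the anomaly trace points at the `torch.matmul(norm_img_emb, norm_text_emb.transpose(-1, -2))` call in `loss_sem_align`, so the NaN most likely comes from the division used to L2-normalize the embeddings (a zero-norm row makes `DivBackward0` produce NaN gradients). A minimal epsilon-guarded sketch of that normalization, with placeholder names rather than the repo's actual code:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(img_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             eps: float = 1e-6) -> torch.Tensor:
    # F.normalize clamps the norm at `eps`, so an all-zero embedding row yields
    # a zero vector instead of the 0/0 division whose backward returns NaN.
    norm_img_emb = F.normalize(img_emb, p=2, dim=-1, eps=eps)
    norm_text_emb = F.normalize(text_emb, p=2, dim=-1, eps=eps)
    return torch.matmul(norm_img_emb, norm_text_emb.transpose(-1, -2))
```

I have not checked whether `loss_sem_align` normalizes with a plain divide or already uses `F.normalize`; if it is the former, clamping the denominator (or switching to `F.normalize`) is the usual fix.
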
HyperbolicCurve commented 2 months ago

You can simply resume the training process by loading the checkpoint. @KESHEN-ZHOU
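
A generic PyTorch-style sketch of what that looks like; the key names below are assumptions rather than this repo's actual checkpoint format, and the project's own resume option (if it has one) is preferable:

```python
import torch
from torch import nn, optim

def resume_from_checkpoint(path: str, model: nn.Module,
                           optimizer: optim.Optimizer) -> int:
    """Restore model/optimizer state and return the epoch to continue from.

    The key names ("net", "optimizer", "epoch") are assumptions, not
    necessarily what this codebase writes out; adjust to the real format.
    """
    ckpt = torch.load(path, map_location="cpu")
    state_dict = ckpt.get("net", ckpt)  # fall back to a bare state_dict
    model.load_state_dict(state_dict)
    if "optimizer" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("epoch", 0) + 1
```

Note that resuming only helps if the NaN came from a transient bad batch; if the loss reliably diverges again, the checkpoint itself may already contain unstable weights.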

KESHEN-ZHOU commented 1 month ago

> You can simply resume the training process by loading the checkpoint. @KESHEN-ZHOU

Thanks. I had tried that before, but it didn't work; I still hit this error after 1-2 epochs.

KESHEN-ZHOU commented 1 month ago

I have also encountered the warnings below when training the model, and I am now trying to determine whether they are related to this issue.

The two issues below are related to the environment setup and should only impact training speed rather than accuracy. The warning below is tied to my CUDA version (an RTX 4090 with PyTorch 1.12.1 + CUDA 11.3 triggers it).

home/Project/MDIN/gres_model/model/loss.py:399: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`                                                                         
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/jit/codegen/cuda/manager.cpp:329.)                                                                  
  mask_dice_loss += dice_loss(pred_mask, tgt_mask.float())                                                                                                                                
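
For what it's worth, the warning itself names the switch for debugging it; a minimal sketch of setting it from Python before `torch` is imported (exporting the variable in the launch script, as the warning shows, should be equivalent):

```python
import os

# Disable the nvfuser codegen fallback path so the underlying codegen error
# surfaces instead of being silently worked around (debugging only; this is
# the variable suggested by the warning itself).
os.environ["PYTORCH_NVFUSER_DISABLE"] = "fallback"

import torch  # noqa: E402 -- imported after setting the env var on purpose
```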

I also found that when training with Multi3DRefer, I got a Segmentation Fault several times.

zaiquanyang commented 3 weeks ago

When reproducing the code, I got an error: ModuleNotFoundError: No module named 'pointnet2.pointnet2_utils'. Do you know how to solve it?

Liuuuuuyh commented 2 weeks ago

> When reproducing the code, I got an error: ModuleNotFoundError: No module named 'pointnet2.pointnet2_utils'. Do you know how to solve it?

We apologize for overlooking the PointNet++ module during the upload. We have updated the installation instructions to include the PointNet++ module.

KESHEN-ZHOU commented 2 weeks ago

> We apologize for overlooking the PointNet++ module during the upload. We have updated the installation instructions to include the PointNet++ module.

@Liuuuuuyh Thank you for your great work. However, I still hit errors (mostly Segmentation Faults) when reproducing the 3D-GRES results, and training fails within 2-3 epochs. Resuming from checkpoints or adjusting the config (e.g., batch size, number of workers) doesn't help:

scripts/train_3dgres.sh: line 2: 289547 Segmentation fault (core dumped) python tools/train_3dgres.py configs/default_3dgres.yaml --gpu_ids 0 --num_gpus 1

Do you have any suggestions on this? Training on 3D-RES is more stable and yields a similar result.
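
In case it helps anyone narrow down the Segmentation Fault, this is the kind of minimal, repo-agnostic debugging setup I am trying: enable the standard-library `faulthandler` and run the data loader single-process (`num_workers=0`); the toy dataset below is only a stand-in for the real Multi3DRefer loader:

```python
import faulthandler

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dump a Python traceback when the process receives SIGSEGV, which at least
# shows whether the crash happens in a dataloader worker or inside a C++/CUDA op.
faulthandler.enable()

# Toy dataset standing in for the real loader; the relevant part is
# num_workers=0, which rules out worker-process / shared-memory issues.
dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))
loader = DataLoader(dataset, batch_size=2, num_workers=0, pin_memory=False)

for points, labels in loader:
    pass  # substitute one epoch of the real training loop here
```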

Also, has anyone else been able to reproduce the results on Multi3DRefer? @zaiquanyang @HyperbolicCurve

alala521 commented 1 week ago

> @Liuuuuuyh Thank you for your great work. However, I still hit errors (mostly Segmentation Faults) when reproducing the 3D-GRES results, and training fails within 2-3 epochs. [...] Training on 3D-RES is more stable and yields a similar result.

Are the metrics you reproduced the same as those in the original paper? The results I reproduced are significantly different from those in the paper, especially in the "zt w/ dis" category.

KESHEN-ZHOU commented 1 week ago

> Are the metrics you reproduced the same as those in the original paper? The results I reproduced are significantly different from those in the paper, especially in the "zt w/ dis" category.

I think my reproduction with 3D-GRES has run into issues: training fails within 10 epochs and does not converge. May I ask what training environment you used? My 3D-RES result is similar, even slightly higher than the one reported in the paper.

alala521 commented 1 week ago

> I think my reproduction with 3D-GRES has run into issues: training fails within 10 epochs and does not converge. May I ask what training environment you used? My 3D-RES result is similar, even slightly higher than the one reported in the paper.

3090 + 13.1. However, I changed the batch_size to 4 to avoid this problem, but the reproduced metrics are poor. When I use batch_size=2, the code crashes every 3 rounds. I don't know where the problem in the data processing is. If it's convenient, you can add me on WeChat to make communication easier: 13280074057.