[Open] KESHEN-ZHOU opened this issue 2 months ago
You can simply resume the training process by loading the checkpoint. @KESHEN-ZHOU
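For reference, a minimal resume sketch in plain PyTorch (a generic pattern with a hypothetical checkpoint path and keys, not necessarily this repo's exact checkpoint format or resume option):

```python
import torch
import torch.nn as nn

# Generic PyTorch resume pattern (hypothetical path and keys; adapt to the
# repo's own checkpoint format and resume handling).
model = nn.Linear(8, 2)                                   # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

ckpt = torch.load("work_dirs/default_3dgres/latest.pth", map_location="cpu")  # hypothetical path
model.load_state_dict(ckpt["model"])                      # assumes weights stored under "model"
optimizer.load_state_dict(ckpt["optimizer"])              # assumes optimizer state stored under "optimizer"
start_epoch = ckpt.get("epoch", 0) + 1                    # continue training from the next epoch
```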
Thanks. I have tried that before, but it didn't work; I still encountered this error after 1-2 epochs.
I have also encountered the warnings below when training the model; I am now working on determining whether they relate to this issue.
The two issues below are related to the environment setup and should only impact training performance rather than accuracy. The warning below is related to my CUDA version (a 4090 with PyTorch 1.12.1 + CUDA 11.3 produces this warning).
home/Project/MDIN/gres_model/model/loss.py:399: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/jit/codegen/cuda/manager.cpp:329.)
mask_dice_loss += dice_loss(pred_mask, tgt_mask.float())
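For anyone who wants to rule the fuser out, a hedged sketch using the environment variable quoted in the warning plus a TorchScript switch (standard PyTorch knobs, nothing MDIN-specific):

```python
import os

# Per the warning text: disable the codegen fallback so the underlying
# nvFuser failure surfaces instead of silently falling back (debugging only).
# Must be set before torch is imported.
os.environ["PYTORCH_NVFUSER_DISABLE"] = "fallback"

import torch

# Alternatively, turn nvFuser off for TorchScript entirely
# (a private switch available in 1.12.x-era builds).
torch._C._jit_set_nvfuser_enabled(False)
```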
I also found that when training with Multi3DRefer, I got a Segmentation Fault several times.
When I reproduced the code, I met an error: ModuleNotFoundError: No module named 'pointnet2.pointnet2_utils'. Do you know how to solve it?
We apologize for overlooking the PointNet++ module during the upload. We have updated the installation instructions to include information regarding the PointNet++ module.
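In case it helps others hitting the same import error, a quick sanity check after building the extension (hypothetical build step; follow the updated installation notes for the exact commands):

```python
# Quick sanity check that the compiled PointNet++ extension is importable
# (assumes it was built from the repo's pointnet2 directory per the updated
# installation notes, e.g. by running its setup.py).
try:
    from pointnet2 import pointnet2_utils
    print("pointnet2_utils found at:", pointnet2_utils.__file__)
except ModuleNotFoundError as err:
    print("PointNet++ extension is not installed yet:", err)
```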
@Liuuuuuyh Thank you for your great work. However, I still met errors (mostly segmentation faults) when reproducing the results for 3DGRES, and the training fails within 2-3 epochs. Resuming from checkpoints or adjusting the config (batch size or num_workers, for example) doesn't help:
scripts/train_3dgres.sh: line 2: 289547 Segmentation fault (core dumped) python tools/train_3dgres.py configs/default_3dgres.yaml --gpu_ids 0 --num_gpus 1
Do you have any suggestions on that? Training with 3DRes is more stable and gets a similar result.
Also, I am wondering whether anyone else has managed to reproduce the results on Multi3DRefer? @zaiquanyang @HyperbolicCurve
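One way to get a Python-level traceback out of the segfault (standard CPython faulthandler, not specific to this repo) is to run the trainer with `python -X faulthandler tools/train_3dgres.py ...`, or to enable it early in the script:

```python
import faulthandler

# Dump a Python-level traceback when the process receives SIGSEGV,
# which helps narrow the crash down to the dataloader workers, a CUDA
# extension, or the training loop. Call this near the top of tools/train_3dgres.py.
faulthandler.enable(all_threads=True)
```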
Are the metrics you reproduced the same as those in the original paper? The results I reproduced are significantly different from those in the paper, especially in the "zt w/ dis" category.
I think I have encountered issues with my 3DGRES reproduction: my training failed within 10 epochs and didn't converge. May I know the training environment you used? The 3D RES result is similar to the paper's, even slightly higher.
3090+13.1. However, I changed the batch_size to 4 to avoid this problem, but the reproduced metrics are poor. When I use batch_size=2, the code crashes every 3 rounds. I don't know where in the data processing the problem is. If it's convenient, you can add me on WeChat to make communication easier: 13280074057.
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output. Hello, I'd like to ask if this problem has been solved. @KESHEN-ZHOU
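A hedged sketch for chasing this kind of NaN (standard PyTorch anomaly detection plus an epsilon-guarded dice denominator; the actual dice_loss in gres_model/model/loss.py may differ):

```python
import torch

# Surface the forward operation that produced the NaN instead of only seeing it in backward().
torch.autograd.set_detect_anomaly(True)

def dice_loss(pred_mask: torch.Tensor, tgt_mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Epsilon-stabilized dice loss: the eps in the denominator keeps DivBackward0
    # from producing NaN when a prediction/target pair is all zeros.
    # (Illustrative only -- not necessarily the repo's exact dice_loss.)
    pred = pred_mask.sigmoid().flatten(1)
    tgt = tgt_mask.flatten(1)
    numerator = 2 * (pred * tgt).sum(-1)
    denominator = pred.sum(-1) + tgt.sum(-1)
    return (1.0 - (numerator + eps) / (denominator + eps)).mean()
```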
Has anyone encountered similar issues when training on Multi3DRefer? How can it be fixed?