Find extremely big pts_loss output when trying to set cudatoolkit=11.2

visinf / multi-mono-sf

Self-Supervised Multi-Frame Monocular Scene Flow (CVPR 2021)

Apache License 2.0

Find extremely big pts_loss output when trying to set cudatoolkit=11.2 #7

Closed: Yang-Hao-Lin closed this issue 2 years ago

Yang-Hao-Lin commented 2 years ago

Hi Junhwa, thanks a lot for the awesome work in the field of scene flow :) When I tried to use your loss in my environment (cudatoolkit=11.2), the pts_loss (s_3) became extremely big, for example, bigger than 100000. But when I set the cudatoolkit version to 10.2, the pts_loss output became normal (usually less than 10). I cannot figure out the reason. Have you ever run into this kind of situation? Again, thanks a lot.

hurjunhwa commented 2 years ago

Hi, thank you for your interest in our work! I think this happens when the disparity decoder is not properly trained, and there can be multiple reasons, such as training instability, invalid inputs, etc. I would first test the pre-trained model with cudatoolkit 11.2 and check whether the scene flow accuracy matches the baseline's.

Unit-testing the CUDA-dependent modules may also be necessary. Please try the Python implementation of the correlation layer (--correlation_cuda_enabled=False) and also check whether softsplat works well.

If it still doesn't work, please let me know!
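For anyone wanting to unit-test a CUDA-dependent module as suggested above, one option is to run the CUDA path and the Python path on the same input and compare the results element by element. The following is only a minimal sketch of such a parity check, not code from this repository: `compare_outputs` and `ParityReport` are hypothetical names, the tolerances are arbitrary assumptions, and the inputs are assumed to be flat sequences of floats (for example, a tensor flattened and converted to a list).

```python
import math
from typing import Iterable, NamedTuple


class ParityReport(NamedTuple):
    """Summary of an element-wise comparison (hypothetical helper type)."""
    ok: bool
    max_abs_diff: float
    num_mismatched: int
    num_non_finite: int


def compare_outputs(reference: Iterable[float],
                    candidate: Iterable[float],
                    rtol: float = 1e-4,
                    atol: float = 1e-5) -> ParityReport:
    """Compare a candidate output (e.g. CUDA path) with a reference (e.g. Python path).

    The tolerances are assumptions for illustration; pick values that suit
    the numerical precision of the module under test.
    """
    ref = list(reference)
    cand = list(candidate)
    if len(ref) != len(cand):
        raise ValueError(f"length mismatch: {len(ref)} vs {len(cand)}")

    max_abs_diff = 0.0
    num_mismatched = 0
    num_non_finite = 0
    for r, c in zip(ref, cand):
        # NaN or inf in either output is counted separately from mismatches.
        if not (math.isfinite(r) and math.isfinite(c)):
            num_non_finite += 1
            continue
        diff = abs(r - c)
        max_abs_diff = max(max_abs_diff, diff)
        # Same criterion shape as the usual "allclose" test.
        if diff > atol + rtol * abs(r):
            num_mismatched += 1

    ok = num_mismatched == 0 and num_non_finite == 0
    return ParityReport(ok, max_abs_diff, num_mismatched, num_non_finite)
```

With PyTorch tensors, `torch.allclose(a, b, rtol=..., atol=...)` gives a similar yes/no answer directly; the sketch above additionally reports how far apart the two outputs are, which helps when the symptom is a value that is off by several orders of magnitude.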

Yang-Hao-Lin commented 2 years ago

Hi Junhwa. Thanks a lot for your attention! The server administrator of my lab upgraded my GPU from a Tesla P40 to an RTX A40 yesterday. I tested again with cudatoolkit=11.2, and now the pts_loss output is within a reasonable range. So, now the situation is:

- Test on GPU P40, cudatoolkit=10.2 -> pts_loss is within a reasonable range.
- Test on GPU P40, cudatoolkit=11.2 -> pts_loss is extremely big, whether correlation_cuda_enabled is False or True, even with a supervised pretrained model.
- Test on GPU A40, cudatoolkit=11.2 -> pts_loss is within a reasonable range.

It is weird, but there seems to be some compatibility issue between Tesla P40 and cudatoolkit=11.2.
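The thread does not identify the root cause. One thing that could be checked in a case like this, purely as an assumption and not a diagnosis, is whether the installed binaries were built for the GPU's compute capability (6.1 for the Tesla P40, 8.6 for the A40). The sketch below uses a hypothetical helper, `build_covers_device`, and applies a simplified version of the CUDA compatibility rules: an `sm_XY` binary runs on a device with the same major version and an equal or higher minor version, and a `compute_XY` (PTX) entry can be JIT-compiled for any device of equal or higher capability.

```python
from typing import Iterable, Optional, Tuple

Capability = Tuple[int, int]


def parse_arch(flag: str) -> Optional[Tuple[str, Capability]]:
    """Split 'sm_61' or 'compute_80' into (kind, (major, minor)).

    Returns None for flags this sketch does not understand (e.g. 'sm_90a').
    """
    kind, _, digits = flag.partition("_")
    if kind not in ("sm", "compute") or len(digits) < 2 or not digits.isdigit():
        return None
    # The last digit is the minor version, everything before it the major.
    return kind, (int(digits[:-1]), int(digits[-1]))


def build_covers_device(device_cc: Capability, arch_flags: Iterable[str]) -> bool:
    """Return True if any arch flag can serve a device of capability device_cc.

    Simplified rules (an assumption of this sketch, see the lead-in).
    """
    for flag in arch_flags:
        parsed = parse_arch(flag)
        if parsed is None:
            continue  # skip unrecognized flags rather than guessing
        kind, (major, minor) = parsed
        if kind == "sm" and major == device_cc[0] and minor <= device_cc[1]:
            return True
        if kind == "compute" and (major, minor) <= device_cc:
            return True
    return False


# Hypothetical arch list for illustration only; it is NOT taken from the thread.
example_build = ["sm_70", "sm_75", "sm_80", "sm_86"]
p40_covered = build_covers_device((6, 1), example_build)
a40_covered = build_covers_device((8, 6), example_build)
```

With PyTorch installed, the real inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. Whether this explains the behavior reported here is unknown; it only rules one possible cause in or out.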

hurjunhwa commented 2 years ago

Thank you for sharing these results!