Something wrong in training --NaN or Inf found in input tensor.

liuyuan-pal / Gen6D

[ECCV2022] Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images

GNU General Public License v3.0

592 stars, 74 forks

Something wrong in training --NaN or Inf found in input tensor. #68

Open dengxy2000 opened 1 year ago

dengxy2000 commented 1 year ago

Thanks for your excellent work! When I run the source code to train models on my own machine (RTX 3090), something goes wrong with the loss value. It does not seem to be a problem with the dataset. Could you help me with this? (screenshot: 微信图片_20230406175655) When I deploy it on another machine with a P5000, this does not happen. Is it a problem with the GPU? @liuyuan-pal

liuyuan-pal commented 1 year ago

I'm not sure about this either. You could check at which step the NaN first appears.

if torch.sum(torch.isnan(some_tensors))>0:
    import ipdb; ipdb.set_trace()
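
To find the step where the loss first goes bad without stopping in a debugger, one option is to record the scalar loss at every step and scan it afterwards. The following is only a minimal sketch: `first_nonfinite_step` and `history` are hypothetical names, not part of Gen6D, and the values are made up to stand in for `loss.item()` collected during training.

```python
import math

def first_nonfinite_step(losses):
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None

# Made-up loss values standing in for `loss.item()` recorded at each step.
history = [0.91, 0.74, 0.66, float("nan"), float("nan")]
print(first_nonfinite_step(history))  # prints 3
```

Once the step is known, the breakpoint above can be limited to that step so the inputs and intermediate tensors can be inspected there.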

dengxy2000 commented 1 year ago

I'm not sure about this either. You could check at which step the NaN first appears.
if torch.sum(torch.isnan(some_tensors))>0:
    import ipdb; ipdb.set_trace()

Thanks for your help, haha. The problem is solved: after I downgraded torch from 1.12 to 1.10, training works normally.

dengxy2000 commented 1 year ago

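Hello, I have one more question. When I train my refiner with the default parameters in the source code, I cannot reach accuracy comparable to the paper, even after increasing total_step to 600k. My detector and selector should be fine. Would you mind sharing the training details of the refiner? With the default configuration, did you get the best_model after only 260000 steps? Thanks! @liuyuan-pal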

liuyuan-pal commented 1 year ago

I use 200 cat images as a validation set, and then pick the best refiner according to its performance on that validation set.
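
Picking a checkpoint by validation performance could look roughly like the sketch below. It assumes each saved checkpoint has already been evaluated on the validation images and that a higher score is better. `select_best_checkpoint` is a hypothetical helper, and the steps and scores are invented for illustration, not results from this repository.

```python
def select_best_checkpoint(val_scores):
    """Return the (step, score) pair with the highest validation score."""
    if not val_scores:
        raise ValueError("no checkpoints were evaluated")
    return max(val_scores.items(), key=lambda item: item[1])

# Invented validation scores keyed by training step (higher is better).
val_scores = {10000: 0.81, 20000: 0.88, 30000: 0.85}
best_step, best_score = select_best_checkpoint(val_scores)
print(best_step, best_score)  # prints 20000 0.88
```

With this kind of selection the best model is not necessarily the last one, so training for more steps does not by itself give a better checkpoint.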

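The projection-2d metric is relatively easy to reproduce, but the ADD metric has a lot of randomness: a difference of one or two pixels can change the result drastically, so ADD is not very stable and is hard to reproduce.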
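
One possible way to see this sensitivity is a small depth error: it barely moves the projected points but shifts every 3D point by the full amount. The sketch below uses the usual definitions (ADD as the mean 3D distance between model points under the two poses, projection-2d as the mean pixel distance of their projections) with the commonly used thresholds of 10% of the object diameter and 5 pixels. The intrinsics, the cube, and the poses are made-up values, rotation is assumed to be identity for brevity, and this is not the evaluation code of Gen6D.

```python
import math

# Hypothetical pinhole intrinsics; not taken from any dataset used by Gen6D.
FOCAL = 600.0           # focal length in pixels
CX, CY = 320.0, 240.0   # principal point in pixels

# Corners of a 10 cm cube centred at the object origin (metres).
POINTS = [(x, y, z) for x in (-0.05, 0.05)
                    for y in (-0.05, 0.05)
                    for z in (-0.05, 0.05)]

def transform(p, t):
    """Apply a translation-only pose (rotation assumed to be identity)."""
    return (p[0] + t[0], p[1] + t[1], p[2] + t[2])

def project(p):
    """Project a camera-frame 3D point to pixel coordinates."""
    return (FOCAL * p[0] / p[2] + CX, FOCAL * p[1] / p[2] + CY)

def add_error(t_gt, t_pred):
    """ADD: mean 3D distance between model points under the two poses."""
    dists = [math.dist(transform(p, t_gt), transform(p, t_pred)) for p in POINTS]
    return sum(dists) / len(dists)

def proj2d_error(t_gt, t_pred):
    """Projection-2d: mean pixel distance between projected model points."""
    dists = [math.dist(project(transform(p, t_gt)), project(transform(p, t_pred)))
             for p in POINTS]
    return sum(dists) / len(dists)

t_gt = (0.0, 0.0, 1.0)     # object 1 m in front of the camera
t_pred = (0.0, 0.0, 1.03)  # 3 cm error along the depth axis
diameter = math.dist(POINTS[0], POINTS[-1])

print(proj2d_error(t_gt, t_pred) < 5.0)          # within the 5-pixel threshold
print(add_error(t_gt, t_pred) < 0.1 * diameter)  # within 10% of the diameter?
```

In this setup the mean reprojection error stays between one and two pixels, so the pose passes projection-2d, while the ADD error of 3 cm exceeds 10% of the roughly 17 cm diameter and the same pose fails ADD.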