mihaidusmanu / d2-net

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

Loss is NaN #78

Closed: csyhy1986 closed this issue 3 years ago

csyhy1986 commented 3 years ago

Thanks for your wonderful work. I trained D2-Net on a new dataset and ran into a 'nan' loss. I know someone else hit this problem (https://github.com/mihaidusmanu/d2-net/issues/57), and I tried his solution, but it did not work. So I debugged the source code and found that the problem is caused by a division by zero in model.py: `depth_wise_max_score = batch / depth_wise_max.unsqueeze(1)`. I modified this line to `depth_wise_max_score = batch / (depth_wise_max.unsqueeze(1) + 1e-8)`, and the nan problem did not happen again. However, I have a question: my fix is purely mathematical; is there an intended default behaviour when some elements of depth_wise_max are zero?
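A minimal numpy sketch of the failure mode and the epsilon fix (a stand-in for the PyTorch tensors in model.py; the toy feature map and shapes are made up for illustration):

```python
import numpy as np

np.random.seed(0)

# Toy feature map of shape (B, C, H, W); one spatial location has an
# all-zero descriptor across channels, as can happen on textureless input.
batch = np.random.rand(1, 4, 2, 2).astype(np.float32)
batch[0, :, 0, 0] = 0.0  # zero descriptor at spatial position (0, 0)

depth_wise_max = batch.max(axis=1)  # shape (B, H, W)

# Original formulation: 0 / 0 -> nan at the zero descriptor.
with np.errstate(invalid="ignore"):
    score_naive = batch / depth_wise_max[:, None, :, :]
print(np.isnan(score_naive).any())  # True

# Epsilon-stabilised version from this issue: 0 / 1e-8 -> 0, no nan.
score_safe = batch / (depth_wise_max[:, None, :, :] + 1e-8)
print(np.isnan(score_safe).any())  # False
```

Once a nan enters the score map it propagates through the loss and every subsequent gradient step, which is why the whole loss turns nan rather than a single pixel.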

mihaidusmanu commented 3 years ago

Hello. Adding epsilon to these divisions seems like a good solution to me. It is quite unlikely that a descriptor is perfectly 0, but it seems to happen sometimes during training. I think it depends a lot on the dataset / backbone architecture you are using: when training a VGG backbone on MegaDepth we never ran into this. I don't think there's a way to avoid this behaviour completely.

Some recent follow-ups suggest switching to a different formulation using softplus which solves the issue. Please refer to https://arxiv.org/pdf/2003.10071.pdf for more details (notably Eq. 11).
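To illustrate why a softplus-based score side-steps the problem (this is only a numpy sketch of the mechanism, not the exact Eq. 11 from the paper): softplus is strictly positive everywhere, so any normalisation built on it never divides by zero, even for an all-zero feature map.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

# Worst case: an all-zero feature map of shape (B, C, H, W).
batch = np.zeros((1, 4, 2, 2), dtype=np.float32)

scores = softplus(batch)          # strictly positive: softplus(0) = log(2)
norm = scores / scores.sum(axis=(2, 3), keepdims=True)
print(np.isnan(norm).any())       # False
```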

csyhy1986 commented 3 years ago

> Hello. Adding epsilon to these divisions seems like a good solution to me. It is quite unlikely that a descriptor is perfectly 0, but it seems to happen sometimes during training. I think it depends a lot on the dataset / backbone architecture you are using: when training a VGG backbone on MegaDepth we never ran into this. I don't think there's a way to avoid this behaviour completely.
>
> Some recent follow-ups suggest switching to a different formulation using softplus which solves the issue. Please refer to https://arxiv.org/pdf/2003.10071.pdf for more details (notably Eq. 11).

Thanks for replying. I think the problem may be caused by the training data, because I noticed that it only happened on images of calm water surfaces. These images were captured by remote-sensing sensors, so every pixel in them has nearly the same gray level, which may cause the descriptors to become zero vectors.
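A toy numpy illustration of this hypothesis (the filter and image are made up; the point is that any zero-sum filter followed by ReLU produces exactly zero features on a constant patch, which makes the depth-wise max zero):

```python
import numpy as np

# A constant "calm water" patch: every pixel has the same gray level.
img = np.full((8, 8), 0.5, dtype=np.float32)

# A zero-sum filter (e.g. a gradient filter) responds with exactly 0 on
# a constant signal, so the ReLU features after it are all zero.
kernel = np.array([[-1.0, 1.0]], dtype=np.float32)

# Valid 1D horizontal convolution of the constant image with the kernel.
resp = img[:, 1:] * kernel[0, 1] + img[:, :-1] * kernel[0, 0]
relu = np.maximum(resp, 0.0)
print(relu.max())  # 0.0
```

With all features zero, `depth_wise_max` is zero at those locations, which triggers exactly the 0/0 division discussed above.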