greatlog / DAN

This is an official implementation of Unfolding the Alternating Optimization for Blind Super Resolution

Problem of loss ‘NAN’ value during training #8

Closed YuqiangY closed 3 years ago

YuqiangY commented 3 years ago

Thanks so much for your work. As the title says, the loss value suddenly becomes NaN at 22,800 iterations (second run: 29,200). Have you ever encountered this kind of error?

greatlog commented 3 years ago

Yes, this situation occurs sometimes. A workaround is to resume from the last normal training state.
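A minimal sketch of the workaround described above: checkpoint periodically and, when the loss turns NaN, roll back to the last good state instead of continuing. The names here (`train_step`, model state as a plain dict) are hypothetical stand-ins for illustration, not the actual DAN training code.

```python
# Hypothetical sketch of the resume-from-last-good-state workaround.
# train_step is any callable that runs one iteration and returns
# (loss, new_state); state is a placeholder for model/optimizer state.
import copy
import math

def train(steps, train_step, state, checkpoint_every=100):
    last_good = copy.deepcopy(state)          # last known-good snapshot
    for step in range(steps):
        loss, state = train_step(state)
        if math.isnan(loss):
            state = copy.deepcopy(last_good)  # roll back past the NaN step
            continue
        if step % checkpoint_every == 0:
            last_good = copy.deepcopy(state)  # refresh the snapshot
    return state
```

In a real PyTorch loop the snapshot would be a saved checkpoint (`torch.save` of `model.state_dict()` and the optimizer state) rather than an in-memory copy.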

YuqiangY commented 3 years ago

> Yes, this situation occurs sometimes. A workaround method is to resume from the last normal training state.

Thanks, that is how I solved it. However, do you have any clues about the cause of this error?

greatlog commented 3 years ago

Sorry, I have not figured it out yet. If you have any ideas please tell me. Thank you.

YuqiangY commented 3 years ago

> Sorry, I have not figured it out yet. If you have any ideas please tell me. Thank you.

I couldn't find the root of the problem either. In the last 20 hours, this error has occurred several times, especially once the number of iterations exceeds 115,000. Is that common?

greatlog commented 3 years ago

In my case, it occurred twice over 400,000 iterations. The frequency seems random. It may be an inherent drawback of the proposed method, as DAN is effectively a recurrent neural network (RNN). Maybe I can borrow some ideas from RNNs to stabilize the training.
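One standard RNN stabilization trick of the kind mentioned above is gradient clipping by global norm: if the combined gradient norm exceeds a threshold, every gradient is scaled down so the norm equals the threshold. This is a plain-Python sketch of the idea (in PyTorch the equivalent is `torch.nn.utils.clip_grad_norm_`); whether it fixes DAN's NaN issue is an open question, not something the thread confirms.

```python
# Sketch of gradient clipping by global norm, a common fix for
# exploding gradients in recurrent networks. grads is a flat list of
# gradient values standing in for a model's parameter gradients.
import math

def clip_grad_norm(grads, max_norm):
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm   # shrink all gradients uniformly
        grads = [g * scale for g in grads]
    return grads, total_norm
```

In a training loop this would run after `loss.backward()` and before the optimizer step, so a single exploding batch cannot blow up the weights.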

YuqiangY commented 3 years ago

> In my case, it occurs twice during 400000 iterations. The frequency seems to be random. Maybe it is an inherent drawback of the proposed method, as DAN is actually a recurrent neural network (RNN). Maybe I can borrow some ideas from RNNs to stabilize the training.

OK, thanks for your reply.