weights become NaN during training

MagicS0ng commented 1 year ago

Hello huzi, I added some code to the official Coarse2Fine-PyTorch code you released, but training did not run well: after some epochs, the weights became NaN. I tracked down where the error happens. It turns out that the variable stds in GaussianModel sometimes contains 0, so I declared a variable eps=1e-8 and added it to stds when stds contains 0. I also found that the official code already declares and initializes eps=1e-6 but never uses it. I guess you may have foreseen this kind of error, so I came here for some help. : )
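For reference, here is a minimal sketch of the guard described above, written in plain Python rather than copied from the repository. It assumes the likelihood of a quantized value is the difference of two Gaussian CDFs half a step apart; the names `gaussian_cdf` and `guarded_likelihood` are hypothetical, and adding eps only where std is exactly 0 is just one reading of the fix.

```python
import math

EPS = 1e-8  # value used in the comment above; the repo reportedly declares 1e-6


def gaussian_cdf(x, mean, std):
    """CDF of a normal distribution N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def guarded_likelihood(x, mean, std, eps=EPS):
    """Probability mass of the bin [x - 0.5, x + 0.5] (hypothetical helper).

    Assumption: eps is added only when std is exactly 0, so the division
    inside gaussian_cdf never sees a zero denominator.
    """
    if std == 0:
        std = std + eps
    upper = gaussian_cdf(x + 0.5, mean, std)
    lower = gaussian_cdf(x - 0.5, mean, std)
    return upper - lower


if __name__ == "__main__":
    print(guarded_likelihood(0.0, 0.0, 1.0))  # ordinary case
    print(guarded_likelihood(0.0, 0.0, 0.0))  # std == 0, handled by the guard
```

Without the `if std == 0` branch, the second call would divide by zero. Plain Python raises `ZeroDivisionError` there, while floating-point tensor arithmetic would silently produce inf or NaN instead.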

huzi96 commented 1 year ago

Hi, so did you fix the problem by adding eps=1e-8? What help do you need now?

MagicS0ng commented 1 year ago

That worked, but I regard it as a workaround: it only avoids the case where stds contains 0. I want to understand the root cause of the NaN. Could you please help with that?

huzi96 commented 1 year ago

This sigma (your stds) is end-to-end optimized, so nothing prevents it from getting very close to zero in some cases, and that is where you see NaN. I think adding an eps is a valid solution (a lot of people actually do that), so maybe you should just use it.
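To illustrate why a sigma near zero is a problem, here is a small sketch in plain Python. It assumes the model standardizes its input as (x - mean) / std, which is not confirmed by this thread; the function names are hypothetical. The derivative of that quotient with respect to std is -(x - mean) / std^2, so gradients grow like 1/std^2 as std shrinks.

```python
def standardized(x, mean, std):
    """z-score of x under N(mean, std^2) (assumed form of the model's input)."""
    return (x - mean) / std


def d_standardized_d_std(x, mean, std):
    """Analytic derivative of the z-score with respect to std."""
    return -(x - mean) / (std * std)


if __name__ == "__main__":
    # The offset 0.5 corresponds to the upper edge of a unit-width bin
    # centered on the mean.
    for std in (1e-1, 1e-3, 1e-6):
        z = standardized(0.5, 0.0, std)
        grad = d_standardized_d_std(0.5, 0.0, std)
        print(f"std={std:g}  z={z:g}  dz/dstd={grad:g}")
```

At std exactly 0, IEEE 754 floating-point division (which tensor libraries follow) gives inf for a nonzero numerator and NaN for 0/0, and once one NaN enters a gradient it spreads to the weights on the next update. Keeping std away from zero with an eps avoids that division.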

MagicS0ng commented 1 year ago

Really appreciate your help!

huzi96 / Coarse2Fine-PyTorch

weights become NaN during training #16