castelo-software opened this issue 5 years ago
Maybe it's caused by the reshape(-1) before calculating the L1 loss.
@cch98 Removing reshape(-1) has no effect.
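That matches what you would expect: flattening a tensor before an element-wise L1 loss does not change its value, so the reshape is unlikely to be the culprit. A minimal check with illustrative shapes (not the repo's actual tensors):

```python
# Sanity check: L1 loss with default 'mean' reduction is the same
# whether or not the tensors are flattened first.
import torch
import torch.nn.functional as F

x = torch.randn(4, 512)  # e.g. predicted embeddings (hypothetical shape)
y = torch.randn(4, 512)  # e.g. target embeddings

loss_flat = F.l1_loss(x.reshape(-1), y.reshape(-1))
loss_orig = F.l1_loss(x, y)

print(torch.allclose(loss_flat, loss_orig))  # True: same mean absolute error
```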
It's not a problem with the reshaping. The core of the problem is that the standard deviation computed in the AdaIN layer contains a square root that sometimes gets an invalid input and therefore has a NaN gradient.
It's not an error directly in the Loss_MCH function as far as I know, but activating it causes a chain reaction that results in the embedding vector having dangerous values. I tried to fix it by adding ReLUs in the generator, but that hasn't done the trick.
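For context, a minimal sketch of the failure mode described above (illustrative tensors, not the repo's code): the derivative of sqrt(x) is 0.5 / sqrt(x), which blows up at x = 0, so a zero-variance feature map produces a NaN gradient even though the forward pass looks fine.

```python
# Why an AdaIN-style standard deviation can poison the backward pass:
# a constant feature map has zero spatial variance, and the gradient of
# sqrt() at zero is infinite, which turns into NaN through the chain rule.
import torch

feat = torch.zeros(1, 3, 4, 4, requires_grad=True)  # constant feature map -> zero variance
var = feat.var(dim=[2, 3], unbiased=False)           # per-channel spatial variance
std = var.sqrt()                                      # forward pass is fine, backward is not
std.sum().backward()
print(feat.grad)                                      # NaN everywhere

# The usual guard is to add a small epsilon *inside* the square root:
# std = (var + 1e-5).sqrt()
```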
@MrCaracara,
I have been trying to troubleshoot this issue as well. I was able to trace it back to the values of the e_hat matrix (line 102 in run.py). They slowly shrink to zero and then become NaN over the iterations, which causes the adversarial loss to become NaN. I am stuck at this point...
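One way to narrow this down (a sketch only, the helper and variable names are hypothetical) is to assert on the suspect tensors every iteration and let autograd's anomaly mode point at the operation that first produced a NaN gradient:

```python
# Catch the first non-finite value instead of discovering it later in the loss.
import torch

torch.autograd.set_detect_anomaly(True)  # raises on the op that produced a NaN gradient

def check(name, tensor):
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf at this iteration")

# inside the training loop, e.g.:
# check("e_hat", e_hat)
# check("lossG", lossG)
```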
For all of you interested, the solution proposed in Issue #32 also works for me; at least for now the training runs fine. However, I'm not sure why it works, since Tensor.sqrt() can still yield NaN.
This does prevent the losses from getting too low, but now my generator gets messed up. I ran the training for the last couple of days, and now all the generator produces is plain brown images. I did not observe the training, so I am not sure why this happened.
Faced the same behavior (the generator collapsing to plain brown images).
The solution suggested in Issue #32 contains a small mistake: 'self.eps' should be added after taking the square root. If you do it that way, the NaN losses occur as before, which makes sense, since it would then be equivalent to using torch.std().
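To make the distinction concrete (illustrative tensors only): adding the epsilon inside the square root keeps the gradient finite at zero variance, while adding it after the square root leaves the same blow-up that torch.std() has.

```python
# Compare sqrt(var + eps) with sqrt(var) + eps at zero variance.
import torch

eps = 1e-5

for label, std_fn in [
    ("sqrt(var + eps)", lambda v: (v + eps).sqrt()),
    ("sqrt(var) + eps", lambda v: v.sqrt() + eps),
]:
    x = torch.zeros(8, requires_grad=True)   # zero-variance input
    var = x.var(unbiased=False)
    std_fn(var).backward()
    print(label, "grad finite:", torch.isfinite(x.grad).all().item())
# sqrt(var + eps) -> True, sqrt(var) + eps -> False (NaN), matching the comment above.
```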
It seems that the weight of the MCH loss is 8 times higher than in the original paper.
Has anyone finally been able to fix this issue completely?
It seems that when LossMCH is turned on (FEED_FORWARD=False in config.py), the losses become NaN and the network produces black images. I haven't had time to debug and find out why.
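A stop-gap some people use while debugging (a sketch only, not the repo's code): skip the optimizer step whenever the combined loss is non-finite, so a single bad batch does not poison the weights with NaNs.

```python
# Skip updates on non-finite losses and clip gradients as a safety net.
import torch

def safe_step(loss, optimizer):
    optimizer.zero_grad()
    if not torch.isfinite(loss):
        print("non-finite loss, skipping this step")
        return
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        (p for group in optimizer.param_groups for p in group["params"]), 1.0
    )
    optimizer.step()
```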