Open bslin opened 7 years ago
The error gets hit only after some number of iterations. It seems to get hit after fewer iterations when I use the Adam optimizer rather than momentum, but that might just be specific to my case. After enough iterations, I get this error regardless of the optimizer I use. The same training/testing data works fine if I use cross entropy as the cost function.
Quick update: I found the issue. There is a bug in layers.py, in pixel_wise_softmax_2 and pixel_wise_softmax.
If the output_map is too large, then exponential_map goes to infinity, which produces NaN when calculating the cost function.
The following change fixes it, although we might want to find a better clipping value: replace `exponential_map = tf.exp(output_map)` with `exponential_map = tf.exp(tf.clip_by_value(output_map, -np.inf, 50))`.
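As a hedged NumPy sketch of that workaround (a stand-in for the TensorFlow code; the actual layers.py shapes and axis may differ), clipping the logits keeps `exp` finite:

```python
import numpy as np

def pixel_wise_softmax_naive(output_map):
    # mirrors the original code: exp overflows to inf for large logits,
    # and inf / inf then yields NaN in the normalization
    exponential_map = np.exp(output_map)
    return exponential_map / np.sum(exponential_map, axis=-1, keepdims=True)

def pixel_wise_softmax_clipped(output_map):
    # the workaround from this thread: clip before exponentiating
    exponential_map = np.exp(np.clip(output_map, -np.inf, 50))
    return exponential_map / np.sum(exponential_map, axis=-1, keepdims=True)

logits = np.array([[1000.0, 0.0]])  # an unusually large activation
print(np.isnan(pixel_wise_softmax_naive(logits)).any())    # True
print(np.isnan(pixel_wise_softmax_clipped(logits)).any())  # False
```

Note the clipped version still saturates: any logit above 50 is treated identically, which is why a shift by the maximum (discussed below in this thread) is the cleaner fix.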
BTW thanks for providing the tf_unet code. It has been very helpful! :)
Thanks for reporting this. I'm just wondering why the output_map gets so large.
Yeah, I'm wondering the same thing. I just noticed that I still get garbage results when training on my data (with cross entropy I was getting something more reasonable).
I have no idea why the output_map gets so large, I plan on looking into it some more a little later. Would you happen to have any ideas or theories to look into?
I have also encountered this issue. Using a smaller learning rate helped, so maybe it's just an exploding gradient.
Maybe. Another thing I noticed: to calculate the dice coefficient, the original code uses both channels together. When I use only one of the channels, the values I get work out to be better.
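A hedged NumPy sketch of that difference (hypothetical helper names, not the actual tf_unet code): with one-hot, two-channel labels, a dice score taken over both channels at once is dominated by the (usually much larger) background channel, so a network that predicts only background still scores well:

```python
import numpy as np

def dice_all_channels(pred, label, eps=1e-5):
    # intersection and sums taken over every channel at once
    inter = np.sum(pred * label)
    return (2.0 * inter + eps) / (np.sum(pred) + np.sum(label) + eps)

def dice_one_channel(pred, label, channel=1, eps=1e-5):
    # dice restricted to a single (foreground) channel
    p, l = pred[..., channel], label[..., channel]
    inter = np.sum(p * l)
    return (2.0 * inter + eps) / (np.sum(p) + np.sum(l) + eps)

# 100 pixels, 10 of them foreground; prediction is all background
label = np.zeros((100, 2)); label[:, 0] = 1.0
label[:10, 0], label[:10, 1] = 0.0, 1.0
pred = np.zeros((100, 2)); pred[:, 0] = 1.0

print(dice_all_channels(pred, label))  # ≈ 0.9 -- looks fine
print(dice_one_channel(pred, label))   # ≈ 0.0 -- reveals the missed foreground
```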
This is a typical overflow/underflow issue when computing sum(exp(x)). Searching 'log sum exp' on the web will give some explanation. The trick is to divide/multiply by the same constant before the exp function.
Or you can use `tf.reduce_logsumexp`, or refer to the source code of that function.
@weiliu620 thanks for the hint. I'm going to look into this
@weiliu620 following the lines from here referred to in your SO question, we would just have to subtract the result of `tf.reduce_max` in the `tf.exp` call, right?
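For concreteness, a NumPy sketch of that subtraction (hedged: the real fix would use `tf.reduce_max` and `tf.exp` on the channel axis of output_map, and the exact axis depends on the tensor layout):

```python
import numpy as np

def pixel_wise_softmax_stable(output_map):
    # softmax is invariant to shifting the logits, so subtracting the
    # per-pixel max makes the largest exponent exp(0) == 1 -- no overflow,
    # and unlike clipping it does not distort any finite logits
    shifted = output_map - np.max(output_map, axis=-1, keepdims=True)
    exponential_map = np.exp(shifted)
    return exponential_map / np.sum(exponential_map, axis=-1, keepdims=True)

logits = np.array([[1000.0, 0.0]])
print(pixel_wise_softmax_stable(logits))  # [[1. 0.]] with no NaN
```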
Hi,
I get an error when I try training with the dice coefficient as the cost function. I noticed there was a new commit on this a couple of days ago, so I suspect it's a bug in the code. Would you know roughly where this might be?
InvalidArgumentError Traceback (most recent call last)