Hi, FYI for anyone who gets NaN during training with a model that uses LayerNormalization.
The current implementation in CNTK has an edge case that causes NaN:
def layer_normalize(x):
    mean = reduce_mean(x)              # normalize w.r.t. actual sample statistics
    x0 = x - mean
    std = sqrt(reduce_mean(x0 * x0))   # EDGE CASE: you need the epsilon inside the sqrt!
    if (epsilon != 0):
        std += epsilon
    x_hat = x0 / std
    return x_hat * scale + bias        # denormalize with learned parameters
In the edge case, reduce_mean(x0 * x0) can return a slightly negative value due to floating-point rounding error. Unfortunately, sqrt of a negative value immediately produces NaN. So the solution is to move the epsilon inside the sqrt instead, which also guarantees the argument stays non-negative.
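Here is a minimal sketch of the corrected computation in plain NumPy (the function signature and default values are illustrative, not the actual cntkx API):

```python
import numpy as np

def layer_normalize(x, scale=1.0, bias=0.0, epsilon=1e-5):
    # Normalize w.r.t. actual sample statistics.
    mean = np.mean(x)
    x0 = x - mean
    # Epsilon goes INSIDE the sqrt: even if rounding makes
    # mean(x0 * x0) dip slightly below zero, the argument of
    # sqrt stays positive and no NaN is produced.
    std = np.sqrt(np.mean(x0 * x0) + epsilon)
    x_hat = x0 / std
    return x_hat * scale + bias  # denormalize with learned parameters
```

With the epsilon inside the sqrt, even a degenerate input like an all-constant vector (variance exactly zero) normalizes without dividing by zero.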
I have already resolved this in my library cntkx. You can install it with pip install cntkx, then from cntkx.layers import LayerNormalization, and everything will work fine. cntkx is written in pure Python, so there are no dependency issues either.