Hi, FYI for anyone who gets NaN during training with a model that uses LayerNormalization.
The current implementation in CNTK has an edge case that causes NaN:
def layer_normalize(x):
    mean = reduce_mean(x)              # normalize w.r.t. actual sample statistics
    x0 = x - mean
    std = sqrt(reduce_mean(x0 * x0))   # EDGE CASE: you need the epsilon inside the sqrt!
    if (epsilon != 0):
        std += epsilon
    x_hat = x0 / std
    return x_hat * scale + bias        # denormalize with learned parameters
In the edge case, reduce_mean(x0 * x0) can return a slightly negative value due to floating-point rounding error. Unfortunately, sqrt of a negative value immediately produces NaN. So the solution is to move the epsilon inside the sqrt instead, which also guarantees the argument stays non-negative.
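Here is a minimal sketch of the corrected computation in plain NumPy (the function signature and default values are illustrative, not the actual cntkx API):

```python
import numpy as np

def layer_normalize(x, scale=1.0, bias=0.0, epsilon=1e-5):
    # Normalize w.r.t. actual sample statistics.
    mean = np.mean(x)
    x0 = x - mean
    # Epsilon goes INSIDE the sqrt: even if rounding makes
    # mean(x0 * x0) dip slightly below zero, the argument of
    # sqrt stays positive and no NaN is produced.
    std = np.sqrt(np.mean(x0 * x0) + epsilon)
    x_hat = x0 / std
    return x_hat * scale + bias  # denormalize with learned parameters
```

With the epsilon inside the sqrt, even a degenerate input like an all-constant vector (variance exactly zero) normalizes without dividing by zero.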
I have already resolved this in my library cntkx. You can install it with pip install cntkx, then from cntkx.layers import LayerNormalization, and everything will work fine. cntkx is written in pure Python, so there are no dependency issues either.