microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

LayerNormalization layer has an edge case that causes NaN #3816

Open delzac opened 4 years ago

delzac commented 4 years ago

Hi, FYI for anyone who gets NaN during training with a model that uses LayerNormalization.

The current implementation in CNTK has an edge case that causes NaN:

    def layer_normalize(x):
        mean = reduce_mean(x)               # normalize w.r.t. actual sample statistics
        x0 = x - mean
        std = sqrt(reduce_mean(x0 * x0))    # EDGE CASE: the epsilon needs to be inside the sqrt!
        if epsilon != 0:
            std += epsilon
        x_hat = x0 / std
        return x_hat * scale + bias         # denormalize with learned parameters

In the edge case, reduce_mean(x0 * x0) can return a slightly negative value, most likely due to floating-point rounding error. Taking sqrt of a negative value immediately produces NaN, which then propagates through the rest of training. The fix is to move the epsilon inside the sqrt so that its argument stays strictly positive.
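The fix can be sketched in plain NumPy (a minimal illustration, not CNTK's actual implementation; the epsilon default of 1e-5 is an assumed value for demonstration only):

```python
import numpy as np

def layer_normalize_fixed(x, scale=1.0, bias=0.0, epsilon=1e-5):
    """Layer normalization with epsilon inside the sqrt, so a slightly
    negative variance estimate from rounding can no longer produce NaN."""
    mean = x.mean()
    x0 = x - mean
    variance = (x0 * x0).mean()
    std = np.sqrt(variance + epsilon)  # epsilon inside the sqrt keeps the argument positive
    return (x0 / std) * scale + bias

# Why the original order of operations fails: if rounding pushes the
# variance estimate just below zero, sqrt() yields NaN *before* epsilon
# is added, and NaN + epsilon is still NaN.
variance = np.float64(-1e-12)  # a tiny negative value, as rounding can produce
with np.errstate(invalid="ignore"):
    old_std = np.sqrt(variance) + 1e-5   # epsilon added after the sqrt: NaN
new_std = np.sqrt(variance + 1e-5)       # epsilon added inside the sqrt: finite

print(np.isnan(old_std), np.isnan(new_std))     # True False
print(layer_normalize_fixed(np.full(8, 3.0)))   # all zeros, no NaN
```

Note that a constant input is also handled gracefully: the variance is exactly zero, and the epsilon inside the sqrt prevents a division by zero.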

I have already resolved this in my library cntkx. You can install it with pip install cntkx, then use from cntkx.layers import LayerNormalization, and everything will work fine. cntkx is written in pure Python, so there are no dependency issues either.