
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift #5

leo-p opened 7 years ago

leo-p commented 7 years ago

http://arxiv.org/pdf/1502.03167v3.pdf

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

leo-p commented 7 years ago

Summary:

Inner workings:

Batch normalization fixes the means and variances of each layer's inputs by applying the following normalization to every training mini-batch:

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad
\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2 \qquad
\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}} \qquad
y_i = \gamma \hat{x}_i + \beta$$

The parameters $\gamma$ and $\beta$ are then learned by gradient descent, jointly with the rest of the network. During inference, the batch statistics are replaced by unbiased estimates of the mean and variance over the whole training set (not just the current batch).
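To make the transform concrete, here is a minimal NumPy sketch of a batch-norm forward pass. The class name, the use of running averages, and the `eps`/`momentum` values are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

class BatchNorm:
    """Sketch of batch normalization for inputs of shape (batch_size, num_features)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)    # learned scale
        self.beta = np.zeros(num_features)    # learned shift
        self.eps = eps
        self.momentum = momentum
        # population estimates used at inference time
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mu = x.mean(axis=0)               # mini-batch mean
            var = x.var(axis=0)               # mini-batch (biased) variance
            # track unbiased population statistics for inference
            m = x.shape[0]
            unbiased_var = var * m / max(m - 1, 1)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * unbiased_var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)   # normalize
        return self.gamma * x_hat + self.beta        # scale and shift
```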

Results:

Batch normalization provides several advantages:

  1. Allows a higher learning rate without risk of divergence, since the scale of the gradients is stabilized.
  2. Regularizes the model.
  3. Reduces the need for Dropout.
  4. Prevents the network from getting stuck in saturated regimes when using saturating nonlinearities.

What to do?

  1. Add a batch normalization layer before each activation layer (see the sketch after this list).
  2. Increase the learning rate.
  3. Remove dropout.
  4. Reduce L2 weight regularization.
  5. Accelerate learning rate decay.
  6. Reduce image distortions used for data augmentation.
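As a rough illustration of this recipe, here is a hypothetical PyTorch sketch; the layer sizes, learning rate, weight decay, and decay schedule are made-up values, not the paper's settings.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # batch norm placed before the activation (item 1)
    nn.ReLU(),
    nn.Linear(256, 10),
    # no nn.Dropout layers (item 3)
)

# higher learning rate (item 2) and reduced L2 penalty (item 4)
optimizer = optim.SGD(model.parameters(), lr=0.5, weight_decay=1e-5)

# faster learning-rate decay (item 5)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
```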