howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

Normalization Techniques in Training DNNs: Methodology, Analysis and Application #73

Open howardyclo opened 3 years ago

howardyclo commented 3 years ago

Metadata

TL;DR

This paper reviews the past, present and future of normalization methods for DNN training, and aims to answer the following questions:

  1. What are the main motivations behind different normalization methods in DNNs, and how can we present a taxonomy for understanding the similarities and differences between a wide variety of approaches?
  2. How can we reduce the gap between the empirical success of normalization techniques and our theoretical understanding of them?
  3. What recent advances have been made in designing/tailoring normalization techniques for different tasks, and what are the main insights behind them?

Introduction

Normalization techniques typically serve as a "layer" between learnable weights and activations in DNN architectures. More importantly, they have advanced deep learning research and become an essential module in DNN architectures for various applications. For example, Layer Normalization (LN) for Transformers used in NLP, and Spectral Normalization (SN) for the discriminator in GANs used in generative modeling.

Question 1

Five normalization operations considered

Motivation of normalization

Convergence is proved to be related to the statistics of the input of a linear model: e.g., for a linear model with quadratic loss, the Hessian equals the input's second-moment matrix, so if that matrix is the identity, the model converges within one iteration of full-batch gradient descent (GD). Three kinds of normalization are discussed (see the sketch after this list):

  1. Normalizing the activations (non-learnable or learnable)
  2. Normalizing the weights with a constrained distribution such that the gradients of the activations are implicitly normalized. These methods are inspired by weight normalization but extended towards satisfying the desired properties during training.
  3. Normalizing the gradients to exploit the curvature information for GD/SGD.
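A minimal NumPy sketch (not from the paper) of what these three targets look like in isolation; the layer shapes, the unit-norm weight constraint in (2), and the unit-norm gradient rescaling in (3) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 128))    # a batch of activations: (batch, features)
W = rng.normal(size=(128, 10))    # weights of a linear layer
grad = rng.normal(size=W.shape)   # gradient of the loss w.r.t. W

# (1) Normalizing the activations: standardize each feature over the batch.
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# (2) Normalizing the weights: constrain each output unit's weight vector,
#     here by projecting it onto the unit sphere (learnable scale omitted).
W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)

# (3) Normalizing the gradients: rescale the gradient before the SGD update
#     so the step size is decoupled from the raw gradient magnitude.
grad_hat = grad / (np.linalg.norm(grad) + 1e-12)
```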

Normalization framework Π -> Φ -> Ψ

Take batch normalization (BN; ICML'15) as an example. For a given channel-first input batch X with shape (c, b, h, w):

  1. Normalization area partitioning (Π): (c, b, h, w) -> (c, b*h*w)
  2. Normalization operation (Φ): Standardization along the last dimension of (c, b*h*w)
  3. Normalization representation recovery (Ψ): Affine transformation with learnable parameters for X.
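A minimal NumPy sketch of BN written as these three steps (training-mode statistics only; the running-average bookkeeping needed at inference is omitted, and the (c, b, h, w) layout follows the description above):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization decomposed into the Π -> Φ -> Ψ framework.

    X: channel-first batch of shape (c, b, h, w).
    gamma, beta: learnable per-channel parameters of shape (c,).
    """
    c, b, h, w = X.shape

    # Π: normalization area partitioning, (c, b, h, w) -> (c, b*h*w)
    Z = X.reshape(c, b * h * w)

    # Φ: normalization operation, standardize along the last dimension
    mean = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_hat = (Z - mean) / np.sqrt(var + eps)

    # Ψ: normalization representation recovery, per-channel affine transform
    Y = gamma[:, None] * Z_hat + beta[:, None]
    return Y.reshape(c, b, h, w)

# Usage on random data: 3 channels, batch of 4, 8x8 feature maps.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4, 8, 8))
out = batch_norm(X, gamma=np.ones(3), beta=np.zeros(3))
```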

Several weaknesses of BN

  1. Inconsistency between training and inference limits its usage in complex networks, such as RNNs or GANs (see the sketch below).
  2. It suffers in the small-batch-size setting (e.g., object detection and segmentation), where the mini-batch statistics become noisy estimates.

To address these weaknesses of BN, several normalization methods have been proposed; they are discussed below under this framework.
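The sketch referenced above: a minimal illustration of how training-mode and inference-mode BN can disagree, especially for a small batch, assuming the standard running-statistics implementation (the momentum value and shapes are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-channel running statistics accumulated during training, as in the usual
# BN implementation; the momentum value 0.1 is an illustrative assumption.
running_mean, running_var, momentum = np.zeros(3), np.ones(3), 0.1

def bn_train(Z, eps=1e-5):
    """Training mode: normalize with the statistics of the current mini-batch."""
    global running_mean, running_var
    mean, var = Z.mean(axis=1), Z.var(axis=1)
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var
    return (Z - mean[:, None]) / np.sqrt(var[:, None] + eps)

def bn_eval(Z, eps=1e-5):
    """Inference mode: normalize with the accumulated running statistics."""
    return (Z - running_mean[:, None]) / np.sqrt(running_var[:, None] + eps)

# A "batch" of only 2 samples per channel: the batch statistics are noisy,
# so training-mode and inference-mode outputs of the same input differ.
Z = rng.normal(size=(3, 2))
print(np.abs(bn_train(Z) - bn_eval(Z)).max())
```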

Normalization area partitioning

TBD