Internal Covariate Shift = the change in the distribution of each layer's inputs during training, as the parameters of the preceding layers change (so each mini-batch effectively sees a shifted data distribution)
Problems:
Convergence of learning becomes slow.
The initial parameter values must be chosen carefully.
By introducing Batch Normalization (normalizing each feature over the mini-batch to mean 0 and variance 1), learning can be accelerated, and a larger learning rate can be used. It also reduces the dependence on initial values and the need for Dropout.
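A minimal sketch of the normalization step described above, using NumPy (the names `batch_norm`, `gamma`, and `beta` are illustrative; `gamma`/`beta` are the learnable scale and shift parameters from the paper, and `eps` is a small constant for numerical stability):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, variance ~1 per feature
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # toy mini-batch
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each output feature has mean approximately 0 and variance approximately 1 over the batch; at inference time, frameworks typically use running estimates of the mean and variance instead of per-batch statistics.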
Basic Information
Link
https://arxiv.org/abs/1502.03167
Overview
Others
Reference (for understanding)