Internal Covariate Shift = the change in the distribution of each layer's inputs during training, as the parameters of the preceding layers change (so each mini-batch effectively sees a shifted data distribution)
Problems:
Convergence of learning becomes slow.
The initial parameter values must be chosen carefully.
By introducing Batch Normalization (normalizing each feature over the mini-batch to mean 0 and variance 1), learning can be accelerated, and a larger learning rate can be used. It also reduces the dependence on initial values and the need for Dropout.
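A minimal sketch of the normalization step described above, using NumPy (the names `batch_norm`, `gamma`, and `beta` are illustrative; `gamma`/`beta` are the learnable scale and shift parameters from the paper, and `eps` is a small constant for numerical stability):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # mean 0, variance ~1 per feature
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # toy mini-batch
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each output feature has mean approximately 0 and variance approximately 1 over the batch; at inference time, frameworks typically use running estimates of the mean and variance instead of per-batch statistics.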
Basic Information
Link
https://arxiv.org/abs/1502.03167
Overview
Others
Reference (for understanding)