AlekzNet opened this issue 8 years ago
I didn't do any super controlled experiments, but deeper models with batchnorm do seem to train faster.
I made some comparisons for the 3x128 model (16MB corpus, seq_length=400, batch_size=200, 200 iterations per epoch):
The peaks at iteration 400 are related to the learning rate decay (0.92) in the first 3 test runs; I removed the decay for all subsequent runs.
The bottom line: batch normalization DOES help. For example, without BN the highest usable learning rate was 0.01, versus 0.05 with BN.
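For concreteness, here is a minimal sketch (in PyTorch, not the original Torch code from this repo) of the kind of 3x128 setup described above, with batch norm optionally inserted between the LSTM layers. The vocab size, the optimizer, and the exact BN placement (normalizing layer outputs rather than the recurrent connections) are my assumptions, not what the actual implementation does:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size=96, hidden=128, layers=3, use_bn=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstms = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(layers)]
        )
        # BatchNorm1d over the feature dimension of each layer's output;
        # this only normalizes layer outputs, not the recurrent connections.
        self.bns = nn.ModuleList(
            [nn.BatchNorm1d(hidden) if use_bn else nn.Identity() for _ in range(layers)]
        )
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, x):                      # x: (batch, seq_length) int64
        h = self.embed(x)                      # (batch, seq, hidden)
        for lstm, bn in zip(self.lstms, self.bns):
            h, _ = lstm(h)
            b, t, c = h.shape                  # fold (batch, seq) so BatchNorm1d sees (N, C)
            h = bn(h.reshape(b * t, c)).reshape(b, t, c)
        return self.proj(h)                    # (batch, seq, vocab)

model = CharRNN(use_bn=True)
# LR 0.05 was reported usable with BN (0.01 without); decay factor 0.92 per epoch.
# Adam is an assumption here, not necessarily the optimizer used in the runs above.
opt = torch.optim.Adam(model.parameters(), lr=0.05)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.92)
```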
Fun! ;)
Those are some beautiful curves, thanks for sharing! It's interesting that the models without BN show a plateau at the beginning; that suggests to me that the initialization may not be correct.
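To make that hypothesis concrete, here is a sketch of one common LSTM initialization recipe (orthogonal recurrent weights, Xavier input weights, forget-gate bias of 1) that could be tried on the model above. This is just one standard recipe, not what this repo actually does:

```python
import torch.nn as nn

def init_lstm(lstm: nn.LSTM):
    for name, p in lstm.named_parameters():
        if "weight_hh" in name:
            nn.init.orthogonal_(p)                 # recurrent weights
        elif "weight_ih" in name:
            nn.init.xavier_uniform_(p)             # input-to-hidden weights
        elif "bias" in name:
            p.data.zero_()
            hidden = p.shape[0] // 4               # PyTorch gate order: i, f, g, o
            p.data[hidden:2 * hidden].fill_(1.0)   # forget-gate bias = 1
```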
I wonder if anybody has made any comparisons between batch normalization and dropout, and/or a combination of them. I've seen some papers on the topic that state the difference is not as pronounced for text processing as it is for, say, images, but I haven't seen any proper comparisons.
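In case anyone wants to try it, a sketch of how such a comparison could be organized (PyTorch again; the helper name, the dropout placement between LSTM layers, and the grid values are assumptions, with data, seed, and LR schedule held fixed across runs):

```python
import itertools
import torch.nn as nn

def make_stack(hidden=128, layers=3, use_bn=False, dropout_p=0.0):
    blocks = []
    for _ in range(layers):
        blocks.append(nn.LSTM(hidden, hidden, batch_first=True))
        if use_bn:
            blocks.append(nn.BatchNorm1d(hidden))  # applied to flattened layer outputs
        if dropout_p > 0:
            blocks.append(nn.Dropout(dropout_p))
    return nn.ModuleList(blocks)

# Four runs: baseline, BN only, dropout only, both combined.
for use_bn, dropout_p in itertools.product([False, True], [0.0, 0.5]):
    stack = make_stack(use_bn=use_bn, dropout_p=dropout_p)
    n_params = sum(p.numel() for p in stack.parameters())
    print(f"use_bn={use_bn}, dropout={dropout_p}, params={n_params}")
```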