jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch

Batchnorm vs dropout #62

Open AlekzNet opened 8 years ago

AlekzNet commented 8 years ago

I wonder if anybody has made any comparisons between batch normalization and dropout, and/or a combination of the two. I've seen some papers on this; they state that for text processing the difference is not as pronounced as it is for, say, images, but I haven't seen any proper side-by-side comparisons.
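To make it concrete, the comparison I have in mind is just flipping the regularization flags on train.lua while keeping everything else fixed, roughly like the sketch below (the data files, model size, and values are placeholders, and the flag names should be double-checked against train.lua):

```bash
# Run A: dropout only, no batch normalization
th train.lua -input_h5 data.h5 -input_json data.json \
  -model_type lstm -num_layers 3 -rnn_size 128 \
  -batchnorm 0 -dropout 0.5

# Run B: same model, batch normalization instead of dropout
th train.lua -input_h5 data.h5 -input_json data.json \
  -model_type lstm -num_layers 3 -rnn_size 128 \
  -batchnorm 1 -dropout 0
```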

jcjohnson commented 8 years ago

I didn't do any super controlled experiments, but deeper models with batchnorm do seem to train faster.

AlekzNet commented 8 years ago

I made some comparisons for the 3x128 model (16MB corpus, seq_length=400, batch_size=200, 200 iterations per epoch):

[Figure: 3x128-comparison, training curves for the runs described above]

The peaks at iteration 400 come from the learning rate decay (0.92) used for the first 3 test runs; I removed the decay for all subsequent runs.

The bottom line: batch normalization DOES help. For example, without BN the highest usable learning rate was 0.01, versus 0.05 with BN.

Fun! ;)
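In case someone wants to reproduce this, the setup above corresponds roughly to an invocation like the following (just a sketch: the decay flag name in particular is an assumption on my part, so check it against train.lua first):

```bash
# 3x128 LSTM, seq_length=400, batch_size=200, BN on, LR 0.05
th train.lua -input_h5 corpus.h5 -input_json corpus.json \
  -model_type lstm -num_layers 3 -rnn_size 128 \
  -seq_length 400 -batch_size 200 \
  -batchnorm 1 -learning_rate 0.05 \
  -lr_decay_factor 1   # assumed flag; factor 1 keeps the LR constant (no 0.92 decay)
```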

jcjohnson commented 8 years ago

Those are some beautiful curves - thanks for sharing! It's interesting that the models without BN show a plateau at the beginning; that suggests to me that the initialization may not be right.