uds-lsv / TF-NNLM-TK

A toolkit for neural language modeling using Tensorflow including basic models like RNNs and LSTMs as well as more advanced models.

Training on Billion Word Corpus #5

Closed bjascob closed 6 years ago

bjascob commented 6 years ago

I've been working on training a high-quality LM (i.e., perplexity < 40) based on the models described in "Estimation of gap between current Language Models and human performance," from LSV at Saarland. I assume the code in this toolkit was used, at least in part, for that paper. In trying to replicate some of those results on the Billion Word corpus, I'm running into questions about specifics. I'm hoping you can offer some insights and save me a lot of time optimizing this very slow training process.

1 - Are the full sets of hyperparameters used to train the LSTM models available anywhere? The learning rate and its decay schedule are of particular interest; things like batch size would also be nice to know.
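
For context, this is the kind of TF1-style decay schedule I have in mind; every number below is a placeholder, since the real values are exactly what I'm asking about (`loss` is assumed to be the model's training loss tensor):

```python
import tensorflow as tf  # TF 1.x API

global_step = tf.Variable(0, trainable=False)
# Hypothetical schedule -- initial rate, decay interval, and decay factor
# are placeholders, not the paper's settings.
learning_rate = tf.train.exponential_decay(
    learning_rate=1.0,
    global_step=global_step,
    decay_steps=100000,
    decay_rate=0.5,
    staircase=True)
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)
```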

2 - Was there a specific reason to use plain gradient descent as the optimizer? In my experience, Adam and AdaGrad seem to be the most popular nowadays, specifically because they eliminate the need to manually tune learning rate schedules (and TensorFlow has both implemented).
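
In TF1 the swap is a one-line change, which is why I'm curious whether plain SGD was a deliberate choice. A sketch of what I mean, where `loss` and the learning rates are placeholders:

```python
import tensorflow as tf  # TF 1.x API

# Plain SGD, as the toolkit appears to use:
# train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

# Adaptive alternatives that adjust per-parameter step sizes automatically:
train_op = tf.train.AdagradOptimizer(learning_rate=0.1).minimize(loss)
# train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)
```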

3 - It looks like the models were trained on the full 800M-word corpus. With my Titan X, a single pass through all that data with the larger models looks like it would take about 3 days. That's a long time to spend inside a single epoch with the same LR and no checkpoints. Did you split the training data across multiple epochs, or use any other technique to optimize the process?
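
The obvious workaround I'm considering is periodic mid-epoch checkpointing with `tf.train.Saver`, along these lines (a minimal sketch; `train_op`, `inputs`, `targets`, and `batches` are illustrative names, not from the toolkit):

```python
import tensorflow as tf  # TF 1.x API

saver = tf.train.Saver(max_to_keep=5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step, (x, y) in enumerate(batches):  # `batches` is a hypothetical iterator
        sess.run(train_op, feed_dict={inputs: x, targets: y})
        if step > 0 and step % 10000 == 0:   # save every 10k steps, well within an epoch
            saver.save(sess, 'ckpt/model', global_step=step)
```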

Thanks.

chin-gyou commented 6 years ago

Sorry for the confusion. The original paper is based on a different implementation at https://github.com/chin-gyou/lstm; you can see the default settings in its README. For those experiments we used AdaGrad, trained on the full corpus. We trained on 3 Titan X GPUs, and the largest model took less than one day to finish one epoch.
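
(For reference, the usual TF1 pattern for multi-GPU training is synchronous "towers" that average per-variable gradients across GPUs; the sketch below is illustrative only, with `build_model` and `input_shards` as hypothetical names, and is not necessarily how that repository implements it.)

```python
import tensorflow as tf  # TF 1.x API

opt = tf.train.AdagradOptimizer(0.1)            # learning rate is a placeholder
tower_grads = []
for i in range(3):                              # one tower per GPU
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        loss = build_model(input_shards[i])     # hypothetical model-building function
        tower_grads.append(opt.compute_gradients(loss))

# Average each variable's gradient across the towers, then apply once.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), grads_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)
```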

bjascob commented 6 years ago

Thanks. I'll try that code. That should help a lot.