Edit: I got the same results on CUDA 8 and 9, and the LightRNN example gives similar performance.
The WordLMWithSampledSoftmax example allows one to run a common "Small" RNN configuration (2-layer, 200-dim LSTM) on the PTB dataset (10K vocabulary, ~1M tokens). This is a typical setup, in which test/validation perplexity should level out around 115 PPL (as it does, e.g., in TensorFlow).
However, even with full softmax, this implementation doesn't come close to that, instead leveling out around 300 PPL. This remains true across a range of configurations: without momentum, with plain SGD, with lower/higher learning rates, with different batch sizes, etc.
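To put the gap in perspective, a quick sketch converting perplexity back to average per-token cross-entropy (assuming the standard relation PPL = exp(cross-entropy in nats); the numbers are just the two plateaus quoted above):

```python
import math

def cross_entropy_nats(ppl):
    """Average per-token cross-entropy (nats) implied by a perplexity value."""
    return math.log(ppl)

ce_reported = cross_entropy_nats(300)  # what this example levels out at
ce_expected = cross_entropy_nats(115)  # typical "Small" PTB result

# The gap is almost a full nat per token, far more than tuning noise:
print(round(ce_reported - ce_expected, 2))  # → 0.96
```

So the difference is roughly 0.96 nats of cross-entropy per token, which is far too large to attribute to hyperparameter tuning alone.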
There does not seem to be any other general language-modeling tutorial/example, or any other issue, addressing this, nor are the expected outputs on PTB for this example published. Are these results expected? What configuration would replicate other "Small" configs for the PTB dataset? Thanks!
Any comments on this? Is this a mistake on my part? I've preferred CNTK over TensorFlow so far, but this is preventing me from using it for language-modeling tasks.