Edit: I got the same results on CUDA 8 and 9, and the LightRNN example gives similar performance.
The WordLMWithSampledSoftmax example allows one to run a common "Small" RNN configuration (2-layer, 200-dim LSTM) on the PTB dataset (10K vocabulary, ~1M tokens). This is a typical setup, in which test/validation perplexity should level out around 115 PPL (as it does, e.g., in TensorFlow).
However, even with full softmax, this implementation doesn't come close to that, instead leveling out around 300 PPL. This remains true across a range of configurations: without momentum, with plain SGD, with lower/higher learning rates, with different batch sizes, etc.
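To put the gap in perspective, a quick sketch converting perplexity back to average per-token cross-entropy (assuming the standard relation PPL = exp(cross-entropy in nats); the numbers are just the two plateaus quoted above):

```python
import math

def cross_entropy_nats(ppl):
    """Average per-token cross-entropy (nats) implied by a perplexity value."""
    return math.log(ppl)

ce_reported = cross_entropy_nats(300)  # what this example levels out at
ce_expected = cross_entropy_nats(115)  # typical "Small" PTB result

# The gap is almost a full nat per token, far more than tuning noise:
print(round(ce_reported - ce_expected, 2))  # → 0.96
```

So the difference is roughly 0.96 nats of cross-entropy per token, which is far too large to attribute to hyperparameter tuning alone.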
There does not seem to be any other general language-modeling tutorial/example, or any other issue, addressing this, nor are the expected outputs on PTB for this example published. Are these results expected? What configuration would replicate other "Small" configs for the PTB dataset? Thanks!
Any comments on this? Is this a mistake on my part? I've preferred CNTK over TensorFlow so far, but this is preventing me from using it for language-modeling tasks.