kimiyoung / transformer-xl

Apache License 2.0

Different training steps in tf and pytorch #114

Closed richardbaihe closed 3 years ago

richardbaihe commented 4 years ago

Hi, I notice that the number of training steps for base_wt103 in the PyTorch code is 200K, while it is 400K in the TF scripts. However, for large wt103, both are 4M.

I am confused about the training steps, as I am training the large PyTorch model on 16 x 32GB V100 GPUs. The speed is too slow to finish the 4,000,000 steps (at ~1300 ms per step that would take about 2 months; is that right?).
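A quick sanity check of that estimate (a minimal sketch; the only inputs are the ~1300 ms/step and 4M steps figures quoted above):

```python
# Back-of-the-envelope estimate of total wall-clock training time,
# assuming ~1300 ms per step as reported above.
steps = 4_000_000                  # configured max training steps for large wt103
sec_per_step = 1.3                 # ~1300 ms per step observed on 16 x 32GB V100
total_days = steps * sec_per_step / 86_400
print(f"~{total_days:.0f} days")   # prints "~60 days", i.e. roughly 2 months
```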

By the way, will the TF code be faster than the PyTorch code in this project?

Thanks for your help!

menghuanlater commented 4 years ago

My training time is about the same as this; it feels too slow. The TF speed is not much different either.

tonytan48 commented 4 years ago

@richardbaihe @menghuanlater Maybe 4 million is a typo. Did you try training the Transformer-XL model with 400K steps? In the paper, the authors mention that they also trained on the One Billion Word dataset, which is much larger, for only 400K steps. @kimiyoung Hi, could you help clarify the training steps and maybe share an estimate of the GPU hours needed to train Transformer-XL large?

richardbaihe commented 3 years ago

Hi @tonytan48, I didn't train Transformer-XL large, but I trained my own Segatron-XL large. I got 18.3 ppl within 172K steps and 17.1 ppl within 352K steps. I guess the Transformer-XL large model needs 400K steps, not 4M, to reach 18.3 ppl.