Closed: richardbaihe closed this issue 3 years ago
My training time is about the same as this, which feels too slow. The TF version isn't much faster either.
@richardbaihe @menghuanlater Maybe 4 million is a typo. Did you try training the Transformer-XL model with 400K steps? In the paper, the authors mention that they also trained on the One Billion Word dataset, which is much larger, for only 400K steps. @kimiyoung Hi, could you help clarify the training steps, and maybe share an estimate of the GPU hours needed to train Transformer-XL large?
Hi @tonytan48, I didn't train Transformer-XL large, but I trained my own Segatron-XL large. I got 18.3 ppl within 172K steps and 17.1 ppl within 352K steps. I guess the Transformer-XL large model needs 400K steps, not 4M, to reach 18.3 ppl.
Hi, I notice that the number of training steps for base_wt103 in the PyTorch code is 200K, while it is 400K in the TF scripts. However, for large wt103, both are set to 4M.
I am confused about the training steps, as I am training the large PyTorch model on 16 x 32GB V100 GPUs. The speed is too slow to finish 4,000,000 steps (about 1300 ms per step, roughly 2 months in total; is that right?).
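For reference, a minimal back-of-the-envelope check of that estimate, assuming the 1300 ms/step and 4M step figures quoted above (actual throughput will vary with hardware and batch size):

```python
# Rough wall-clock estimate: 4M steps at ~1300 ms/step
# (numbers taken from the discussion above, not measured here).
steps = 4_000_000
seconds_per_step = 1.3

total_seconds = steps * seconds_per_step
total_days = total_seconds / (60 * 60 * 24)
print(f"~{total_days:.0f} days (~{total_days / 30:.1f} months)")
# -> ~60 days (~2.0 months), so the 2-month estimate is consistent.
```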
By the way, will the TF code be faster than the PyTorch code in this project?
Thanks for your help!