DanielRoeder1 opened this issue 2 years ago
@AlexWertheim do you have cycles to take a look at this one?
Sure, I can take a look!
@DanielRoeder1
Thanks for reporting this bug. Could you please provide more details so we can reproduce exactly what you experienced on your end? That way we will be able to circle back with concrete steps you can take to address the problem.
Thanks.
Sure, excuse the late response. You can find the complete training code in the following colab notebooks:
TPU: https://colab.research.google.com/drive/1fSTCbKq7b2iYaDQwrkVe18E81qDZdt3N?usp=sharing
GPU: https://colab.research.google.com/drive/1hW9_pr4B1yDI9sfMs8DRyGUFYQXkybft?usp=sharing
The hyperparameters are the same in both notebooks. The majority of the parameters are set in the config.json.
❓ Questions and Help
I have trained my transformer model once on a single GPU and once on a multi-core TPU. In both cases a batch size of 256 is used (times 8 for the TPU). My training results show that the TPU loss after 400 update steps is almost equal to the GPU loss after 400 updates, even though the effective batch size is 8 times as high, and this trend continues. This leads me to believe that the TPU cores are somehow misaligned and are each training their own model. I use a custom learning rate scheduler that updates the LR at each training step (see the snippet below). If I remove this scheduler, the training loss during TPU training drops significantly faster, but training becomes very unstable.
In the training loop, the optimizer is initialized on each core and wrapped by the scheduler, which updates the learning rate before each training step.
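For reference, here is a minimal sketch of such a per-core setup with torch_xla 1.12. It assumes the scheduler is the inverse-square-root warmup schedule from the original paper; `TransformerModel`, `make_dataloader`, and the `flags` values are hypothetical placeholders standing in for the self-coded model and the settings in config.json:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp


class NoamScheduler:
    """Wraps the optimizer and sets the LR before every step (inverse-sqrt warmup)."""

    def __init__(self, optimizer, d_model, warmup_steps=4000):
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def step_and_update_lr(self):
        self.step_num += 1
        lr = self.d_model ** -0.5 * min(self.step_num ** -0.5,
                                        self.step_num * self.warmup_steps ** -1.5)
        for group in self.optimizer.param_groups:
            group["lr"] = lr
        return lr


def _mp_fn(index, flags):
    device = xm.xla_device()
    model = TransformerModel(d_model=flags["d_model"]).to(device)   # hypothetical self-coded model
    optimizer = torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-9)
    scheduler = NoamScheduler(optimizer, flags["d_model"], flags["warmup_steps"])
    loss_fn = nn.CrossEntropyLoss(ignore_index=flags["pad_id"])

    # Each of the 8 processes iterates over its own shard of the data.
    train_loader = pl.MpDeviceLoader(make_dataloader(flags, index), device)  # hypothetical helper

    model.train()
    for src, tgt in train_loader:
        scheduler.step_and_update_lr()       # LR is updated before each training step
        optimizer.zero_grad()
        logits = model(src, tgt[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        loss.backward()
        xm.optimizer_step(optimizer)         # averages gradients across the 8 cores, then steps


if __name__ == "__main__":
    flags = {"d_model": 512, "warmup_steps": 4000, "pad_id": 0}  # illustrative; real values come from config.json
    xmp.spawn(_mp_fn, args=(flags,), nprocs=8, start_method="fork")
```

Since `xm.optimizer_step` averages the gradients across all replicas, and every process advances the same step counter, all cores should compute the same learning rate and hold the same weights after each step.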
Any help with mitigating the performance issues encountered when using the scheduler is more than welcome. Thanks!
Extra Information
- Model: "Attention Is All You Need" Transformer (self-coded)
- Environment: Colab TPUv2 with torch_xla 1.12; Colab GPU T4 (plain PyTorch, no XLA)
- Train settings: batch size 256, same LR schedule, same loss function (CrossEntropy); the TPU run uses 8 cores, so the effective batch size is 8 * 256 (a way to sanity-check that the cores stay in sync is sketched after this list)
- Data: WMT14, 4.5 million de-en sentence pairs
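One way to sanity-check whether the 8 replicas actually stay in sync is to compare a scalar checksum of the model parameters across cores every few hundred steps. A minimal sketch (not part of the notebooks above; `model` is the replica living on the current core):

```python
import torch_xla.core.xla_model as xm


def check_replica_sync(model, step):
    """Compare a scalar checksum of the model parameters across all TPU cores.

    If gradient averaging works as expected, every core should report (almost)
    the same value; diverging values would mean each core is effectively
    training its own model.
    """
    checksum = float(sum(p.detach().sum() for p in model.parameters()))
    # mesh_reduce gathers the per-core values; passing `list` returns them all.
    all_checksums = xm.mesh_reduce("param_checksum", checksum, list)
    xm.master_print(f"step {step}: per-core parameter checksums = {all_checksums}")
```

Calling this from inside the per-core training function right after `xm.optimizer_step` prints one checksum per core; if the values drift apart over training, the replicas really are training separate models.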