pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Kaggle training pipelines don't work #2001

Closed shonenkov closed 4 years ago

shonenkov commented 4 years ago

🐛 Bug

Approximately two days ago my training pipelines broke with this issue. Could you help?

I tried to rerun the Kaggle kernel you provided, https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch, and the issue reproduced!

I added simple debug prints to a forked kernel to demonstrate the problem: https://www.kaggle.com/shonenkov/debug-issue-simple-xlmr-tpu-pytorch

The process hangs without raising any exception after the first training step.

CPU/TPU utilization drops to zero while it hangs.
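For illustration, a minimal sketch of the kind of debug prints described, assuming a standard torch_xla training loop (the model, loader, loss function, and messages are placeholders, not the kernel's actual code):

```python
# Minimal sketch (not the original kernel): debug prints around each
# training step to locate where a torch_xla loop hangs.
import torch_xla.core.xla_model as xm

def train_one_epoch(model, loader, optimizer, loss_fn):
    device = xm.xla_device()
    model.train()
    for step, (inputs, targets) in enumerate(loader):
        xm.master_print(f"step {step}: batch received")
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        # .item() forces execution of the lazily built XLA graph; if the
        # process sleeps here, the hang is in graph compilation/execution.
        xm.master_print(f"step {step}: loss = {loss.item():.4f}")
```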

To Reproduce

Steps to reproduce the behavior:

  1. Rerun the kernel https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch
  2. Observe that the process hangs and no training happens

Expected behavior

Training steps run and make progress.

Environment

Standard Kaggle TPU environment

Additional context

dlibenzi commented 4 years ago

I suggest you take a look at this thread: https://github.com/pytorch/xla/issues/1870. There have been changes to make it fit the host memory budget, as well as to properly bucket the inputs.
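(For context, "bucketing the inputs" means padding each batch up to one of a few fixed lengths, so XLA sees only a small, fixed set of input shapes and does not recompile the graph for every new sequence length. A minimal sketch of that idea; the bucket sizes, pad id, and collate function below are illustrative, not code from the kernel or from #1870:)

```python
# Sketch of input-length bucketing for a torch DataLoader (illustrative).
import torch

BUCKETS = [64, 128, 256, 512]  # assumed bucket lengths
PAD_TOKEN_ID = 1               # assumed pad id (XLM-R uses 1)

def bucketed_collate(batch):
    """batch: list of (token_ids, label) pairs with variable-length token_ids."""
    max_len = max(len(ids) for ids, _ in batch)
    # Smallest bucket that fits the longest sequence; truncate to the
    # largest bucket if the sequence is longer than all of them.
    target = next((b for b in BUCKETS if b >= max_len), BUCKETS[-1])
    padded = torch.full((len(batch), target), PAD_TOKEN_ID, dtype=torch.long)
    for i, (ids, _) in enumerate(batch):
        ids = ids[:target]
        padded[i, : len(ids)] = torch.as_tensor(ids, dtype=torch.long)
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    return padded, labels
```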

shonenkov commented 4 years ago

Thanks a lot!

I was using the 20200426 version, but the Kaggle community advised using the 20200420 version instead. That fixed it for me.
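(For reference, a sketch of how a specific nightly could be pinned from a notebook cell at the time, using the env-setup script; the script path and `--version` flag are taken from the 2020-era setup instructions and may have changed since:)

```python
# Sketch: pin the 20200420 torch_xla nightly from a Kaggle notebook cell.
# Script URL and flag are assumptions based on the 2020-era instructions.
import subprocess
import urllib.request

SETUP_URL = ("https://raw.githubusercontent.com/pytorch/xla/"
             "master/contrib/scripts/env-setup.py")
urllib.request.urlretrieve(SETUP_URL, "pytorch-xla-env-setup.py")
subprocess.run(
    ["python", "pytorch-xla-env-setup.py", "--version", "20200420"],
    check=True,
)
```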