Closed shonenkov closed 4 years ago
I suggest you take a look at this thread, https://github.com/pytorch/xla/issues/1870, since there have been changes to make it fit the host memory budget and to bucket the inputs properly.
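To illustrate what bucketing the inputs means here: XLA compiles a graph for each distinct input shape, so padding every sequence to one of a few fixed lengths keeps the number of compilations small. The following is only a minimal sketch of that idea, not the code from the linked thread or kernel. The bucket sizes, the function names `bucket_length` and `pad_to_bucket`, and the default `pad_id=1` are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries chosen for illustration only.
BUCKETS = (64, 128, 256, 512)


def bucket_length(n, buckets=BUCKETS):
    """Return the smallest bucket >= n, or the largest bucket if n exceeds it."""
    i = bisect_left(buckets, n)
    return buckets[min(i, len(buckets) - 1)]


def pad_to_bucket(token_ids, pad_id=1, buckets=BUCKETS):
    """Pad (or truncate) a list of token ids to its bucket length.

    pad_id=1 is an assumed padding id; use the tokenizer's own value.
    """
    target = bucket_length(len(token_ids), buckets)
    clipped = token_ids[:target]  # truncate sequences longer than the last bucket
    return clipped + [pad_id] * (target - len(clipped))
```

With this, a 70-token sequence and a 100-token sequence both come out with length 128, so they share one compiled graph instead of triggering two compilations.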
Thanks a lot!
I was using the 20200426 version, but the Kaggle community advised using the 20200420 version. That helped for me.
🐛 Bug
Approximately 2 days ago my training pipelines broke with this issue. Could you help?
I tried to run the Kaggle kernel you provided, https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch, and the issue reproduced there too.
I added simple debug prints to a forked kernel to demonstrate it: https://www.kaggle.com/shonenkov/debug-issue-simple-xlmr-tpu-pytorch
The process sleeps without raising any exception after the first training step.
Neither the CPU nor the TPU is used while it sleeps.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Training steps keep running after the first one.
Environment
tar.gz report from https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#using-debug_runpy-to-collect-debug-information
Similar to Kaggle
Additional context