pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Kaggle training pipelines don't work #2001

Closed shonenkov closed 4 years ago

shonenkov commented 4 years ago

🐛 Bug

Approximately two days ago my training pipelines broke with this issue. Could you help?

I tried to rerun the Kaggle kernel you provided, https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch, and the issue reproduced!

I added simple debug prints to a forked kernel to demonstrate the problem: https://www.kaggle.com/shonenkov/debug-issue-simple-xlmr-tpu-pytorch

The process hangs without raising any exception after the first training step.

CPU/TPU utilization drops to zero while it hangs.
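For illustration, a minimal sketch of the kind of debug prints described, assuming a standard torch_xla training loop (the model, loader, loss function, and messages are placeholders, not the kernel's actual code):

```python
# Minimal sketch (not the original kernel): debug prints around each
# training step to locate where a torch_xla loop hangs.
import torch_xla.core.xla_model as xm

def train_one_epoch(model, loader, optimizer, loss_fn):
    device = xm.xla_device()
    model.train()
    for step, (inputs, targets) in enumerate(loader):
        xm.master_print(f"step {step}: batch received")
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        # .item() forces execution of the lazily built XLA graph; if the
        # process sleeps here, the hang is in graph compilation/execution.
        xm.master_print(f"step {step}: loss = {loss.item():.4f}")
```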

To Reproduce

Steps to reproduce the behavior:

  1. Rerun the kernel https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch
  2. Observe that the process hangs and no training happens

Expected behavior

Training steps run and make progress.

Environment

Standard Kaggle TPU environment

Additional context

dlibenzi commented 4 years ago

I suggest you take a look at this thread: https://github.com/pytorch/xla/issues/1870. There have been changes to make it fit the host memory budget, as well as to properly bucket the inputs.
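(For context, "bucketing the inputs" means padding each batch up to one of a few fixed lengths, so XLA sees only a small, fixed set of input shapes and does not recompile the graph for every new sequence length. A minimal sketch of that idea; the bucket sizes, pad id, and collate function below are illustrative, not code from the kernel or from #1870:)

```python
# Sketch of input-length bucketing for a torch DataLoader (illustrative).
import torch

BUCKETS = [64, 128, 256, 512]  # assumed bucket lengths
PAD_TOKEN_ID = 1               # assumed pad id (XLM-R uses 1)

def bucketed_collate(batch):
    """batch: list of (token_ids, label) pairs with variable-length token_ids."""
    max_len = max(len(ids) for ids, _ in batch)
    # Smallest bucket that fits the longest sequence; truncate to the
    # largest bucket if the sequence is longer than all of them.
    target = next((b for b in BUCKETS if b >= max_len), BUCKETS[-1])
    padded = torch.full((len(batch), target), PAD_TOKEN_ID, dtype=torch.long)
    for i, (ids, _) in enumerate(batch):
        ids = ids[:target]
        padded[i, : len(ids)] = torch.as_tensor(ids, dtype=torch.long)
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    return padded, labels
```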

shonenkov commented 4 years ago

Thanks a lot!

I was using the 20200426 version, but the Kaggle community advised using the 20200420 version instead. That fixed it for me.
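(For reference, a sketch of how a specific nightly could be pinned from a notebook cell at the time, using the env-setup script; the script path and `--version` flag are taken from the 2020-era setup instructions and may have changed since:)

```python
# Sketch: pin the 20200420 torch_xla nightly from a Kaggle notebook cell.
# Script URL and flag are assumptions based on the 2020-era instructions.
import subprocess
import urllib.request

SETUP_URL = ("https://raw.githubusercontent.com/pytorch/xla/"
             "master/contrib/scripts/env-setup.py")
urllib.request.urlretrieve(SETUP_URL, "pytorch-xla-env-setup.py")
subprocess.run(
    ["python", "pytorch-xla-env-setup.py", "--version", "20200420"],
    check=True,
)
```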