Open jotron opened 2 years ago
Thanks @jotron! I was able to reproduce this error. Could you please use 1.10 for now? We'll fix this ASAP and let you know!
Thank you for taking care of it @miaoshasha! I wouldn't rule out the bug happening in 1.10 as well, though. I can't rerun it right now, but I remember it happening to me in 1.10 too, only after 50 epochs. Any progress?
Debugging is in progress. I will report back when I have a better idea.
Thanks for reporting this bug. I will work on it later in the week.
🐛 Bug
I have a GCE TPU-VM setup with PyTorch 1.11. Running the provided test_train_mp_imagenet.py on the ImageNet dataset with 8 cores leads to an OOM error after a nondeterministic number of epochs (e.g. 7 epochs).
To Reproduce
I am using a TPU-VM on GCE with software version tpu-vm-pt-1.11. I have attached an SSD disk with ImageNet data.
Steps to reproduce the behavior:
export XRT_TPU_CONFIG='localservice;0;localhost:51011'
git clone https://github.com/pytorch/xla.git
pip install tensorboardX
Expected behavior
I expect the script to run for all epochs until completion. That is, logs of the form:
until the 90th epoch. I read logs like this until epoch 7.
Error
The full error log is in logfull.txt. Additionally, I ran:
The archive produced can be found on Google Drive.
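Since the OOM appears only after several epochs, one way to narrow it down is to record host memory at the end of each epoch and check whether it grows steadily. The following is a minimal sketch, not part of test_train_mp_imagenet.py: it assumes the OOM is on the host side (the report does not say whether host or TPU memory runs out), and the helper names `peak_rss_mib` and `log_epoch_memory` are hypothetical. It uses only the Python standard library.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def log_epoch_memory(epoch, history):
    """Record (epoch, peak MiB) in history and print one log line.

    Hypothetical helper: it would be called once per epoch from the
    training loop, in each spawned process.
    """
    mib = peak_rss_mib()
    history.append((epoch, mib))
    print(f"epoch {epoch}: peak host RSS {mib:.1f} MiB")
    return mib
```

Calling `log_epoch_memory(epoch, history)` at the end of every epoch gives one line per epoch per process. Because the value is a peak, it never decreases, so a number that keeps climbing long after the first epoch would hint at a host-side leak, while a flat number would point toward device memory instead.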