Looking at the traceback, the problem is in this line:
running_corrects += torch.sum(preds == labels.data)
It seems the dimensions of preds and labels.data don't match.
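For illustration, a common cause of this is comparing [batch_size, num_classes] logits directly against [batch_size] labels; a minimal sketch of the shape-consistent comparison (shapes and variable names are made up, not taken from the notebook):

```python
import torch

# Made-up shapes: logits are [batch_size, num_classes], labels are [batch_size].
outputs = torch.randn(8, 102)
labels = torch.randint(0, 102, (8,))

# Take the argmax over the class dimension so preds is also [batch_size];
# comparing raw [8, 102] logits against [8] labels leads to shape errors.
_, preds = torch.max(outputs, dim=1)
running_corrects = torch.sum(preds == labels.data).item()
```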
@taylanbil I modified the code, but it starts to break in the 2nd epoch and I am not able to trace the error.
https://www.kaggle.com/soumochatterjee/cutmix-flower-classification
same error or different?
Now the error is not coming, but training stops on the second iteration. Please help, it's difficult to debug.
Sorry, I don't understand; the notebook shows it trained for 20+ epochs but the accuracy is 0.
Reading the code, I believe the culprit is the usage of pl.ParallelLoader instead of pl.MpDeviceLoader. If using pl.ParallelLoader, one needs to call per_device_loader() for every epoch, otherwise the iterator is empty. See https://github.com/pytorch/xla/commit/54f3e16a72be3f5e706aad0022df37e3a69e91c4. Could you try replacing that usage and see if it works?
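For illustration, the suggested replacement would look roughly like this (a sketch with a placeholder DataLoader standing in for the real flower-image loader; MpDeviceLoader requires a recent enough torch_xla):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
num_epochs = 25

# Placeholder dataset/loader; in the real notebook this is the flower DataLoader.
dataset = torch.utils.data.TensorDataset(
    torch.zeros(1024, 3, 224, 224), torch.zeros(1024, dtype=torch.int64))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=256)

# MpDeviceLoader wraps the host DataLoader once; the same wrapper can be
# iterated in every epoch (it builds a fresh per-device iterator internally).
mp_loader = pl.MpDeviceLoader(train_loader, device)

for epoch in range(num_epochs):
    for images, labels in mp_loader:
        pass  # forward / backward / xm.optimizer_step(optimizer) go here
```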
@taylanbil I am using per_device_loader() with pl.ParallelLoader only; it's there in my notebook. But when I use pl.MpDeviceLoader in place of pl.ParallelLoader, it gives me an import error: no MpDeviceLoader found in torch_xla.distributed.parallel_loader. Notebook link
.per_device_loader() needs to be called at every epoch. If I am not mistaken, the code calls it only once but runs 20+ epochs; there should be 20+ calls to it.
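A sketch of what that looks like when staying with pl.ParallelLoader (placeholder loader and loop body; the point is only that the wrapping and per_device_loader() call happen inside the epoch loop):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
num_epochs = 25

# Placeholder DataLoader standing in for the real flower-image loader.
dataset = torch.utils.data.TensorDataset(
    torch.zeros(1024, 3, 224, 224), torch.zeros(1024, dtype=torch.int64))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=256)

for epoch in range(num_epochs):
    # Wrap and call per_device_loader() inside the epoch loop: the iterator it
    # returns is exhausted after one pass, so a single call made before the
    # loop leaves epochs 2+ with no batches.
    para_loader = pl.ParallelLoader(train_loader, [device])
    for images, labels in para_loader.per_device_loader(device):
        pass  # training step goes here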
Thanks @taylanbil, I changed the code a little bit and it started working. Really a big thanks for the help; I am very new to using PyTorch with XLA. But the Kaggle kernel is taking a lot of time training 51000 images with a batch size of 256 at 224 x 224: more than an hour for 25 epochs. Is that the expected speed? Notebook Link
Hi @soumochatterjee, could you follow the instructions in https://github.com/pytorch/xla/blob/master/contrib/colab/issue-report.ipynb and do a debug run? It would be a lot easier to debug the speed issue if we have the debugging output.
How big is the dataset? These sound like pretty large images and pretty large batch size and Kaggle machines have very low RAM and CPU cores, so it's likely that the VM is your bottleneck since it needs to feed images to the TPU. Like Jack said, it's good to have those debug metrics.
Another way to see if the image loading is the problem is to try running the model with generated data (fake data) rather than loading real images from disk. See here for an example
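A sketch of the fake-data approach, adapted to the numbers in this thread (batch size 256, 224 x 224 images, 51000 samples), assuming torch_xla.utils.utils.SampleGenerator is available in your torch_xla version, as used in the pytorch/xla example scripts:

```python
import torch
import torch_xla.utils.utils as xu

batch_size = 256          # same settings as in the thread
train_dataset_len = 51000

# Synthetic (image, label) batches generated in memory: no disk reads, no JPEG
# decoding, no augmentation. If epoch time drops a lot with this loader, the
# input pipeline (Kaggle VM CPU/disk) is the bottleneck, not the TPU.
train_loader = xu.SampleGenerator(
    data=(torch.zeros(batch_size, 3, 224, 224),
          torch.zeros(batch_size, dtype=torch.int64)),
    sample_count=train_dataset_len // batch_size)
```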
I am training 51000 images, sized 224 x 224, with a batch size of 256, for 25 epochs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Issue description
Facing the error Exception in device=TPU:0: torch_xla/csrc/helpers.cpp:510 while training on a Kaggle TPU.
Code example
Here's the link to my notebook https://www.kaggle.com/soumochatterjee/cutmix-flower-classification