Looking at the traceback, the problem is in this line:
running_corrects += torch.sum(preds == labels.data)
It seems the dimensions of preds and labels.data don't match.
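For illustration, a common cause of this is comparing [batch_size, num_classes] logits directly against [batch_size] labels; a minimal sketch of the shape-consistent comparison (shapes and variable names are made up, not taken from the notebook):

```python
import torch

# Made-up shapes: logits are [batch_size, num_classes], labels are [batch_size].
outputs = torch.randn(8, 102)
labels = torch.randint(0, 102, (8,))

# Take the argmax over the class dimension so preds is also [batch_size];
# comparing raw [8, 102] logits against [8] labels leads to shape errors.
_, preds = torch.max(outputs, dim=1)
running_corrects = torch.sum(preds == labels.data).item()
```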
@taylanbil I modified the code, but it starts to break in the 2nd epoch and I am not able to trace the error.
https://www.kaggle.com/soumochatterjee/cutmix-flower-classification
same error or different?
Now the error is not coming, but training stops on the second iteration. Please help, it's difficult to debug.
Sorry, I don't understand; the notebook shows it trained for 20+ epochs but the accuracy is 0.
Reading the code, I believe the culprit is the usage of pl.ParallelLoader instead of pl.MpDeviceLoader. If using pl.ParallelLoader, one needs to call per_device_loader() for every epoch, otherwise the iterator is empty. See https://github.com/pytorch/xla/commit/54f3e16a72be3f5e706aad0022df37e3a69e91c4. Could you try replacing that usage and see if it works?
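For illustration, the suggested replacement would look roughly like this (a sketch with a placeholder DataLoader standing in for the real flower-image loader; MpDeviceLoader requires a recent enough torch_xla):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
num_epochs = 25

# Placeholder dataset/loader; in the real notebook this is the flower DataLoader.
dataset = torch.utils.data.TensorDataset(
    torch.zeros(1024, 3, 224, 224), torch.zeros(1024, dtype=torch.int64))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=256)

# MpDeviceLoader wraps the host DataLoader once; the same wrapper can be
# iterated in every epoch (it builds a fresh per-device iterator internally).
mp_loader = pl.MpDeviceLoader(train_loader, device)

for epoch in range(num_epochs):
    for images, labels in mp_loader:
        pass  # forward / backward / xm.optimizer_step(optimizer) go here
```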
@taylanbil I am using per_device_loader() with pl.ParallelLoader only; it's there in my notebook. But when I use pl.MpDeviceLoader in place of pl.ParallelLoader, it gives me an import error: no MpDeviceLoader found in torch_xla.distributed.parallel_loader. Notebook link
.per_device_loader() needs to be called at every epoch. If I am not mistaken, the code calls it only once but runs 20+ epochs; there should be 20+ calls to it.
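A sketch of what that looks like when staying with pl.ParallelLoader (placeholder loader and loop body; the point is only that the wrapping and per_device_loader() call happen inside the epoch loop):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
num_epochs = 25

# Placeholder DataLoader standing in for the real flower-image loader.
dataset = torch.utils.data.TensorDataset(
    torch.zeros(1024, 3, 224, 224), torch.zeros(1024, dtype=torch.int64))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=256)

for epoch in range(num_epochs):
    # Wrap and call per_device_loader() inside the epoch loop: the iterator it
    # returns is exhausted after one pass, so a single call made before the
    # loop leaves epochs 2+ with no batches.
    para_loader = pl.ParallelLoader(train_loader, [device])
    for images, labels in para_loader.per_device_loader(device):
        pass  # training step goes here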
Thanks @taylanbil, I changed the code a little bit and it started working. Really a big thanks for the help; I am very new to using PyTorch with XLA. But the Kaggle kernel is taking a lot of time training 51000 images with a batch size of 256 at 224 x 224: more than an hour for 25 epochs. Is that the expected speed? Notebook Link
Hi @soumochatterjee, could you follow the instructions in https://github.com/pytorch/xla/blob/master/contrib/colab/issue-report.ipynb and do a debug run? It would be a lot easier to debug the speed issue if we have the debugging output.
How big is the dataset? These sound like pretty large images and pretty large batch size and Kaggle machines have very low RAM and CPU cores, so it's likely that the VM is your bottleneck since it needs to feed images to the TPU. Like Jack said, it's good to have those debug metrics.
Another way to see if the image loading is the problem is to try running the model with generated data (fake data) rather than loading real images from disk. See here for an example
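A sketch of the fake-data approach, adapted to the numbers in this thread (batch size 256, 224 x 224 images, 51000 samples), assuming torch_xla.utils.utils.SampleGenerator is available in your torch_xla version, as used in the pytorch/xla example scripts:

```python
import torch
import torch_xla.utils.utils as xu

batch_size = 256          # same settings as in the thread
train_dataset_len = 51000

# Synthetic (image, label) batches generated in memory: no disk reads, no JPEG
# decoding, no augmentation. If epoch time drops a lot with this loader, the
# input pipeline (Kaggle VM CPU/disk) is the bottleneck, not the TPU.
train_loader = xu.SampleGenerator(
    data=(torch.zeros(batch_size, 3, 224, 224),
          torch.zeros(batch_size, dtype=torch.int64)),
    sample_count=train_dataset_len // batch_size)
```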
I am training 51000 images, sized 224 x 224, with a batch size of 256, for 25 epochs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Issue description
Facing the error Exception in device=TPU:0: torch_xla/csrc/helpers.cpp:510 while training on a Kaggle TPU.
Code example
Here's the link to my notebook https://www.kaggle.com/soumochatterjee/cutmix-flower-classification