pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

TPU Freezing on loss.backward() in the same epoch. #7101

Open axhero7 opened 1 month ago

axhero7 commented 1 month ago

🐛 Bug

I am fine-tuning a Transformers model (OpenAI Whisper) and using PyTorch for the training. I am training on a Google Cloud TPU v4, and it freezes with zero errors at iteration 34. The model has been able to train for one full epoch (extremely slowly) on a Google Colab GPU, but I want to train it for more epochs with a larger batch size and learning rate using TPUs.

I've followed the steps in #3203, #2749, and #1562, but none of those fixes resolved my issue.

To Reproduce

Link to the GitHub repository

Environment

Let me know what other information I can provide. Thank you for helping!

axhero7 commented 1 month ago

Just adding: I wrote some print statements and was able to figure out that the hang happens exactly on the loss.backward() call, which could be helpful.
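
The instrumentation is shaped roughly like this (a minimal sketch with a toy model and random data standing in for the actual Whisper fine-tuning loop in the repo, just to show where the prints sit around the backward call):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Toy stand-ins for the real Whisper model and dataloader.
model = nn.Linear(80, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    features = torch.randn(8, 80, device=device)
    labels = torch.randint(0, 10, (8,), device=device)

    print(f"step {step}: forward", flush=True)
    loss = nn.functional.cross_entropy(model(features), labels)

    print(f"step {step}: calling loss.backward()", flush=True)
    loss.backward()  # in my real run, this call never returns at iteration 34
    print(f"step {step}: backward returned", flush=True)

    # barrier=True marks the step so the pending graph actually executes
    xm.optimizer_step(optimizer, barrier=True)
    optimizer.zero_grad()
```

The last line that prints before the freeze is the "calling loss.backward()" one at iteration 34.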

JackCaoG commented 1 month ago

@wonjoolee95, can you take a look since you are on call this week?

wonjoolee95 commented 3 weeks ago

Apologies for the late reply; I was away last week.

Hmm, nothing obviously wrong stands out to me from looking at the code. @axhero7, can you dump the IR and HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md? I expect the full dump to be large, so you can dump the IR/HLO right before the code starts to hang at loss.backward().
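
For the targeted dump, something along these lines should work (a rough sketch based on TROUBLESHOOTING.md; `step` and `loss` are your loop variables, and the step number and /tmp paths are just examples to adjust for your run):

```python
import torch_xla
import torch_xla.debug.metrics as met

# Drop this right before the loss.backward() call that hangs
# (iteration 34 in your run); `loss` is the loss tensor on the XLA device.
if step == 34:
    with open("/tmp/ir_dump.txt", "w") as f:
        f.write(torch_xla._XLAC._get_xla_tensors_text([loss]))
    with open("/tmp/hlo_dump.txt", "w") as f:
        f.write(torch_xla._XLAC._get_xla_tensors_hlo([loss]))
    # The metrics/counters report can also hint at excessive recompilation.
    print(met.metrics_report(), flush=True)

loss.backward()
```

Alternatively, running the whole script with `XLA_SAVE_TENSORS_FMT=hlo XLA_SAVE_TENSORS_FILE=/tmp/save.hlo` set saves graphs for every step, but as mentioned that dump can get very large.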