Open axhero7 opened 1 month ago
Just adding: I wrote some print statements and was able to figure out that the hang happens specifically on the loss.backward() call, which could be helpful.
@wonjoolee95, can you take a look since you are on call this week?
Apologies for the late reply; I was away last week.
Hmm, nothing obviously wrong stands out to me looking at the code. @axhero7, can you dump the IR and HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md? I expect the dump to be large, so you can dump the IR/HLO right before the code starts to hang at loss.backward().
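For reference, the TROUBLESHOOTING.md doc linked above describes environment variables that make PyTorch/XLA save the IR/HLO graphs to a file as the program runs. A minimal sketch of how that might look, assuming your training entry point is a hypothetical `train.py` (the variable names come from the troubleshooting doc; the file path is just an example):

```shell
# Save the lazy-tensor graphs to a file each time a graph is executed.
export XLA_SAVE_TENSORS_FILE=/tmp/xla_debug.txt
# Format of the dump: "text" for the Python-level IR, "hlo" for HLO, "dot" for Graphviz.
export XLA_SAVE_TENSORS_FMT=hlo
# Annotate the IR/HLO with Python source location metadata (slower, but easier to read).
export XLA_IR_DEBUG=1
export XLA_HLO_DEBUG=1

python train.py
```

Since the run hangs at iteration 34, the tail of the dump file should correspond to the graph being built right before the hang, which is the part worth attaching here.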
🐛 Bug
I am fine-tuning a Transformers model (OpenAI Whisper) and using PyTorch for the training. I am training on a Google Cloud TPU v4, and training freezes with zero errors at iteration 34. The model has been able to train for one full epoch (extremely slowly) on a Google Colab GPU, but I want to train it for more epochs, with a larger batch size and learning rate, on TPUs.
I've followed the steps in #3203, #2749, and #1562, but none of those fixes resolved my issue.
To Reproduce
Link to the GitHub
Environment
Let me know what information I can provide; thank you for helping!