Open seanswyi opened 6 months ago
After doing some research, it seems like xm.optimizer_step(optimizer) is only meant to be used in multi-device settings, and if I only want to use one device (as I'm doing now) then I have to use xm.mark_step() instead.
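To make sure I understand the distinction, here's a rough sketch of the two patterns (the train_step wrapper and its arguments are just for illustration, not my actual code):

```python
import torch_xla.core.xla_model as xm

def train_step(model, batch, optimizer, multi_device=False):
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    if multi_device:
        # Reduces gradients across replicas and calls optimizer.step();
        # pass barrier=True if you also want it to mark the step.
        xm.optimizer_step(optimizer)
    else:
        # Single device: plain optimizer.step(), then explicitly cut the
        # lazy-tensor graph so XLA compiles and executes this step.
        optimizer.step()
        xm.mark_step()
    return loss
```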
I'm still curious why there's such a huge difference in memory usage, though.
Can you follow https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#pytorchxla--dynamo-debugging-tool to do a quick debug run with PT_XLA_DEBUG=1? What we expect is that the HLO only captures a single step of your training loop. If you tried adding mark_step after optimizer.step but still see this error and PT_XLA_DEBUG=1 isn't too helpful, you can try to dump the IR or HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#common-debugging-environment-variables-combinations and share it with us.
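Concretely, something like this at the very top of your training script should produce dumps you can share (the file path is just an example; setting the same variables on the command line works as well):

```python
# Set the documented torch_xla debugging env vars before torch_xla is imported.
import os

os.environ["PT_XLA_DEBUG"] = "1"                        # print analysis of what triggers compilation/execution
os.environ["XLA_IR_DEBUG"] = "1"                        # record Python source info when IR nodes are created
os.environ["XLA_HLO_DEBUG"] = "1"                       # propagate that source info into the HLO
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"              # dump graphs as HLO ("text" dumps the IR instead)
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/save1.hlo"  # file you can attach to this issue

import torch_xla.core.xla_model as xm  # import after the env vars are set
```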
❓ Questions and Help
I'm trying to run a simple text classification task using HuggingFace Transformers and BERT. My background's in NLP, but I wanted to work through a simple tutorial to get used to using TPUs rather than GPUs. The tutorial is this: Fine-tune a pretrained model.
The code can fit into one script:
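Roughly, it looks like this (a trimmed sketch rather than my exact code; the yelp_review_full dataset, num_labels=5, and the 5e-5 learning rate follow the tutorial, and I'm training on a single TPU core with a batch size of 6):

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # single TPU core

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5
).to(device)

def tokenize(examples):
    # Pad to a fixed length so XLA sees static shapes and doesn't recompile.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = load_dataset("yelp_review_full", split="train[:1%]").map(tokenize, batched=True)
dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
loader = DataLoader(dataset, batch_size=6, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for batch in loader:
    optimizer.zero_grad()
    outputs = model(
        input_ids=batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device),
        labels=batch["label"].to(device),
    )
    outputs.loss.backward()
    optimizer.step()
    xm.mark_step()  # cut the lazy-tensor graph so XLA compiles/runs one step at a time
```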
The main part of the error message looks like this:
What I don't understand is that the bert-base-cased model with a batch size of 6 usually doesn't even take up 10GB of memory on a GPU. Am I doing something wrong w.r.t. changing my code for TPU usage?