huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ConvBertForQuestionAnswering hangs on 8x TPU cores using PyTorch / XLA #14273

Closed hlynurd closed 2 years ago

hlynurd commented 2 years ago

Environment info

Python 3.7.3
torch==1.9.1
torch-xla @ https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
transformers==4.12.3

Models:

Information

Hi all,

I would like to use ConvBertForQuestionAnswering on 8 TPU cores using PyTorch/XLA. It works for me on a single core, and swapping the ConvBert model for Electra or RoBERTa works fine on both 1 and 8 cores.

import torch_xla.distributed.xla_multiprocessing as xmp
...
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')

This hangs for me when nprocs=8 but not when nprocs=1. It stops at a forward pass of the model:

self.backbone = ConvBertForQuestionAnswering.from_pretrained(model_path, config=config)
outputs = self.backbone.convbert(input_ids, attention_mask, token_type_ids)
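One generic way to see where each process is stuck is Python's standard `faulthandler` module, which can dump all thread stacks if a call has not returned after a timeout. The sketch below is not from the script in this thread: `run_with_hang_watchdog` is a hypothetical helper name, and the 300-second default is an arbitrary assumption. `faulthandler` reports Python-level frames only, so a hang inside native XLA code would show up as the Python line that called into it.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, *args, timeout_s=300.0, file=None, **kwargs):
    """Call fn(*args, **kwargs) with a stack-dump watchdog armed.

    Hypothetical helper: if fn has not returned after timeout_s seconds,
    the tracebacks of all threads are written to `file` (stderr by default).
    The process is not killed, so the dump can be repeated by calling again.
    """
    target = file if file is not None else sys.stderr
    # Arm the watchdog; exit=False means the process keeps running after the dump.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        return fn(*args, **kwargs)
    finally:
        # Disarm the watchdog whether fn returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

Under these assumptions, wrapping the forward pass would look like `outputs = run_with_hang_watchdog(self.backbone.convbert, input_ids, attention_mask, token_type_ids)`, and each of the 8 processes would print its own stack if it stalls.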

LysandreJik commented 2 years ago

Hello! Could you show the script that you're using? Have you tried using accelerate or the Trainer, and does that fix the issue? Thank you, cc @sgugger

hlynurd commented 2 years ago

Here's my script: https://gist.github.com/hlynurd/d9b43edbb1b318e666ff875258130bb5

I get the same problem if I adapt it for accelerate.

sgugger commented 2 years ago

And is the problem specific to ConvBert or do you have the same issue for all other models?

hlynurd commented 2 years ago

I only get the problem for ConvBert and only on 8 TPU cores. I have tried Electra and RoBERTa and both work well for 1 and 8 cores.
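A comparison like this across models and core counts can be scripted so that a hung run is detected and killed automatically. The sketch below is only an illustration using the standard `multiprocessing` module: `finishes_within` is a hypothetical helper, the 600-second default is an arbitrary assumption, and the "fork" start method (the same one passed to `xmp.spawn` above) is assumed to be available, as it is on Linux.

```python
import multiprocessing as mp


def finishes_within(target, args=(), timeout_s=600.0):
    """Run target(*args) in a child process and report whether it finished.

    Hypothetical helper: returns True if the child exits with code 0 within
    timeout_s seconds, and False if it fails or is still running by then
    (in which case it is terminated).
    """
    ctx = mp.get_context("fork")  # assumes a platform that supports fork
    proc = ctx.Process(target=target, args=args)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        # Still running after the timeout: treat it as a hang and clean up.
        proc.terminate()
        proc.join()
        return False
    return proc.exitcode == 0
```

Here `target` would be a function that launches one configuration, for example one model class with one `nprocs` value. Note that terminating the parent of a multi-process TPU run may not clean up its own children, so this is a rough probe rather than a robust harness.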

sgugger commented 2 years ago

Sounds like a problem specific to ConvBERT then. I'm not sure anyone on the team will have time to investigate in depth in the coming weeks, but if you manage to find the cause, please let us know.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

hlynurd commented 2 years ago

This issue persists for me.
