run_mlm.py not utilizing TPU

Environment info

transformers version: 4.3.2, and also the latest version forked from GitHub
Platform: Linux (Colab env)
Python version: 3.6
PyTorch version (GPU?): XLA 1.7
Tensorflow version (GPU?): None
Using GPU in script?: No
Using distributed or parallel set-up in script?: Colab TPU with xla_spawn.py

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): DistilBert

The problem arises when using:

[ ] the official example scripts: (give details below)
[X] my own modified scripts: (give details below)

The task I am working on is:

[ ] an official GLUE/SQuAD task: (give the name)
[X] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

!python /content/transformers/examples/xla_spawn.py --num_cores 8 /content/transformers/examples/language-modeling/run_mlm.py \
--model_type distilbert \
--config_name /content/TokenizerFiles \
--tokenizer_name /content/TokenizerFiles \
--train_file Files/file_aa.txt \
--mlm_probability 0.15 \
--output_dir "/content/TrainingCheckpoints" \
--do_train --per_device_train_batch_size 32 \
--save_steps 500 --disable_tqdm False \
--line_by_line True --max_seq_length 150 \
--pad_to_max_length False \
--cache_dir /content/cache_dir \
--save_total_limit 2

My tokenizer config and model config files (tokenizer_config.json and config.json) both contain only {model_type: "distilbert"}. They are in the TokenizerFiles folder alongside my vocab.txt.

The output I get is

WARNING:root:TPU has started up successfully with version pytorch-1.7
2021-02-15 14:40:37.816883: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
WARNING:root:TPU has started up successfully with version pytorch-1.7
2021-02-15 14:40:57.239070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:57.283838: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:57.446951: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:57.470266: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:57.473336: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:57.686903: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:57.863940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 14:40:58.555214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
WARNING:run_mlm:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
INFO:run_mlm:Training/evaluation parameters TrainingArguments(output_dir=/content/TrainingCheckpoints, overwrite_output_dir=False, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Feb15_14-41-21_34a4105ebd5a, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=2, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=8, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/content/TrainingCheckpoints, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, _n_gpu=0)
Using custom data configuration default
Downloading and preparing dataset text/default-e939092a7eff14a8 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab...
02/15/2021 14:41:22 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab. Subsequent calls will reuse this data.
[INFO|configuration_utils.py:447] 2021-02-15 14:41:22,465 >> loading configuration file /content/TokenizerFiles/config.json
[INFO|configuration_utils.py:485] 2021-02-15 14:41:22,466 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}

[INFO|configuration_utils.py:447] 2021-02-15 14:41:22,467 >> loading configuration file /content/TokenizerFiles/config.json
[INFO|configuration_utils.py:485] 2021-02-15 14:41:22,476 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}

[INFO|tokenization_utils_base.py:1688] 2021-02-15 14:41:22,476 >> Model name '/content/TokenizerFiles' not found in model shortcut name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-cased, distilbert-base-cased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). Assuming '/content/TokenizerFiles' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1721] 2021-02-15 14:41:22,477 >> Didn't find file /content/TokenizerFiles/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-02-15 14:41:22,478 >> Didn't find file /content/TokenizerFiles/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-02-15 14:41:22,478 >> Didn't find file /content/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1784] 2021-02-15 14:41:22,479 >> loading file /content/TokenizerFiles/vocab.txt
[INFO|tokenization_utils_base.py:1784] 2021-02-15 14:41:22,479 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-15 14:41:22,480 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-15 14:41:22,480 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-02-15 14:41:22,480 >> loading file /content/TokenizerFiles/tokenizer_config.json
INFO:run_mlm:Training new model from scratch
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
02/15/2021 14:41:22 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
02/15/2021 14:41:23 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
02/15/2021 14:41:23 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
02/15/2021 14:41:23 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
02/15/2021 14:41:23 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
02/15/2021 14:41:24 - WARNING - run_mlm -   Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Using custom data configuration default
Reusing dataset text (/root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
100% 2/2 [00:01<00:00,  1.72ba/s]
100% 2/2 [00:01<00:00,  1.65ba/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-0028d6bfc2eb6117.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-0028d6bfc2eb6117.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-0028d6bfc2eb6117.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-0028d6bfc2eb6117.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-0028d6bfc2eb6117.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/text/default-e939092a7eff14a8/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-0028d6bfc2eb6117.arrow
[INFO|trainer.py:432] 2021-02-15 14:41:59,875 >> The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: special_tokens_mask.
[INFO|trainer.py:837] 2021-02-15 14:41:59,879 >> ***** Running training *****
[INFO|trainer.py:838] 2021-02-15 14:41:59,879 >>   Num examples = 2000
[INFO|trainer.py:839] 2021-02-15 14:41:59,879 >>   Num Epochs = 3
[INFO|trainer.py:840] 2021-02-15 14:41:59,879 >>   Instantaneous batch size per device = 32
[INFO|trainer.py:841] 2021-02-15 14:41:59,879 >>   Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:842] 2021-02-15 14:41:59,879 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-02-15 14:41:59,879 >>   Total optimization steps = 24

 17% 4/24 [03:56<17:13, 51.67s/it]  # <------------------- HERE ------------------------>
Traceback (most recent call last):

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

The file used here is only for testing and contains 2000 lines of text in total. It looks as though the training is taking place on the CPU instead of the TPU. I installed XLA with:

!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.7-cp36-cp36m-linux_x86_64.whl

I ran the same script a couple of days ago and it worked fine, so I don't know what is wrong now. At that time I had saved the tokenizer using .save(), but due to some recent changes in the library that no longer works, so I saved it using save_model() instead, which works fine. Could the slowdown be caused by that change?
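One way to test the CPU suspicion is to print the PyTorch/XLA metrics report from inside a training process. This is only a sketch. It assumes torch_xla is importable in that process, the helper name xla_metrics_report is hypothetical, and the call site (for example right after trainer.train() in a local copy of run_mlm.py) is my assumption rather than something the script does on its own.

```python
def xla_metrics_report():
    """Return the PyTorch/XLA metrics report as a string.

    Returns None when torch_xla cannot be imported, so the helper is
    safe to call in an environment without XLA support.
    """
    try:
        import torch_xla.debug.metrics as met
    except ImportError:
        return None
    return met.metrics_report()


report = xla_metrics_report()
if report is None:
    print("torch_xla is not available in this environment")
else:
    # Things worth looking for in the report:
    #  - counters whose names start with "aten::" (ops not lowered to XLA)
    #  - the number of CompileTime samples (repeated graph compilation)
    print(report)
```

In that report, counters whose names begin with aten:: generally indicate operations that fell back to CPU, and a CompileTime sample count that keeps growing from step to step suggests the graph is being recompiled repeatedly. Either would be consistent with the slow steps shown above, but the report is needed to tell which one, if any, applies.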

Expected behavior

The training should be faster. The last time I ran run_mlm.py, I got almost 3 iterations per second, whereas the run above takes about 51.67 seconds per iteration.
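To put numbers on the gap, the figures from the log can be combined in a short calculation. This is a rough sketch: the function name training_time_summary is made up for illustration, and 3.0 steps per second stands in for the "almost 3 iterations per second" from the earlier run, so the result is an approximation.

```python
import math


def training_time_summary(num_examples, per_device_batch, num_cores, epochs,
                          observed_sec_per_step, previous_steps_per_sec):
    """Reproduce the step count from the log and compare the two speeds."""
    total_batch = per_device_batch * num_cores               # 32 * 8 = 256
    # dataloader_drop_last=False in the log, so the last partial batch counts.
    steps_per_epoch = math.ceil(num_examples / total_batch)  # ceil(2000 / 256) = 8
    total_steps = steps_per_epoch * epochs                   # 8 * 3 = 24
    return {
        "total_batch": total_batch,
        "total_steps": total_steps,
        "expected_seconds": total_steps / previous_steps_per_sec,
        "observed_seconds": total_steps * observed_sec_per_step,
        "slowdown": observed_sec_per_step * previous_steps_per_sec,
    }


summary = training_time_summary(
    num_examples=2000,            # "Num examples = 2000"
    per_device_batch=32,          # --per_device_train_batch_size 32
    num_cores=8,                  # --num_cores 8
    epochs=3,                     # "Num Epochs = 3"
    observed_sec_per_step=51.67,  # from the progress bar: 51.67s/it
    previous_steps_per_sec=3.0,   # approximation of "almost 3 iterations per second"
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The step count matches the "Total optimization steps = 24" line in the log. At the earlier speed the whole run would take a few seconds, while at the observed speed it takes roughly twenty minutes, a slowdown of around 150x.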
