DarshanDeshpande commented 3 years ago

Model I am using (Bert, XLNet ...): DistilBert

!python /content/transformers/examples/ --num_cores 8 /content/transformers/examples/language-modeling/ \
--model_type distilbert \
--config_name /content/TokenizerFiles \
--tokenizer_name /content/TokenizerFiles \
--train_file Files/file_aa.txt \
--mlm_probability 0.15 \
--output_dir "/content/TrainingCheckpoints" \
--do_train --per_device_train_batch_size 32 \
--save_steps 500 --disable_tqdm False \
--line_by_line True --max_seq_length 150 \
--pad_to_max_length False \
--cache_dir /content/cache_dir \
--save_total_limit 2

My tokenizer and config files both contain only {model_type: "distilbert"} and are in the TokenizerFiles folder along with my vocab.txt.

The output I get is

WARNING:run_mlm:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
INFO:run_mlm:Training/evaluation parameters TrainingArguments(output_dir=/content/TrainingCheckpoints, overwrite_output_dir=False, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Feb15_14-41-21_34a4105ebd5a, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=2, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=8, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/content/TrainingCheckpoints, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, _n_gpu=0)
[INFO|] 2021-02-15 14:41:22,465 >> loading configuration file /content/TokenizerFiles/config.json
[INFO|] 2021-02-15 14:41:22,466 >> Model config DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}

INFO:run_mlm:Training new model from scratch
[INFO|] 2021-02-15 14:41:59,875 >> The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: special_tokens_mask.
[INFO|] 2021-02-15 14:41:59,879 >> ***** Running training *****
[INFO|] 2021-02-15 14:41:59,879 >>   Num examples = 2000
[INFO|] 2021-02-15 14:41:59,879 >>   Num Epochs = 3
[INFO|] 2021-02-15 14:41:59,879 >>   Instantaneous batch size per device = 32
[INFO|] 2021-02-15 14:41:59,879 >>   Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|] 2021-02-15 14:41:59,879 >>   Gradient Accumulation steps = 1
[INFO|] 2021-02-15 14:41:59,879 >>   Total optimization steps = 24

 17% 4/24 [03:56<17:13, 51.67s/it]  # <------------------- HERE ------------------------>
The file used here is only for testing and contains 2000 lines of text. It almost seems as if training is running on the CPU instead of the TPU. I installed XLA using !pip install cloud-tpu-client==0.10

I ran the same script a couple of days ago and it worked fine, so I don't know what is wrong now. At that time I had saved the tokenizer using .save(), but due to some recent changes in the library that no longer works, so I saved it using save_model() instead, which works fine. Could that be causing this issue?

Expected behavior

The training should be faster. The last time I ran it, I got almost 3 iterations per second.

sgugger commented 3 years ago

--pad_to_max_length False is the reason your training is so slow: it creates batches with different sequence lengths, but TPUs need fixed shapes to be efficient.

There was a bug in our argument parser that ignored bool settings like this one, which may be why you are seeing the slowdown now and not before (because of that bug, pad_to_max_length=True was being applied even though you passed False). If you remove that option, training should be faster.
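
To illustrate the point about fixed shapes, here is a minimal sketch in plain Python (no transformers or torch_xla needed) that counts how many distinct batch shapes each padding strategy produces. The helper names `pad_batch` and `batch_shapes` and the synthetic line lengths are made up for this example; only the batch size, maximum length, and pad id are taken from the command and config above. As far as I understand it, XLA compiles a separate graph for each new input shape, so more distinct shapes generally means more recompilation on a TPU.

```python
MAX_SEQ_LENGTH = 150  # same value as --max_seq_length in the command above
BATCH_SIZE = 32       # same value as --per_device_train_batch_size
PAD_ID = 0            # same value as pad_token_id in the printed config


def pad_batch(batch, target_len):
    """Right-pad every token-id list in the batch to target_len (hypothetical helper)."""
    return [seq + [PAD_ID] * (target_len - len(seq)) for seq in batch]


def batch_shapes(sequences, pad_to_max_length):
    """Return the set of (batch_size, seq_len) shapes the model would receive."""
    shapes = set()
    for start in range(0, len(sequences), BATCH_SIZE):
        batch = sequences[start:start + BATCH_SIZE]
        if pad_to_max_length:
            # Fixed padding: every batch is padded to the same length.
            target_len = MAX_SEQ_LENGTH
        else:
            # Dynamic padding: each batch is padded to its own longest line.
            target_len = max(len(seq) for seq in batch)
        padded = pad_batch(batch, target_len)
        shapes.add((len(padded), len(padded[0])))
    return shapes


# Synthetic data (an assumption for illustration): 2048 tokenized lines whose
# lengths are spread deterministically between 5 and 150 tokens.
lengths = [5 + (i * 37) % 146 for i in range(2048)]
sequences = [[1] * n for n in lengths]

dynamic_shapes = batch_shapes(sequences, pad_to_max_length=False)
fixed_shapes = batch_shapes(sequences, pad_to_max_length=True)

print("dynamic padding shapes:", sorted(dynamic_shapes))
print("fixed padding shapes:  ", sorted(fixed_shapes))
```

With fixed padding every batch has the same shape, so a single compiled graph can be reused for all steps. With dynamic padding the shape depends on the longest line in each batch, which saves padding on GPU or CPU but works against the TPU's need for static shapes.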

DarshanDeshpande commented 3 years ago

Perfect! Thank you so much! Closing this issue