Closed: M-Salti closed this issue 3 years ago.
I am receiving the same error, even without using a TPU:

```shell
python run_glue.py \
    --model_name_or_path bert-base-cased \
    --task_name MRPC \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/
```
Try with the following:

```shell
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version "nightly"
!pip install git+https://github.com/huggingface/transformers.git
!git clone https://github.com/huggingface/transformers.git
```
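Before launching training, a quick optional sanity check (my addition, not part of the original instructions) is to ask `torch_xla` for a device; if the environment setup failed, this fails immediately instead of partway through training:

```python
# Optional sanity check (not from the original answer): confirms that
# torch_xla imported correctly and that an XLA/TPU device is reachable.
import torch_xla.core.xla_model as xm

print(xm.xla_device())  # prints the default XLA device, or raises if none
```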
Then launch training:

```shell
!python transformers/examples/xla_spawn.py --num_cores 1 \
    question-answering/run_squad_trainer.py \
    --model_name_or_path bert-base-multilingual-cased \
    --model_type bert \
    --data_dir $DATA_DIR \
    --do_train \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir $OUT_DIR \
    --overwrite_output_dir
```
To run it with all 8 TPU cores, you most likely need the 35GB RAM runtime from Google Colab. You can find it in this notebook.
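For background, `xla_spawn.py` is essentially a thin launcher around `torch_xla`'s multiprocessing helper: it imports the target script and spawns one process per requested core, calling the script's `_mp_fn` in each (you can see both frames in the traceback below). A simplified sketch of that dispatch (not the actual launcher code, just its shape):

```python
# Simplified sketch of what xla_spawn.py does: spawn one Python process
# per TPU core and run the target script's _mp_fn in each of them.
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the process ordinal (0 .. nprocs-1); the real launcher
    # would invoke the _mp_fn defined in run_squad_trainer.py instead.
    print(f"TPU process {index} started")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=1)  # nprocs=8 to use all cores
```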
Thanks @AliOsm, it works!
## Environment info

- `transformers` version: 3.0.2

## Who can help

@sgugger

## Information

Model I am using (Bert, XLNet ...): BERT

The problem arises when using:
* [ ] the official example scripts: (give details below)
* [x] my own modified scripts: (give details below)

The task I am working on is:
* [x] an official GLUE/SQUaD task: SQuAD
* [ ] my own task or dataset: (give details below)

The following error arises when using the `run_squad_trainer.py` script with a TPU:

```python
Epoch:       0%  0/2 [00:00, ?it/s]
Iteration:   0it [00:00, ?it/s]
Exception in device=TPU:0: 'NoneType' object cannot be interpreted as an integer
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/content/transformers/examples/question-answering/run_squad_trainer.py", line 156, in _mp_fn
    main()
  File "/content/transformers/examples/question-answering/run_squad_trainer.py", line 145, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 584, in train
    self.epoch = epoch + (step + 1) / len(epoch_iterator)
TypeError: 'NoneType' object cannot be interpreted as an integer
```

## To reproduce

Steps to reproduce the behavior:

1. Install transformers from the master branch.
2. Install pytorch-xla using the following commands:
```shell
VERSION="20200325"
curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
python pytorch-xla-env-setup.py --version $VERSION
```
3. Run the training script (I'm using 1 TPU core merely to simplify the logs; the error is the same, for each core, when using 8 cores):
```shell
cd transformers/examples/
python ./xla_spawn.py --num_cores 1 \
    question-answering/run_squad_trainer.py \
    --model_name_or_path bert-base-multilingual-cased \
    --model_type bert \
    --data_dir $DATA_DIR \
    --do_train \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir $OUT_DIR \
    --overwrite_output_dir
```

## Expected behavior
The script runs and trains the model.
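The `TypeError` in the traceback above is exactly what Python's `len()` raises when an object's `__len__` returns `None` instead of an integer, which suggests the TPU-wrapped dataloader could not report its length. A minimal sketch reproducing the mechanism (the `FakeDeviceLoader` class is hypothetical, purely for illustration):

```python
# Hypothetical stand-in for a per-device loader whose __len__ returns
# None; calling len() on it reproduces the exact error in the traceback.
class FakeDeviceLoader:
    def __len__(self):
        return None  # length unknown, reported as None instead of an int

len(FakeDeviceLoader())
# TypeError: 'NoneType' object cannot be interpreted as an integer
```

Guarding the `len()` call in the trainer (for example with a `try/except TypeError` that falls back to a step count derived from the training arguments) would avoid the crash; whether that matches the fix that eventually landed upstream is not clear from this thread.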