Error in PyTorch Trainer when used with TPU

huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

https://huggingface.co/transformers

Apache License 2.0

129.82k stars 25.79k forks source link

Error in PyTorch Trainer when used with TPU #6450

Closed M-Salti closed 3 years ago

M-Salti commented 3 years ago

Environment info

transformers version: 3.0.2
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
PyTorch version (GPU?): 1.5.0a0+d6149a7 (False)
Tensorflow version (GPU?): 2.3.0 (False)
Using GPU in script?: No
Using distributed or parallel set-up in script?: No

Who can help

@sgugger ## Information Model I am using (Bert, XLNet ...): BERT The problem arises when using: * [ ] the official example scripts: (give details below) * [x] my own modified scripts: (give details below) The tasks I am working on is: * [x] an official GLUE/SQUaD task: SQUaD * [ ] my own task or dataset: (give details below) The following error arises when using the `run_squad_trainer.py` script with TPU: ```python Epoch: 0% 0/2 [00:00

Expected behavior

The script runs and trains the model

utkuumur commented 3 years ago

I am receiving the same error. Even without using TPU.

python run_glue.py --model_name_or_path bert-base-cased --task_name MRPC --do_train --do_eval --data_dir $GLUE_DIR/MRPC/ --max_seq_length 128 --per_device_train_batch_size --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /tmp/mrpc_output/

AliOsm commented 3 years ago

Try with the following:

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version "nightly"

!pip install git+https://github.com/huggingface/transformers.git

!git clone https://github.com/huggingface/transformers.git

!python transformers/examples/xla_spawn.py --num_cores 1 \
    question-answering/run_squad_trainer.py \
    --model_name_or_path bert-base-multilingual-cased \
    --model_type bert \
    --data_dir $DATA_DIR \
    --do_train \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir $OUT_DIR \
    --overwrite_output_dir

To run it with all 8 TPU cores, you most likely need the 35GB RAM runtime from Google Colab. You can find it in this notebook.

M-Salti commented 3 years ago

Thanks @AliOsm, it works!