Closed: lai-agent-t closed this issue 4 years ago
I totally agree with you that the transformers team has needed to address this issue for a long time. I am also struggling to run token classification on TPU. Google gives a TPU v3-8 as part of Google Colab for only $9, which is roughly equivalent to 8x V100 GPUs, yet until now we can't run transformers on TPU. This should be a top priority for the transformers team; at the very least we need one running example of token classification (NER). I managed to do it using XLA, but it's nowhere near native TPU performance.
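For what it's worth, the first thing I check in my XLA runs is that the model actually lands on the TPU device at all; a minimal sketch, assuming torch_xla is installed (the model name is just a placeholder):

```python
# Minimal placement check, assuming a torch_xla/TPU runtime.
import torch_xla.core.xla_model as xm
from transformers import AutoModelForTokenClassification

device = xm.xla_device()  # e.g. xla:1 on a TPU runtime
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased").to(device)
print(device, next(model.parameters()).device)  # both should report the XLA device
```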
Hi, thank you for opening this issue.
@lai-agent-t, did you complete training on the TPU, or did you stop beforehand? If you stopped, was the tokenization process already finished?
@NLPPower, three NER scripts are available in this repository: NER with Trainer, with TFTrainer, and with Pytorch Lightning. All three support TPU. Did you get bad performance/slow training when using those scripts?
I did NOT stop beforehand. I later updated num_train_epochs to 10; training 6 epochs took me almost 2 hours with only 3,000 sentences.
I see, thanks. In your TPU environment, do you mind running the following (please make sure you have transformers installed from source)?
```python
from transformers.file_utils import is_torch_tpu_available

print(is_torch_tpu_available())
```
Thank you!
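For reference, my understanding (worth verifying against the source, so treat this as an assumption) is that is_torch_tpu_available essentially reports whether torch_xla can be imported, roughly:

```python
# Rough sketch of the check; the actual implementation may differ by version.
try:
    import torch_xla.core.xla_model as xm  # noqa: F401
    _torch_tpu_available = True
except ImportError:
    _torch_tpu_available = False

def is_torch_tpu_available():
    return _torch_tpu_available
```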
I struggled to run an NER classifier with an ALBERT model on TPU using TensorFlow. XLA with PyTorch will not give you great performance compared to pure TF. Plus, it doesn't support fp16, which could cut fine-tuning time by 4x. I tested fp16 on a V100, using Docker and TF nightly, and was able to exceed the performance of PyTorch on TPU. To confirm my finding, please have a look at the performance evaluation table at the bottom of this page: https://github.com/allenai/tpu_pretrain. You can see that TF on TPU is almost 4x-6x faster than PyTorch + XLA on TPU.

If you could just create a simple example in Google Colab where a transformer runs on TPU in TF for a token classification task (NER), I would be more than happy, because I have struggled to do it for two weeks, and there are also a couple of folks here who have struggled with it. This should be high priority for the transformers team, because TPU access gives researchers a powerful resource almost for free through Kaggle and Google Colab. Please also have a look at this project, which is the closest thing I could find that runs NER on TPU using a distributed strategy on top of Keras: https://github.com/kyzhouhzau/NLPGNN/tree/master/tests/NER/NER_EN
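To make the pure-TF route concrete, this is roughly the setup I mean; a hedged sketch assuming a Colab TPU runtime and the TF 2.2-era APIs (model name, label count, and data pipeline are placeholders, not a tested recipe):

```python
# Hedged sketch of TF token classification under TPUStrategy (Colab assumed).
import os
import tensorflow as tf
from transformers import TFAutoModelForTokenClassification

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="grpc://" + os.environ["COLAB_TPU_ADDR"]  # Colab-specific TPU address
)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)  # TF 2.2-era API

with strategy.scope():
    # Model variables must be created inside the strategy scope.
    model = TFAutoModelForTokenClassification.from_pretrained(
        "albert-base-v2", num_labels=9
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
# model.fit(...) with a tf.data.Dataset of (input_ids, labels) would go here.
```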
I'm sure my torch_tpu is available, because I tested the example you provided for the TPU case:

```
python examples/xla_spawn.py --num_cores 8 \
  examples/text-classification/run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mnli \
  --data_dir ./data/glue_data/MNLI \
  --output_dir ./models/tpu \
  --overwrite_output_dir \
  --do_train \
  --do_eval \
  --num_train_epochs 1 \
  --save_steps 20000
```

It works without any error, but the utilization of TPU matrix units (higher is better) is 5% and stays stable there.
So I'm confused: does run_language_modeling.py support TPU?
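In case it helps others debugging the same symptom: a quick check (assuming torch_xla is installed, added temporarily inside the training script) is to print the XLA world size; if it reports 8 processes, the spawn itself worked, and the low utilization comes from somewhere else, such as the input pipeline or batch size.

```python
# Temporary debug print inside the spawned training code (torch_xla assumed):
# confirms xla_spawn.py really forked one process per TPU core.
import torch_xla.core.xla_model as xm

print(f"XLA ordinal {xm.get_ordinal()} of {xm.xrt_world_size()} processes")
```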
Same here, is there any update?
I have changed to TensorFlow 2.0 instead of PyTorch...
Any updates on PyTorch?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
❓ Questions & Help
I am trying to use my own dataset on TPU by running run_language_modeling.py; the command I use is below:

```
python examples/xla_spawn.py --num_cores 8 \
  examples/language-modeling/run_language_modeling.py \
  --model_name_or_path hfl/chinese-bert-wwm \
  --output_dir model/tpu \
  --train_data_file /Language_masked_model/data/toy_MLM_data.txt \
  --line_by_line \
  --mlm \
  --block_size 512 \
  --do_train \
  --evaluate_during_training \
  --per_device_train_batch_size 10 \
  --tpu_num_cores 8 \
  --debug \
  --num_train_epochs 1 \
  --save_steps 20000
```
There are no errors, but I assume it is not using the TPU. I monitored TPU usage and got the info below:
```
Cloud TPU Monitoring Results (Sample 20):
TPU type: TPU v3
Utilization of TPU Matrix Units (higher is better): 0.000%

Cloud TPU Monitoring Results (Sample 21):
TPU type: TPU v3
Utilization of TPU Matrix Units (higher is better): 0.000%

Cloud TPU Monitoring Results (Sample 22):
TPU type: TPU v3
Number of TPU cores: 1 (Replica count = 8, num cores per replica = 1)
TPU idle time (lower is better): 0.027%
Utilization of TPU Matrix Units (higher is better): 0.039%
Step time: 11.1ms (avg), 11.1ms (min), 11.1ms (max)
Infeed percentage: 0.000% (avg), 0.000% (min), 0.000% (max)

Cloud TPU Monitoring Results (Sample 23):
TPU type: TPU v3
Utilization of TPU Matrix Units (higher is better): 0.000%

Cloud TPU Monitoring Results (Sample 24):
TPU type: TPU v3
Utilization of TPU Matrix Units (higher is better): 0.000%
```
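For reference, monitoring output in this format is what the cloud-tpu-profiler tool prints; I believe a command along these lines reproduces it (the TPU name is a placeholder):

```
capture_tpu_profile --tpu=<your-tpu-name> --monitoring_level=2
```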
My question is: does run_language_modeling.py support TPU?
TPU: v3-8 on Google Cloud Platform
tensorflow==2.2.0
torch==1.7.0a0+12b5bdc
torch-xla==1.6+5430aca
I use the official Docker image from the XLA repo (gcr.io/tpu-pytorch/xla:nightly_3.6).