google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

GPU not used when fine-tuning on SQuAD 2.0 #194

Open Inception95 opened 4 years ago

Inception95 commented 4 years ago

Hi, thanks for the great contribution!

I am trying to fine-tune on SQuAD 2.0 using run_squad_v2.py, but the GPU is not used:

    python -m albert.run_squad_v2 \
      --albert_config_file=albert/albert_base/albert_config.json \
      --output_dir=albert/outputs \
      --train_file=albert/train-v2.0.json \
      --predict_file=albert/dev-v2.0.json \
      --vocab_file=albert/albert_base/30k-clean.vocab \
      --train_feature_file=train_feature_file.tf \
      --predict_feature_file=predict_feature_file.tf \
      --predict_feature_left_file=predict_left_feature_file.tf \
      --init_checkpoint=albert/albert_base/model.ckpt-best.index \
      --spm_model_file=albert/albert_base/30k-clean.model \
      --do_lower_case \
      --max_seq_length=384 \
      --doc_stride=128 \
      --max_query_length=64 \
      --do_train \
      --do_predict \
      --train_batch_size=48 \
      --predict_batch_size=8 \
      --learning_rate=5e-5 \
      --num_train_epochs=5.0 \
      --warmup_proportion=.1 \
      --save_checkpoints_steps=5000 \
      --n_best_size=20 \
      --max_answer_length=30

Training reports examples/sec: 0.389458, and nvidia-smi shows GPU usage at 0%.

I also found the "use_tpu" flag in run_squad_v2.py, but its description only distinguishes TPU from "GPU/CPU". In the "GPU/CPU" case, which device takes priority, the GPU or the CPU?

OS: Ubuntu 16.04
GPU: K80
TensorFlow: 1.15.2

Does anyone know the possible solutions to this situation? Thanks in advance.

lixiangtnt commented 4 years ago

The tensorflow 1.15.2 package is the CPU-only build. Try tensorflow-gpu==1.15.
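
To confirm which build is actually in use before reinstalling anything, a minimal sketch along these lines may help. The helper name gpu_report is hypothetical (not part of ALBERT or TensorFlow); tf.test.is_built_with_cuda and device_lib.list_local_devices are TensorFlow calls that exist in 1.15.

```python
def gpu_report():
    """Summarize whether the installed TensorFlow can see a GPU.

    Hypothetical helper for illustration; not part of ALBERT.
    """
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"importable": False}

    # Names of all GPU devices TensorFlow can see, e.g. "/device:GPU:0".
    gpus = [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]
    return {
        "importable": True,
        "version": tf.__version__,
        # False means this is a CPU-only build, whatever nvidia-smi shows.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": gpus,
    }


if __name__ == "__main__":
    print(gpu_report())
```

If built_with_cuda is False, the installed package is a CPU-only build. If it is True but gpus is empty, the build supports GPUs but TensorFlow cannot find a usable device or driver.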

YuHengKit commented 4 years ago

@lixiangtnt Does this mean uninstalling tensorflow 1.15.2 and using only the tensorflow-gpu library? I appreciate your help.

YuHengKit commented 4 years ago

I solved this problem with conda install tensorflow-gpu==1.15. pip install tensorflow-gpu==1.15 doesn't work as expected.
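
The thread does not say why the pip install failed. One possible cause, stated here as an assumption only, is that the pip wheel relies on CUDA and cuDNN libraries already installed on the system, while conda installs matching ones alongside the package. The sketch below checks whether those libraries can be loaded. The library names assume TensorFlow 1.15 GPU wheels linked against CUDA 10.0 and cuDNN 7 on Linux, and missing_cuda_libs is a hypothetical helper name.

```python
import ctypes

# Assumption: TensorFlow 1.15 GPU wheels expect CUDA 10.0 and cuDNN 7 on Linux.
REQUIRED_LIBS = ("libcudart.so.10.0", "libcublas.so.10.0", "libcudnn.so.7")


def missing_cuda_libs(names=REQUIRED_LIBS):
    """Return the shared libraries from `names` that the loader cannot find."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_cuda_libs())
```

An empty list suggests the libraries are on the loader path. Any listed name points to a missing or mismatched CUDA/cuDNN installation, or to an LD_LIBRARY_PATH that does not include it.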

PremalMatalia commented 3 years ago

@YuHengKit - Could you share the notebook that ran ALBERT fine-tuning on the SQuAD 2.0 dataset successfully? In my case, training never really gets going: it stops abruptly after the warm-up steps without any error.

Appreciate your help in advance.