kamalkraj / ALBERT-TF2.0

ALBERT model Pretraining and Fine Tuning using TF2.0
Apache License 2.0

Multi GPU finetuning #19

Closed steindor closed 4 years ago

steindor commented 4 years ago

I have a question regarding your experiment fine-tuning for SQuAD 2.0 with 4x Titan RTX 24 GB. How long was the total training time? I'm running the same experiment with 8x Tesla V100 16 GB, which according to my calculations will take about 200 hrs. I was expecting a much lower training time with 8 GPUs.

python albert-tf2/run_squad.py \
  --mode=train_and_predict \
  --input_meta_data_path=${OUTPUTDIR}/squad${SQUAD_VERSION}_meta_data \
  --train_data_path=${OUTPUTDIR}/squad${SQUAD_VERSION}_train.tf_record \
  --predict_file=${SQUAD_DIR}/dev-${SQUAD_VERSION}.json \
  --albert_config_file=${ALBERT_DIR}/config.json \
  --init_checkpoint=${ALBERT_DIR}/tf2_model.h5 \
  --spm_model_file=${ALBERT_DIR}/30k-clean.model \
  --train_batch_size=32 \
  --predict_batch_size=32 \
  --learning_rate=1.5e-5 \
  --num_train_epochs=3 \
  --model_dir=${OUTPUT_DIR} \
  --strategy_type=mirror \
  --version_2_with_negative \
  --max_seq_length=384

Thanks in advance!

kamalkraj commented 4 years ago

@steindor ~20 hours on 4 x TITAN RTX 24 GB (including data preprocessing), with the same parameters as in the README.

steindor commented 4 years ago

Thanks for the reply. So that would be around 17 hrs without preprocessing? That makes much more sense. To activate multi-GPU training, the only adjustment needed is the --strategy_type=mirror flag, right? Just want to be sure I'm not missing something here :)

kamalkraj commented 4 years ago

@steindor You're correct. Once the run starts, you can check GPU utilization with nvidia-smi.
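
For reference, the --strategy_type=mirror flag presumably maps to tf.distribute.MirroredStrategy (single-host, synchronous multi-GPU training). Below is a minimal sketch of how such a run is typically structured in TF2; the model, optimizer settings, and data here are placeholders and not the repository's run_squad.py code, only the strategy usage is the point.

# Minimal sketch of a MirroredStrategy training loop (placeholders, not the repo's code)
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(384,)),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1.5e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data; the global batch size is split across the replicas automatically
x = tf.random.normal((256, 384))
y = tf.random.uniform((256,), maxval=2, dtype=tf.int32)
model.fit(x, y, batch_size=32, epochs=1)

Note that with MirroredStrategy the --train_batch_size is the global batch size, so each GPU sees batch_size / num_replicas examples per step.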

steindor commented 4 years ago

Yes exactly. Thanks for the help.