google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

ALBERT-xxlarge V2 training on TPU V3-512 extremely slow #198

Closed agemagician closed 4 years ago

agemagician commented 4 years ago

Hello,

We are training on bioinformatics data using ALBERT-xxlarge on a TPU v3-512.

According to the paper, you trained ALBERT-xxlarge for 125k steps in 32 hours.

However, our training will take 7 days to complete 130k steps.

Our vocab size is only 34, and this is our training command:

python -m albert.run_pretraining \
    --input_file=gs://...../_train_*.tfrecord \
    --output_dir=gs:/....../albert_model/ \
    --albert_config_file=/......../albert-xxlarge-v2-config.json \
    --do_train \
    --do_eval \
    --train_batch_size=10240 \
    --eval_batch_size=64 \
    --max_seq_length=512 \
    --max_predictions_per_seq=20 \
    --optimizer='lamb' \
    --learning_rate=.002 \
    --iterations_per_loop=100 \
    --num_train_steps=130000 \
    --num_warmup_steps=42000 \
    --save_checkpoints_steps=1000 \
    --use_tpu=TRUE \
    --num_tpu_cores=512 \
    --tpu_name=.....

I also tried changing "iterations_per_loop" to 1000 or even larger, but that didn't help.

The current training logs are:

I0407 18:04:44.154831 140197639472896 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
W0407 18:04:44.230242 140197639472896 preempted_hook.py:91] TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
INFO:tensorflow:Outfeed finished for iteration (8, 70)
I0407 18:05:06.949739 140197647865600 tpu_estimator.py:279] Outfeed finished for iteration (8, 70)
I0407 18:05:14.312140 140197639472896 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
W0407 18:05:14.393373 140197639472896 preempted_hook.py:91] TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
I0407 18:05:44.470578 140197639472896 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
W0407 18:05:44.566381 140197639472896 preempted_hook.py:91] TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
INFO:tensorflow:Outfeed finished for iteration (8, 84)
I0407 18:06:08.473748 140197647865600 tpu_estimator.py:279] Outfeed finished for iteration (8, 84)
I0407 18:06:14.650656 140197639472896 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
W0407 18:06:14.725901 140197639472896 preempted_hook.py:91] TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
I0407 18:06:44.819700 140197639472896 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
W0407 18:06:44.902827 140197639472896 preempted_hook.py:91] TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
INFO:tensorflow:Outfeed finished for iteration (8, 98)
I0407 18:07:09.999137 140197647865600 tpu_estimator.py:279] Outfeed finished for iteration (8, 98)
I0407 18:07:14.984425 140197639472896 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
W0407 18:07:15.060185 140197639472896 preempted_hook.py:91] TPUPollingThread found TPU b'node-6' in state READY, and health HEALTHY.
INFO:tensorflow:loss = 3.1807582, step = 900 (463.227 sec)
I0407 18:07:18.708591 140198823081728 basic_session_run_hooks.py:260] loss = 3.1807582, step = 900 (463.227 sec)
INFO:tensorflow:global_step/sec: 0.215877
I0407 18:07:18.709693 140198823081728 tpu_estimator.py:2307] global_step/sec: 0.215877
INFO:tensorflow:examples/sec: 2210.58
I0407 18:07:18.709883 140198823081728 tpu_estimator.py:2308] examples/sec: 2210.58
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.

It takes around 463 seconds per 100 steps, so training 130k steps will take about 7 days: (130000 / 100) * 463 ≈ 601,900 seconds ≈ 7 days.
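To spell the projection out (a quick back-of-the-envelope in Python, using only the numbers from the log above, not anything repo-specific):

# Projection from the log line "loss = ..., step = 900 (463.227 sec)".
seconds_per_100_steps = 463.227
total_steps = 130_000

total_seconds = (total_steps / 100) * seconds_per_100_steps
print(total_seconds, total_seconds / 3600, total_seconds / 86400)
# ~602,000 s  ->  ~167 hours  ->  ~7 days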

The TPU, the VM, and the bucket are all in the same region.

On Summit (the world's fastest supercomputer), I was able to train a 30-layer BERT in only 24 hours for around 122k steps, using 6k V100 GPUs and a global batch size of 11k.

Do you have any idea why we can't reproduce the speed reported in the paper?

@0x0539 @Danny-Google Your feedback will be highly appreciated

agemagician commented 4 years ago

I have used the TPU profiler to get more insight. Here is a screenshot from the op_profile view:

[screenshot: TPU op_profile view]

Overview:

[screenshot: profiler overview]

It seems the TPU is well utilized.

Danny-Google commented 4 years ago

Hi Ahmed,

Can you try a batch_size of 4096? Increasing the save_checkpoints_steps should also be helpful. We use save_checkpoints_steps=5000.
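For context, a rough sketch of why a smaller global batch speeds up each step (back-of-the-envelope arithmetic, not something stated in the thread; it assumes the global batch is split evenly across the 512 cores, as --num_tpu_cores=512 implies):

# Per-core batch size for the two global batch sizes discussed in this thread.
num_cores = 512
for global_batch in (10240, 4096):
    print(global_batch, "->", global_batch // num_cores, "sequences per core per step")
# 10240 -> 20 per core, 4096 -> 8 per core: each step processes less data,
# so global_step/sec rises even if examples/sec stays roughly the same.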

agemagician commented 4 years ago

Hi @Danny-Google,

Thanks a lot for your quick reply.

  1. As far as I know, increasing the batch size on TPU is recommended rather than decreasing it, as long as it is a multiple of 64. I will test decreasing it to 4k as you recommended.

  2. I will increase save_checkpoints_steps, but that only reduces the overhead when the training script saves a checkpoint; it doesn't affect the time required to finish 100 steps.

Could you please share with us the following information for the ALBERT-xxlarge training:

  1. global_step/sec.
  2. examples/sec.
  3. time to finish 100 steps.

Danny-Google commented 4 years ago

Yes, it is recommended to increase the batch_size, but not if you want to increase the speed of each step.

Here are my numbers.

  1. global_step/sec: ~1.092
  2. examples/sec: ~4468
  3. time to finish 125k steps: 1d7h20m28s.

agemagician commented 4 years ago

I have tested it with a batch size of 4k as you recommended, but I still can't match your numbers.

Here are the logs:

INFO:tensorflow:loss = 3.1838515, step = 200 (191.310 sec)
I0407 23:44:33.365083 140299591386880 basic_session_run_hooks.py:260] loss = 3.1838515, step = 200 (191.310 sec)
INFO:tensorflow:global_step/sec: 0.522713
I0407 23:44:33.366309 140299591386880 tpu_estimator.py:2307] global_step/sec: 0.522713
INFO:tensorflow:examples/sec: 2141.03
I0407 23:44:34.340968 140299591386880 tpu_estimator.py:2308] examples/sec: 2141.03
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.

Of course the global steps per second did increase, but I get the same examples per second as with the 10k batch size. Nothing changed.

Your reported examples/sec is more than double mine.
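A quick consistency check (using only numbers already quoted in this thread) shows that examples/sec is just global_step/sec times the global batch size in every case, so the whole gap is in per-step time rather than in how throughput is counted:

# examples/sec ≈ global_step/sec * global batch size, for each reported run.
runs = {
    "batch 10240 (first log)":  (0.215877, 10240),
    "batch 4096 (second log)":  (0.522713, 4096),
    "batch 4096 (Danny's run)": (1.092,    4096),
}
for name, (steps_per_sec, batch) in runs.items():
    print(name, "->", round(steps_per_sec * batch))
# -> 2211, 2141 and 4473, matching the reported 2210.58, 2141.03 and ~4468.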

agemagician commented 4 years ago

I am using TensorFlow 1.15.2, and the TPU pod version is also 1.15.2.

This is our JSON config file:

{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "embedding_size": 128,
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 512,
  "num_attention_heads": 64,
  "num_hidden_layers": 12,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "layers_to_keep": [],
  "gap_size": 0,
  "num_memory_blocks": 0,
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": 34
}

Could this be a side effect of having a small vocabulary in ALBERT?
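A back-of-the-envelope parameter count from the config above (ignoring biases, layer norms, and the MLM/SOP heads) suggests the vocabulary is far too small a component to explain a 2x slowdown:

# Rough parameter breakdown for the config above.
V, E, H, I = 34, 128, 4096, 16384

vocab_embedding = V * E                    # 4,352 params
embedding_projection = E * H               # factorized embedding -> hidden
attention = 4 * H * H                      # Q, K, V and output projections
ffn = 2 * H * I                            # feed-forward up/down projections
shared_layer = attention + ffn             # ALBERT shares this across all 12 layers

total = vocab_embedding + embedding_projection + shared_layer
print(vocab_embedding, total)              # ~4.4k out of ~202M parameters
# Per-step compute is dominated by the shared transformer layer, not the
# embedding lookup, so a vocab of 34 should not change the step time.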

agemagician commented 4 years ago

I have tested the model with embedding_size 34 and 16, but it didn't have any effect on the speed.

Danny-Google commented 4 years ago

That is indeed strange! My iterations_per_loop is 1000, which you have already tried. Other than that, I couldn't find any difference that would cause your examples/sec to be half of what I had.

Danny-Google commented 4 years ago

Especially when your TPU usage is high...

Danny-Google commented 4 years ago

By the way, the numbers I reported were with dropout_rate = 0.1; if we set it to 0, global_step/sec is roughly 1.197 and examples/sec is roughly 4913.

agemagician commented 4 years ago

That makes my examples/sec gap even worse: 2141 compared to 4913.

This indeed looks strange. The only differences in our use case are:

  1. The vocabulary size is much smaller (34).
  2. The total TFRecord size is much larger (1 TB) with a dupe factor of only 2, and the number of TFRecord files is much larger (6k files), each with 72000 samples (see the rough arithmetic below).
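Rough numbers for that second point (simple arithmetic from the figures above):

# Dataset size vs. examples consumed by the full training run.
files, samples_per_file = 6_000, 72_000
dataset_examples = files * samples_per_file        # 432,000,000 on disk

steps, global_batch = 130_000, 4_096
consumed = steps * global_batch                    # 532,480,000 examples seen
print(dataset_examples, consumed, round(consumed / dataset_examples, 2))
# ~432M examples on disk vs ~532M consumed (~1.2 passes), so the input pipeline
# must sustain the full examples/sec rate for the entire run -- though the
# profiler indicates training is not input-bound.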

Regarding the earlier suggestions:

  1. I tried changing the embedding_size (the E parameter) to 128, 34, and 16, but it had no effect.
  2. The profiler reported that the training is not input-bound.

Is there anyone you know who could help us figure out why we have lower throughput? I had hoped we would get better throughput using the TPU Pod compared to Summit.

Thanks again @Danny-Google for taking this much time to help us.

agemagician commented 4 years ago

One thing that came to my mind: did you use bfloat16 for training the model? I didn't find anything in the code related to explicitly using bfloat16.
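For reference, this is roughly how bfloat16 is opted into manually on TPU with TF 1.15 (a minimal sketch only, not how run_pretraining.py is written; the dense layers are a hypothetical stand-in for the real model):

import tensorflow.compat.v1 as tf

def forward_pass(features):
  # Under bfloat16_scope, ops run in bfloat16 on the TPU's matrix units while
  # variables are still created and stored in float32.
  with tf.tpu.bfloat16_scope():
    hidden = tf.layers.dense(features, 4096, activation=tf.nn.relu)  # stand-in model
    logits = tf.layers.dense(hidden, 2)
  # Cast back to float32 before computing the loss, for numerical stability.
  return tf.cast(logits, tf.float32)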

sultanovazamat commented 4 years ago

Hi @agemagician! Could you please share some info about the pretraining-data creation process? How much time and memory did create_pretraining.py consume on your dataset? And what is the size of your dataset? Thanks!

guotong1988 commented 4 years ago

May I ask where you got the TPUs? Thank you very much.

Danny-Google commented 4 years ago

@agemagician, sorry for the late reply; I've been super busy recently. About using bfloat16: I don't remember us doing that explicitly. Also, according to https://cloud.google.com/tpu/docs/bfloat16, using bfloat16 will not cut your running time in half.