Closed: agemagician closed this issue 4 years ago.
I have used the TPU profiler to get more insight. Here is a screenshot from the op_profiler:
[op_profiler overview screenshot]
It seems the TPU is well utilized.
Hi Ahmed,
Can you try a batch_size of 4096? Increasing the save_checkpoints_steps should also be helpful. We use save_checkpoints_steps=5000.
Hi @Danny-Google,
Thanks a lot for your quick reply.
As far as I know, increasing the batch size on TPUs is recommended rather than decreasing it, as long as it is a multiple of 64. I will still test decreasing it to 4k as you recommended.
I will increase save_checkpoints_steps, but this will only reduce the overhead when the training script saves the model; it doesn't affect the time required to finish 100 steps.
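For reference, here is roughly where these two knobs live in a TF 1.15 TPUEstimator setup. This is a minimal sketch, not our real configuration; the model_dir path is illustrative:

```python
# Minimal sketch of where these knobs live in a TF 1.15 TPUEstimator setup.
# The model_dir path is illustrative, not a real bucket.
import tensorflow.compat.v1 as tf

run_config = tf.estimator.tpu.RunConfig(
    model_dir="gs://my-bucket/albert-xxlarge",
    save_checkpoints_steps=5000,  # checkpoint less often -> less save overhead
    tpu_config=tf.estimator.tpu.TPUConfig(
        iterations_per_loop=1000,  # steps per host<->TPU round trip
        num_shards=512,            # v3-512 pod
    ),
)
```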
Could you please share with us the following information for the ALBERT-xxlarge training:
Yes, it is recommended to increase the batch_size, but not if you want to increase the speed of each step.
Here are my numbers.
I have tested it with 4k as you recommended, but I still can't match your numbers.
Here are the logs:
INFO:tensorflow:loss = 3.1838515, step = 200 (191.310 sec)
I0407 23:44:33.365083 140299591386880 basic_session_run_hooks.py:260] loss = 3.1838515, step = 200 (191.310 sec)
INFO:tensorflow:global_step/sec: 0.522713
I0407 23:44:33.366309 140299591386880 tpu_estimator.py:2307] global_step/sec: 0.522713
INFO:tensorflow:examples/sec: 2141.03
I0407 23:44:34.340968 140299591386880 tpu_estimator.py:2308] examples/sec: 2141.03
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
Of course the global steps per second increased, but I get the same examples per second as with the 10k batch. Nothing changed.
Your reported examples/sec is more than double mine.
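To make the relation explicit: examples/sec is just batch_size × global_step/sec, which is why shrinking the batch raises steps/sec without changing throughput. A quick check against the log above:

```python
# examples/sec = batch_size * global_step/sec, so a smaller batch raises
# steps/sec but leaves throughput unchanged.
batch_size = 4096
steps_per_sec = 0.522713            # global_step/sec from the log above
print(batch_size * steps_per_sec)   # -> 2141.03, matching examples/sec
```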
I am using tf 1.15.2 and the TPU pod version is also 1.15.2.
This is our JSON config file:
{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "embedding_size": 128,
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 512,
  "num_attention_heads": 64,
  "num_hidden_layers": 12,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "layers_to_keep": [],
  "gap_size": 0,
  "num_memory_blocks": 0,
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": 34
}
Could this be a side effect of having a small vocab in ALBERT?
I have tested the model with embedding_size 34 and 16, but it didn't have any effect on the speed.
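A back-of-the-envelope parameter count for the config above suggests why the vocab and embedding sizes should be negligible here (a rough sketch that ignores biases, layer norms, and the MLM/SOP heads):

```python
# Rough parameter count for the config above, showing that the shared
# transformer block dominates and vocab_size barely matters.
vocab_size, emb, hidden, inter = 34, 128, 4096, 16384

embedding = vocab_size * emb + emb * hidden  # token embeddings + projection
attention = 4 * hidden * hidden              # Q, K, V, output projections
ffn       = 2 * hidden * inter               # up- and down-projection
shared_block = attention + ffn               # shared across all 12 layers

print(f"embedding: {embedding / 1e6:.2f}M")              # ~0.53M
print(f"transformer (shared): {shared_block / 1e6:.1f}M")  # ~201M
```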
That is indeed strange! My iterations_per_loop is 1000, which you have already tried. Other than that, I couldn't find any difference that would cause your examples/sec to be half of what I had.
Especially when your TPU usage is high...
By the way, the numbers I reported were with dropout_rate = 0.1; if we set it to 0, the global_step/sec is roughly 1.197 and examples/sec is roughly 4913.
This makes my examples/sec ratio even worse: 2141 compared to 4913.
This indeed looks strange. The only differences in our use case are:
For testing previous changes:
Is there anyone you know who could help us figure out why we have lower throughput? I hoped that we would get better throughput using the TPU pod compared to SUMMIT.
Thanks again @Danny-Google for taking this much time to help us.
One thing that came to my mind: did you use bfloat16 for training the model? I didn't find anything in the code related to explicitly using bfloat16.
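For reference, manually opting in on TPU in TF 1.x looks roughly like the sketch below. This follows the general Cloud TPU docs pattern, not the ALBERT code; build_model here is a hypothetical stand-in:

```python
# Sketch of manual bfloat16 on TPU in TF 1.15; build_model is hypothetical.
import tensorflow.compat.v1 as tf

def build_model(x):
    # stand-in for the real network; a single dense layer for illustration
    return tf.layers.dense(x, units=2)

def forward(features):
    features = tf.cast(features, tf.bfloat16)  # run activations in bfloat16
    with tf.tpu.bfloat16_scope():
        logits = build_model(features)
    return tf.cast(logits, tf.float32)         # cast back before the loss
```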
Hi @agemagician! Could you please share some info about your pretraining data processing? How much time and memory did create_pretraining.py consume on your dataset? And what is the size of your dataset? Thanks!
May I ask where you got the TPUs? Thank you very much.
@agemagician, sorry for the late reply, I got super busy recently. About using bfloat16, I don't remember us doing that explicitly. Plus, according to this page (https://cloud.google.com/tpu/docs/bfloat16), using bfloat16 will not cut your running time in half.
Hello,
We are training on bioinformatics data using ALBERT-xxlarge on a TPU v3-512.
According to the paper, you trained ALBERT-xxlarge for 125k steps in 32h.
However, our training will take 7 days to complete 130k steps.
Our vocab size is only 34, and this is our training command:
I also tried changing iterations_per_loop to 1000 or even bigger, but that didn't help.
The current logs from the training are:
It takes around 463 seconds per 100 steps, which means training 130k steps will take 7 days: (130000 / 100) * 463 = 601,900 seconds ≈ 6.97 days.
The TPU, the server, and the bucket are all in the same region.
On SUMMIT (the world's fastest computer), I was able to train a 30-layer BERT, and it took only 24 hours to finish around 122k steps using 6k V100 GPUs with a global batch size of 11k.
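As a rough comparison of raw throughput from the numbers above (keeping in mind that a 30-layer BERT and ALBERT-xxlarge are different models):

```python
# Back-of-the-envelope throughput comparison; the models differ, so this is
# only a rough comparison of raw examples/sec.
summit = 122_000 * 11_000 / (24 * 3600)  # steps * global batch / seconds
print(f"SUMMIT: ~{summit:,.0f} examples/sec vs our TPU pod: ~2141 examples/sec")
```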
Do you have any idea why we can't reproduce the same speed as the paper?
@0x0539 @Danny-Google Your feedback will be highly appreciated