guotong1988 / BERT-GPU

Multi-GPU pre-training of BERT from scratch on one machine, without Horovod (data parallelism)
Apache License 2.0

train_batch_size and time required to pretrain #29

Closed. Jimojimojimo closed this issue 3 years ago.

Jimojimojimo commented 3 years ago

When I set train_batch_size to 8 and run on 8 GPUs, the overall batch size is 64. I expected training to be faster than with batch size 8 on a single GPU, but in practice it takes about the same amount of time. Is something going wrong?
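
For context, a back-of-the-envelope sketch (illustrative numbers, not taken from this issue) of why per-step time can look unchanged under data parallelism: each GPU still pushes 8 examples through the model per step, so the gain shows up as fewer steps needed to cover the same data, not as faster individual steps.

```python
# Illustrative arithmetic only; the dataset size below is hypothetical.
per_gpu_batch_size = 8        # train_batch_size per device (assumed)
num_gpus = 8
dataset_size = 1_000_000      # hypothetical number of training examples

global_batch_size = per_gpu_batch_size * num_gpus            # 64
steps_per_epoch_single = dataset_size // per_gpu_batch_size  # 125,000 steps
steps_per_epoch_multi = dataset_size // global_batch_size    # 15,625 steps

print(f"global batch size:        {global_batch_size}")
print(f"steps per epoch, 1 GPU:   {steps_per_epoch_single}")
print(f"steps per epoch, 8 GPUs:  {steps_per_epoch_multi}  (~{num_gpus}x fewer)")
```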

guotong1988 commented 3 years ago

The learning rate is the same for batch size 8 and batch size 64.

The learning rate here is the total learning rate for the whole (global) batch.

See the README.
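
To make that convention concrete, here is a minimal sketch (assumed variable names, not the repository's actual code; check the README and the pre-training script for the real flags) of how a total-batch learning rate relates to the linear scaling heuristic (Goyal et al., 2017) that is sometimes applied when the global batch size grows:

```python
# The configured learning rate applies to the whole global batch,
# so it is NOT multiplied by the number of GPUs automatically.
num_gpus = 8
per_gpu_batch_size = 8
global_batch_size = per_gpu_batch_size * num_gpus   # 64

base_lr = 1e-4   # hypothetical learning rate for the global batch, as configured

# If you wanted to follow the linear scaling heuristic relative to a
# single-GPU batch of 8, you would scale the learning rate yourself:
reference_batch_size = 8
scaled_lr = base_lr * global_batch_size / reference_batch_size   # 8e-4

print(f"configured (total-batch) lr: {base_lr}")
print(f"linearly scaled lr:          {scaled_lr}")
```

Whether to scale the learning rate is a training-recipe choice, not something this data-parallel setup does for you.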

guotong1988 commented 3 years ago

(image attachment)