Closed: yanghongjiazheng closed this issue 4 years ago.
Did you follow the multi-GPU documentation? Esp. important:
horovod_reduce_type = "param"
horovod_param_sync_step = 1000 (or sth like that)
Use HDFDataset, and set cache_size = 0.
Check the computing time reported in the training log, e.g. "(99.3% computing time)". That percentage should be close to 100%; if it is much lower, the workers spend their time waiting on data loading instead of computing.
Please make sure that you have all that correct.
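For reference, a minimal config sketch with those options might look like the following. Only `horovod_reduce_type` and `horovod_param_sync_step` are named in this thread; `use_horovod`, the string form of `cache_size`, and the dataset details are assumptions here, so check the current multi-GPU docs for the exact names and values:

```python
# Minimal sketch of a RETURNN config using the settings above.
# use_horovod, cache_size = "0", and the dataset fields are assumptions;
# the horovod_* options are the ones named in this thread.
use_horovod = True                # assumed switch to enable Horovod multi-GPU training
horovod_reduce_type = "param"     # average parameters instead of gradients
horovod_param_sync_step = 1000    # steps between parameter syncs (see above)

cache_size = "0"                  # disable dataset caching, as recommended
train = {"class": "HDFDataset", "files": ["train.hdf"]}  # hypothetical file name
```

Such a config would then typically be launched with Horovod's usual `horovodrun`/`mpirun` wrappers around RETURNN's `rnn.py` entry point, one process per GPU.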
Maybe @pavelgolik or @curufinwe can help further here.
Dear friends, thanks for your great work. I have tried to train RETURNN with multiple GPUs using Horovod, but the result is not good: multi-GPU training did not save us much time. So I suspect that multi-GPU training is actually just several single-GPU trainings combined: each worker trains on its own, and after one epoch they communicate and refine the weights? In that case, are we effectively training the model for a multiple of the epochs? And would it then be necessary to decrease batch_size to get the benefit of multi-GPU training? In short: when RETURNN trains with multiple GPUs using Horovod, is it necessary to decrease batch_size?
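As a rough illustration of the scheme the question describes (independent workers whose parameters are periodically averaged, which is roughly what horovod_reduce_type = "param" with a sync step means), here is a conceptual sketch. It is not RETURNN's actual implementation; mpi4py, the toy SGD loop, and the fake gradients are illustration-only assumptions:

```python
# Conceptual sketch (NOT RETURNN's implementation) of "param"-style
# multi-GPU training: each worker runs its own SGD updates, and the
# parameters are averaged across workers every sync_step updates.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
sync_step = 1000  # plays the role of horovod_param_sync_step

params = np.zeros(10)  # stand-in for the model weights
for step in range(1, 10001):
    grad = np.random.randn(*params.shape)  # stand-in for a local mini-batch gradient
    params -= 0.01 * grad                  # independent local update on this worker
    if step % sync_step == 0:
        # Average the parameters over all workers. Between syncs the
        # workers diverge, so this is not simply N independent runs.
        params = comm.allreduce(params, op=MPI.SUM) / comm.size
```

On the batch_size question: with gradient reduction the effective batch typically scales with the number of workers, which is the usual reason to adjust batch_size or the learning rate; with parameter averaging as sketched above, each worker keeps its own batch size between syncs. The multi-GPU documentation referenced above is the authoritative source here.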