Closed: yanghongjiazheng closed this issue 4 years ago.
Did you follow the multi-GPU documentation? Esp. important:
horovod_reduce_type = "param"
horovod_param_sync_step = 1000 (or sth like that)
Use HDFDataset, and set cache_size = 0.
Check the computing time reported in the training log, e.g. "(99.3% computing time)". That percentage should be close to 100%; if it is much lower, the workers spend their time waiting on data loading instead of computing.
Please make sure that you have all that correct.
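For reference, a minimal config sketch with those options might look like the following. Only `horovod_reduce_type` and `horovod_param_sync_step` are named in this thread; `use_horovod`, the string form of `cache_size`, and the dataset details are assumptions here, so check the current multi-GPU docs for the exact names and values:

```python
# Minimal sketch of a RETURNN config using the settings above.
# use_horovod, cache_size = "0", and the dataset fields are assumptions;
# the horovod_* options are the ones named in this thread.
use_horovod = True                # assumed switch to enable Horovod multi-GPU training
horovod_reduce_type = "param"     # average parameters instead of gradients
horovod_param_sync_step = 1000    # steps between parameter syncs (see above)

cache_size = "0"                  # disable dataset caching, as recommended
train = {"class": "HDFDataset", "files": ["train.hdf"]}  # hypothetical file name
```

Such a config would then typically be launched with Horovod's usual `horovodrun`/`mpirun` wrappers around RETURNN's `rnn.py` entry point, one process per GPU.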
Maybe @pavelgolik or @curufinwe can help further here.
Dear friends, thanks for your great work. I have tried to train RETURNN with multiple GPUs using Horovod, but the result is not good: multi-GPU training did not save us much time. So I suspect that multi-GPU training is actually just several single-GPU trainings combined: each worker trains on its own, and after one epoch they communicate and refine the weights? In that case, are we effectively training the model for a multiple of the epochs? And would it then be necessary to decrease batch_size to get the benefit of multi-GPU training? In short: when RETURNN trains with multiple GPUs using Horovod, is it necessary to decrease batch_size?
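As a rough illustration of the scheme the question describes (independent workers whose parameters are periodically averaged, which is roughly what horovod_reduce_type = "param" with a sync step means), here is a conceptual sketch. It is not RETURNN's actual implementation; mpi4py, the toy SGD loop, and the fake gradients are illustration-only assumptions:

```python
# Conceptual sketch (NOT RETURNN's implementation) of "param"-style
# multi-GPU training: each worker runs its own SGD updates, and the
# parameters are averaged across workers every sync_step updates.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
sync_step = 1000  # plays the role of horovod_param_sync_step

params = np.zeros(10)  # stand-in for the model weights
for step in range(1, 10001):
    grad = np.random.randn(*params.shape)  # stand-in for a local mini-batch gradient
    params -= 0.01 * grad                  # independent local update on this worker
    if step % sync_step == 0:
        # Average the parameters over all workers. Between syncs the
        # workers diverge, so this is not simply N independent runs.
        params = comm.allreduce(params, op=MPI.SUM) / comm.size
```

On the batch_size question: with gradient reduction the effective batch typically scales with the number of workers, which is the usual reason to adjust batch_size or the learning rate; with parameter averaging as sketched above, each worker keeps its own batch size between syncs. The multi-GPU documentation referenced above is the authoritative source here.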