When I ran tf_cnn_benchmarks with and without horovod, I got different evaluation sequences.
Without horovod:
python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 6
I got two evaluation points: one for each epoch, as specified in the command line options.
...
Running evaluation at global_step 1679
...
Running final evaluation at global_step 3347
However, with horovod, there was only one evaluation point, at the end of the whole training run (2 epochs).
mpirun -np 6 -H p10login1:6 python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 1 --variable_update horovod
...
Running final evaluation at global_step 3347
It did not do an evaluation at the end of the first epoch, which is step number 1679. I traced the cause to self.batch_size on lines 1553 and 1554 of benchmark_cnn.py: with Horovod, self.batch_size is the per-worker batch size, so the computed epoch length is num_workers times too long. Replacing it with (self.batch_size * self.num_workers) seemed to work.
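To illustrate the off-by-a-factor bug, here is a minimal sketch of the epoch-boundary computation. The function and variable names are illustrative, not the actual code in benchmark_cnn.py; the assumption is that the evaluation trigger compares the global step against num_examples divided by the batch size:

```python
import math

def steps_per_epoch(num_examples, per_worker_batch_size, num_workers):
    """Correct version: with Horovod, every worker processes its own
    batch each step, so the effective global batch size is
    per_worker_batch_size * num_workers."""
    global_batch = per_worker_batch_size * num_workers
    return math.ceil(num_examples / global_batch)

def steps_per_epoch_buggy(num_examples, per_worker_batch_size):
    """Buggy version: divides by the per-worker batch size only, so the
    computed epoch length is num_workers times too long and the
    mid-training evaluation point is never reached."""
    return math.ceil(num_examples / per_worker_batch_size)

# With illustrative numbers (768,000 examples, batch 128, 6 workers),
# the buggy epoch length is 6x the correct one, so an evaluation
# scheduled for "every 1 epoch" only fires at the very end of training.
print(steps_per_epoch(768000, 128, 6))        # 1000
print(steps_per_epoch_buggy(768000, 128))     # 6000
```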
Unfortunately, Horovod support is not well tested. Since tf_cnn_benchmarks is unmaintained and I don't know how to run it with Horovod, this will likely not be fixed.