When I ran tf_cnn_benchmarks with and without horovod, I got different evaluation sequences.
Without horovod:
python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 6
I got two evaluation points: one for each epoch, as specified in the command line options.
...
Running evaluation at global_step 1679
...
Running final evaluation at global_step 3347
However, with horovod, there was only one evaluation point, at the end of the whole training run (2 epochs).
mpirun -np 6 -H p10login1:6 python tf_cnn_benchmarks.py --data_dir ${HOME}/mldl/data/imagenet --model resnet50 --batch_size 128 --num_epochs 2 --eval_during_training_every_n_epochs 1 --num_gpus 1 --variable_update horovod
...
Running final evaluation at global_step 3347
It did not do an evaluation at the end of the first epoch, which is step number 1679. I traced the cause to self.batch_size on lines 1553 and 1554 of benchmark_cnn.py: with Horovod, self.batch_size is the per-worker batch size, so the computed epoch length is num_workers times too long. Replacing it with (self.batch_size * self.num_workers) seemed to work.
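To illustrate the off-by-a-factor bug, here is a minimal sketch of the epoch-boundary computation. The function and variable names are illustrative, not the actual code in benchmark_cnn.py; the assumption is that the evaluation trigger compares the global step against num_examples divided by the batch size:

```python
import math

def steps_per_epoch(num_examples, per_worker_batch_size, num_workers):
    """Correct version: with Horovod, every worker processes its own
    batch each step, so the effective global batch size is
    per_worker_batch_size * num_workers."""
    global_batch = per_worker_batch_size * num_workers
    return math.ceil(num_examples / global_batch)

def steps_per_epoch_buggy(num_examples, per_worker_batch_size):
    """Buggy version: divides by the per-worker batch size only, so the
    computed epoch length is num_workers times too long and the
    mid-training evaluation point is never reached."""
    return math.ceil(num_examples / per_worker_batch_size)

# With illustrative numbers (768,000 examples, batch 128, 6 workers),
# the buggy epoch length is 6x the correct one, so an evaluation
# scheduled for "every 1 epoch" only fires at the very end of training.
print(steps_per_epoch(768000, 128, 6))        # 1000
print(steps_per_epoch_buggy(768000, 128))     # 6000
```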
Unfortunately, Horovod support is not well tested. Since tf_cnn_benchmarks is unmaintained and I don't know how to run it with Horovod, this will likely not be fixed.