tensorflow / benchmarks

A benchmark framework for TensorFlow
Apache License 2.0

Gradients are not averaged in AllReduceSpec=nccl (variable_update=replicated) mode #466

Closed: zhao1157 closed this issue 4 years ago

zhao1157 commented 4 years ago

https://github.com/tensorflow/benchmarks/blob/5d03cf8e356d2ae17df440cdb612c378cbacf5ef/scripts/tf_cnn_benchmarks/batch_allreduce.py#L376

In AllReduceSpec=nccl (variable_update=replicated) mode, the gradients are summed, after which they should be averaged. But as far as I know, they are not averaged before the variables are updated. Did I get that right?
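
For illustration, here is a minimal sketch of the arithmetic in question. This is not the actual batch_allreduce.py code path; the device count and gradient values are made up, and each list entry stands in for one GPU's gradient of a single variable.

```python
import tensorflow as tf

num_gpus = 4
# Hypothetical per-GPU gradients for the same variable.
per_gpu_grads = [tf.constant([1.0, 2.0]) * (i + 1) for i in range(num_gpus)]

# What the NCCL all-reduce effectively produces: an elementwise sum.
summed_grad = tf.add_n(per_gpu_grads)    # [10.0, 20.0]

# The extra step I would have expected before the variables are updated:
averaged_grad = summed_grad / num_gpus   # [2.5, 5.0]
```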

reedwm commented 4 years ago

Correct, the gradients are summed instead of averaged. This is true with variable_update=replicated regardless of what the AllReduceSpec is.

Arguably, summing instead of averaging is not a bug. The tf.distribute API also sums gradients instead of averaging them, and expects you to divide the per-replica loss by the number of replicas to compensate. We don't average because averaging has a performance cost over summing. However, variable_update=parameter_server averages gradients instead, IIRC, and that inconsistency is a bug.
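
As a hedged sketch of that tf.distribute convention (the tiny model, loss, and GLOBAL_BATCH_SIZE below are made up for illustration, not taken from tf_cnn_benchmarks):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64  # made-up value for illustration
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD()

def train_step(features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        per_example_loss = tf.keras.losses.mean_squared_error(labels, predictions)
        # Divide by the *global* batch size rather than the per-replica batch
        # size, so that the gradients summed across replicas come out
        # equivalent to an average.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# train_step would be launched on every replica via strategy.run(...).
```

Because the loss is already scaled by 1/GLOBAL_BATCH_SIZE, summing the per-replica gradients gives the same result as averaging would, which is why the summing behavior is not treated as a bug there.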

Unfortunately, this will not get fixed since tf_cnn_benchmarks is unmaintained.