aaron276h opened this issue 6 years ago
Try halving the learning rate in the distributed case. There, each worker applies its own gradients to the variables, so every step two gradients of batch size 32 are applied to each variable, whereas in the non-distributed case a single gradient of batch size 64 is applied.
@reedwm Thanks for the suggestion, I will try that and post the results.
Just to confirm my understanding of the benchmarking code: in distributed_replicated (VariableMgrDistributedReplicated) mode, the gradients are never divided by the mini-batch size? If they were divided by the mini-batch size before being applied, I suspect that halving the learning rate in the distributed case (batch size 32 per machine) would not be equivalent to using the original learning rate with batch size 64 on one machine.
Yeah, you are correct. In replicated and distributed_replicated, gradients are added and not divided by mini-batch size. In parameter server, gradients are divided by mini-batch size (so the mean is effectively taken). So in replicated and distributed replicated, the learning rate must be adjusted. But in parameter server, the learning rate would not need to be adjusted.
This inconsistent behavior is very confusing, and I apologize for this. I hope to simplify this logic in the future.
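As a rough illustration of the two behaviors (a NumPy sketch with made-up gradients, not the benchmark's actual aggregation code):

```python
# Sketch of the two aggregation behaviors described above, for one variable
# and two hypothetical per-tower gradients. Not the benchmark's real code.
import numpy as np

grads = [np.array([0.2, -0.1]), np.array([0.4, 0.3])]  # made-up per-worker gradients

replicated_style = np.sum(grads, axis=0)         # replicated / distributed_replicated: summed
parameter_server_style = np.mean(grads, axis=0)  # parameter_server: divided, i.e. averaged

print(replicated_style)        # [0.6 0.2]
print(parameter_server_style)  # [0.3 0.1]
```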
@reedwm I experimented with using half the learning rate, and the result was even a bit further off: single machine 0.47, two machines with the same LR 0.44, two machines with half the LR 0.43. I also tried turning off batch normalization for all the configurations, but that did not reduce the discrepancy. Any suggestions on what we could try next?
I was wrong when I said you should halve the learning rate. Sorry about that! It turns out that in replicated/distributed_replicated, the learning rate should not be adjusted. This is because the gradients are summed instead of averaged, as I stated before. But this summing is good: if the gradients weren't summed but instead averaged, then you would want to increase the learning rate by a factor equal to the number of workers. (And it turns out gradients are summed across workers in parameter_server mode as well, but averaged across GPUs. But you don't have to worry about that.)
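To make that factor-of-N relationship concrete, here is a minimal NumPy sketch (just an illustration with made-up gradients, not the benchmark code): summed gradients with an unadjusted learning rate produce the same step as averaged gradients with the learning rate scaled up by the number of workers.

```python
# Sketch: summing per-worker gradients at the original LR gives the same
# update as averaging them with the LR scaled up by num_workers.
import numpy as np

num_workers = 2
lr = 0.01
worker_grads = [np.array([0.2, -0.1]), np.array([0.4, 0.3])]  # made-up gradients

step_summed = lr * np.sum(worker_grads, axis=0)                     # summed, LR unchanged
step_averaged = (lr * num_workers) * np.mean(worker_grads, axis=0)  # averaged, LR scaled up

assert np.allclose(step_summed, step_averaged)
```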
So I'm unsure why you are not getting the same accuracy with the same LR. The most likely scenario is that shift_ratio (set here) is not applied when datasets are enabled, which is the default. shift_ratio makes sure each worker processes different images, which will help convergence. Try running with --nouse_datasets --nodatasets_use_prefetch, which will disable datasets and enable shift_ratio.
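For intuition, the effect of shift_ratio is roughly the following (a sketch with made-up numbers, not the actual preprocessing code): each worker starts reading at a different offset, so the workers see different images.

```python
# Rough sketch of the shift_ratio idea: offset each worker's starting
# position so workers read different images. Illustrative only.
records = list(range(8))   # stand-in for the dataset's records
num_workers = 2
batch_per_worker = 2

def worker_batch(task_index):
    shift_ratio = task_index / num_workers          # e.g. 0.0 for worker 0, 0.5 for worker 1
    start = int(shift_ratio * len(records))
    return records[start:start + batch_per_worker]

print(worker_batch(0))  # [0, 1]
print(worker_batch(1))  # [4, 5]  -> different images than worker 0
```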
Another thing to try is setting --optimizer=sgd instead of --optimizer=momentum. The momentum optimizer has a momentum variable for each normal variable, but these momentum variables are shared among workers. So each worker updates the momentum variable independently, meaning it will be updated twice as frequently as in the 1-worker case.
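Roughly, the shared accumulator behaves like this (plain-Python sketch with made-up numbers, not the benchmark code):

```python
# Sketch (plain Python, made-up numbers) of a shared momentum accumulator
# being updated once per worker: with two workers it is updated twice as
# often as with one worker, even though the total gradient is the same.
momentum = 0.9
lr = 0.01

def momentum_step(var, accum, grad):
    accum = momentum * accum + grad   # standard momentum accumulator update
    var = var - lr * accum
    return var, accum

# One worker: a single step with the full gradient g.
g = 0.5
var_single, accum_single = momentum_step(1.0, 0.0, g)

# Two workers sharing the accumulator: each applies half the gradient,
# so the accumulator decays and grows twice per global batch.
var_shared, accum_shared = 1.0, 0.0
for worker_grad in (g / 2, g / 2):
    var_shared, accum_shared = momentum_step(var_shared, accum_shared, worker_grad)

print(var_single, accum_single)  # 0.995 0.5
print(var_shared, accum_shared)  # differs: the accumulator was updated twice
```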
So in short, try using --nouse_datasets --nodatasets_use_prefetch --optimizer=sgd and remove --optimizer=momentum. I should add that, as far as I know, we have not tested distributed convergence yet, so we could have other problems as well.
@reedwm Thanks, I will try running both single machine and distributed replicated with those flags, and post the results when they finish. I tried running with SGD instead of momentum earlier, and those runs also showed a discrepancy between the single- and multi-machine experiments.
We tried running with those config options and we found that:
- SGD, 1 machine, MB 128, LR 0.01: Accuracy@1 = 0.2851, Accuracy@5 = 0.5342
- SGD, 2 machines, MB 64+64, LR 0.01: Accuracy@1 = 0.3675, Accuracy@5 = 0.6346
- SGD, 2 machines, MB 64+64, LR 0.005: Accuracy@1 = 0.2612, Accuracy@5 = 0.5018
@reedwm I am afraid that tf_cnn_benchmarks ignores shift_ratio even when using --nouse_datasets and --nodatasets_use_prefetch, because _build_model calls _build_image_processing with shift_ratio=0, and self.image_preprocessor.minibatch only falls back to self.shift_ratio when the shift_ratio argument is less than zero. So shift_ratio is always 0 on the _build_model path, no matter which worker is running.
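In other words, the logic amounts to something like this sketch (a plain-Python paraphrase, not the actual tf_cnn_benchmarks source): because the caller passes 0 rather than a negative value, the per-worker self.shift_ratio is never used.

```python
# Paraphrase of the described behavior (not the real code): minibatch() only
# falls back to the preprocessor's own shift_ratio when the argument is
# negative, so passing 0 from _build_model pins shift_ratio to 0 everywhere.
class ImagePreprocessor:
    def __init__(self, shift_ratio):
        self.shift_ratio = shift_ratio  # per-worker value, e.g. task_index / num_workers

    def minibatch(self, shift_ratio=-1):
        if shift_ratio < 0:
            shift_ratio = self.shift_ratio  # fallback only taken for negative input
        return shift_ratio

pre = ImagePreprocessor(shift_ratio=0.5)  # worker 1 of 2
print(pre.minibatch(shift_ratio=0))       # 0   <- value passed by the caller wins
print(pre.minibatch())                    # 0.5 <- fallback would give the per-worker shift
```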
I have observed repeatedly that running tf_cnn_benchmarks on one machine vs. running it in distributed replicated mode with the same settings (same global mini-batch size) leads to different convergence.
Command I used to run on a single machine with one GPU:
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=replicated --data_name=imagenet --data_dir=$data/imagenet --init_learning_rate=0.010000 --optimizer=momentum --num_epochs=5 --summary_verbosity=1 --train_dir=$output_dir/single_machine
Command I used to run in distributed replicated mode with two machines, each with one GPU:
Workers (update task id on second machine):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=distributed_replicated --data_name=imagenet --data_dir=$data/imagenet --job_name=worker --ps_hosts=host0:2223,host1:2223 --worker_hosts=host0:2224,host1:2224 --init_learning_rate=0.010000 --optimizer=momentum --num_epochs=5 --train_dir=$output_dir/two-machines --summary_verbosity=1 --task_index=0
PS (update task id on second machine):
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=distributed_replicated --data_name=imagenet --data_dir=$data/imagenet-tf/imagenet --job_name=ps --ps_hosts=host0:2223,host1:2223 --worker_hosts=host0:2224,host1:2224 --task_index=0 --init_learning_rate=0.010000 --optimizer=momentum --num_epochs=5 --train_dir=$output_dir/two-machines --summary_verbosity=1
For these settings I get a top-1 (top-5) accuracy of:
- After 5 epochs: single machine 0.4910 (0.7481) vs. two machines 0.4655 (0.7276)
- After 10 epochs: single machine 0.5789 (0.8190) vs. two machines 0.5458 (0.7977)
- After 15 epochs: single machine 0.6081 (0.8373) vs. two machines 0.5760 (0.8166)
- After 20 epochs: single machine 0.6294 (0.8532) vs. two machines 0.5957 (0.8306)
While these are not large differences, they appear to be consistent across multiple runs. Since there is no learning rate modification at any point, the two runs have the same global mini-batch size, and they train on the same amount of data, I have not been able to find the cause of this discrepancy.
For completeness, I am attaching the command I use for evaluation:
python tf_cnn_benchmarks.py --eval=True --local_parameter_device=gpu --num_gpus=1 --batch_size=100 --model=resnet50 --variable_update=replicated --data_name=imagenet --data_dir=$data/imagenet --num_epochs=1 --train_dir=$PATH_TO_TRAIN_DIR_EVALUATED --summary_verbosity=1