kiddyboots216 / CommEfficient

PyTorch for benchmarking communication-efficient distributed SGD optimization algorithms
71 stars, 20 forks

Question about Cifar10 experiment command #4

Closed JuttaZhang closed 3 years ago

JuttaZhang commented 3 years ago

Hello, authors. When I run "python cv_train.py", the train_loss and train_acc are always NaN. Here is my command:

python cv_train.py --dataset_dir /home/data/cifar --dataset_name CIFAR10 --num_results_train 1 --train_dataloader_workers 4 --val_dataloader_workers 4 --num_devices 2 --error_type virtual --lr_scale 0.3 --num_workers 4 --num_clients 10000

Have I missed any important hyperparameters? Could you please share the exact commands you used? That would be very helpful.

kiddyboots216 commented 3 years ago

Hello. Setting num_workers to 4 means that only 4 clients participate in every iteration, so out of 10000 clients, each holding 5 datapoints, you sample only 4 * 5 = 20 datapoints per iteration. Because the data distribution is non-iid by default, this means you see at most 4 of the 10 classes at every iteration. That alone makes it unlikely that the model will converge.

Also, by default local_momentum=0.9 and virtual_momentum=0.0. As we note in our paper, local_momentum is rarely beneficial for convergence, so you will want to swap these values in your command.

Finally, num_results_train should be 2; I am guessing you modified the code somewhere.
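The arithmetic behind this can be sketched as follows. This is a minimal illustration, not code from the repo: it assumes a hypothetical non-iid split in which each client holds 5 datapoints from a single class, which matches the 10000-client CIFAR10 setting described above (the repo's actual sharding logic may differ).

```python
import random

NUM_CLIENTS = 10000       # --num_clients
NUM_WORKERS = 4           # --num_workers: clients sampled per iteration
POINTS_PER_CLIENT = 5     # 50000 CIFAR10 train images / 10000 clients
NUM_CLASSES = 10

# Hypothetical non-iid assignment: each client holds one class only.
client_class = [c % NUM_CLASSES for c in range(NUM_CLIENTS)]

random.seed(0)
sampled = random.sample(range(NUM_CLIENTS), NUM_WORKERS)
classes_seen = {client_class[c] for c in sampled}

# Only 4 * 5 = 20 datapoints are seen per iteration,
# drawn from at most 4 of the 10 classes.
print(len(sampled) * POINTS_PER_CLIENT)  # 20
print(len(classes_seen))                 # at most 4
```

With so few datapoints and classes per step, the per-iteration gradient is a heavily biased estimate of the full gradient, which is consistent with the divergence (NaNs) observed above.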

JuttaZhang commented 3 years ago

Thank you so much for your quick reply. I'll try it right now.

howard-yen commented 3 years ago

Is it possible to share the exact commands used to produce the results in the paper, including values such as num_devices, num_workers, and so on?