Closed JuttaZhang closed 3 years ago
Hello. `num_workers=4` means that only 4 devices participate in every iteration, so out of 10000 devices, each holding 5 datapoints, you will only sample 4 * 5 = 20 datapoints per iteration. Because the data distribution is non-iid by default, this means you are only seeing 4 of the 10 classes at every iteration, which in and of itself makes it unlikely that the model will converge. Also, by default `local_momentum=0.9` and `virtual_momentum=0.0`. As we note in our paper, local momentum is rarely beneficial for convergence, so you will want to swap these two values in your command. Finally, there should be 2 `num_results_train`... I am guessing you modified the code somewhere.
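To make the class-coverage point concrete, here is a small standalone sketch (not code from this repo; the single-class-per-device sharding is an assumption used only for illustration) showing how many distinct classes a round sees when you sample `num_workers` devices from a non-iid population:

```python
import random

# Illustrative sketch: assume non-iid sharding where each of 10000
# devices holds 5 datapoints, all from a single class.
NUM_CLASSES = 10
NUM_DEVICES = 10_000
device_class = [i % NUM_CLASSES for i in range(NUM_DEVICES)]

def avg_classes_seen(num_workers, rounds=1000, seed=0):
    """Average number of distinct classes covered per round when
    sampling num_workers devices uniformly at random."""
    rng = random.Random(seed)
    totals = 0
    for _ in range(rounds):
        sampled = rng.sample(range(NUM_DEVICES), num_workers)
        totals += len({device_class[d] for d in sampled})
    return totals / rounds

# With 4 workers, at most 4 of the 10 classes can appear in a round;
# with 100 workers, essentially all 10 classes appear every round.
print(avg_classes_seen(4))
print(avg_classes_seen(100))
```

So with `num_workers=4` the model never sees more than 4 classes in any single round, while a larger worker count restores full class coverage.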
Thank you so much for your quick reply. I'll try it right now.
Is it possible to share the exact commands that you used to produce the results in the paper, such as the values of `num_devices`, `num_workers`, and so on?
Hello, authors. When I run `python cv_train.py`, the train_loss and train_acc are always NaNs. Here is my command:
Have I missed some important hyperparameters? Could you please share the exact set of commands you used? That would be very helpful.