kiddyboots216 / CommEfficient

PyTorch for benchmarking communication-efficient distributed SGD optimization algorithms

Reproduce the results in the paper #8

Open jiahuigeng opened 1 year ago

jiahuigeng commented 1 year ago

Hi, I tried to reproduce the experimental results in the paper using the following command, but the logs do not look correct. Could you share the command lines you used for the paper? I am really interested in your work and would like to explore sketching techniques further.

python cv_train.py --dataset_name CIFAR10 --iid --num_workers 2 --lr_scale 0.4 --local_momentum=0.0 --num_devices 2 --num_clients 2

MY PID: 31280
5315 port in use, trying next...
Namespace(checkpoint_path='./checkpoint', dataset_dir='./dataset', dataset_name='CIFAR10', device='cuda', do_batchnorm=False, do_checkpoint=False, do_dp=False, do_finetune=False, do_iid=True, do_test=False, do_topk_down=False, dp_mode='worker', error_type='none', eval_before_start=False, fedavg_batch_size=-1, fedavg_lr_decay=1, finetune_path='./finetune', finetuned_from=None, k=50000, l2_norm_clip=1.0, lm_coef=1.0, local_batch_size=8, local_momentum=0.0, lr_scale=0.4, max_grad_norm=None, max_history=2, mc_coef=1.0, microbatch_size=-1, mode='sketch', model='ResNet9', model_checkpoint='gpt2', nan_threshold=999, noise_multiplier=0.0, num_blocks=20, num_candidates=2, num_clients=2, num_cols=500000, num_devices=2, num_epochs=24, num_fedavg_epochs=1, num_results_train=2, num_results_val=2, num_rows=5, num_workers=2, personality_permutations=1, pivot_epoch=5, port=5646, seed=21, share_ps_gpu=False, train_dataloader_workers=0, use_tensorboard=False, val_dataloader_workers=0, valid_batch_size=8, virtual_momentum=0, weight_decay=0.0005)
50000 625
Using BatchNorm: False
Finished initializing in 11.00 seconds
miniconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
CommEfficient/CommEfficient/utils.py:258: UserWarning: This overload of add is deprecated:
    add(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1055.)
  gradvec.add(args.weight_decay / args.num_workers, weights)

epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  down (MiB)  up (MiB)  total_time
1      0.0800  655.4752    2.3025      0.1009     2.3025     0.1014    0           59606     679.6477
2      0.1600  649.9156    2.3025      0.1008     2.3025     0.1014    0           59606     1343.1710
3      0.2400  649.3290    2.3025      0.1011     2.3025     0.1014    0           59606     2006.0574
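(Aside on the second warning above: it is triggered by the old positional overload of `Tensor.add`; current PyTorch expects the `alpha` keyword instead. A minimal, self-contained sketch of the substitution, using stand-in tensors rather than the actual variables from utils.py:)

```python
import torch

# Stand-ins for the tensors in CommEfficient/utils.py (illustrative only)
gradvec = torch.zeros(10)
weights = torch.ones(10)
weight_decay, num_workers = 5e-4, 2

# Deprecated positional overload that triggers the warning:
#   gradvec.add(weight_decay / num_workers, weights)
# Keyword form accepted by current PyTorch:
gradvec = gradvec.add(weights, alpha=weight_decay / num_workers)
# In-place variant, if that matches how gradvec is used in the surrounding code:
# gradvec.add_(weights, alpha=weight_decay / num_workers)
```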

kiddyboots216 commented 1 year ago

Hello. Could you try the settings we use in the paper? That is, don't add the --iid flag, and use the numbers of workers and clients from the paper rather than 2. With 2 clients and 2 workers you are splitting the entire CIFAR10 dataset into 2 chunks and then training on the whole dataset every round. That setting is nearly identical to full-batch training, so you may need to follow the optimization guidelines of something like the LAMB optimizer.
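(For reference, such a run would look roughly like the command below. The client and worker counts are placeholders to be replaced with the values reported in the paper, not the paper's actual settings; the remaining flags are simply carried over from the command posted above, with --iid dropped.)

python cv_train.py --dataset_name CIFAR10 --num_workers <WORKERS_FROM_PAPER> --num_clients <CLIENTS_FROM_PAPER> --num_devices 2 --lr_scale 0.4 --local_momentum=0.0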

Antonio-demo commented 1 year ago

Hello, I'm trying to reproduce your experimental results with the code accompanying the paper, but I cannot get it to run correctly. Could you explain how to run it?

kiddyboots216 commented 1 year ago

Hi @Antonio-demo, I think you can create a new issue and provide some more details, e.g. the command that you are running.