Open jiahuigeng opened 1 year ago
Hello. Could you try the settings that we use in the paper? That is, don't add the --iid flag, and use the number of workers and number of clients from the paper instead of 2. With 2 clients and 2 workers you are splitting the entire CIFAR10 dataset into 2 chunks and training on the entire dataset at each epoch. That setting is nearly identical to full-batch training, so you may need to follow the optimization guidelines of something like the LAMB optimizer. A command along the lines of the one sketched below should be closer to the paper's setup.
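For reference (the worker and client counts below are placeholders, not the paper's exact values; substitute the numbers reported in the paper), something like:

python cv_train.py --dataset_name CIFAR10 --num_workers <paper value> --num_clients <paper value> --lr_scale 0.4 --local_momentum 0.0 --num_devices 2

Note that --iid is deliberately omitted here, per the comment above.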
Hello, I'm trying to reproduce your experimental results with the code provided for this paper, but I can't get it to run correctly. Could you explain how to run the code correctly?
Hi @Antonio-demo, I think you can create a new issue and provide some more details, e.g. the command that you are running.
Hi, I tried to reproduce the experimental results in the paper using the command below, but the logs don't look correct. Could you share the command line you used for the paper? I'm really interested in your work and would like to explore sketching techniques further.
python cv_train.py --dataset_name CIFAR10 --iid --num_workers 2 --lr_scale 0.4 --local_momentum=0.0 --num_devices 2 --num_clients 2
MY PID: 31280
5315 port in use, trying next...
Namespace(checkpoint_path='./checkpoint', dataset_dir='./dataset', dataset_name='CIFAR10', device='cuda', do_batchnorm=False, do_checkpoint=False, do_dp=False, do_finetune=False, do_iid=True, do_test=False, do_topk_down=False, dp_mode='worker', error_type='none', eval_before_start=False, fedavg_batch_size=-1, fedavg_lr_decay=1, finetune_path='./finetune', finetuned_from=None, k=50000, l2_norm_clip=1.0, lm_coef=1.0, local_batch_size=8, local_momentum=0.0, lr_scale=0.4, max_grad_norm=None, max_history=2, mc_coef=1.0, microbatch_size=-1, mode='sketch', model='ResNet9', model_checkpoint='gpt2', nan_threshold=999, noise_multiplier=0.0, num_blocks=20, num_candidates=2, num_clients=2, num_cols=500000, num_devices=2, num_epochs=24, num_fedavg_epochs=1, num_results_train=2, num_results_val=2, num_rows=5, num_workers=2, personality_permutations=1, pivot_epoch=5, port=5646, seed=21, share_ps_gpu=False, train_dataloader_workers=0, use_tensorboard=False, val_dataloader_workers=0, valid_batch_size=8, virtual_momentum=0, weight_decay=0.0005)
50000 625
Using BatchNorm: False
Finished initializing in 11.00 seconds
miniconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
CommEfficient/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1055.)
  grad_vec.add_(args.weight_decay / args.num_workers, weights)

epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  down (MiB)  up (MiB)  total_time
1      0.0800  655.4752    2.3025      0.1009     2.3025     0.1014    0           59606     679.6477
2      0.1600  649.9156    2.3025      0.1008     2.3025     0.1014    0           59606     1343.1710
3      0.2400  649.3290    2.3025      0.1011     2.3025     0.1014    0           59606     2006.0574