Closed: HernandezEduin closed this issue 2 years ago.

Hello,

I'm trying to run the code in cv_train.py using the command line arguments "--dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1" with PyTorch 1.8.0, but I am met with the following error:

Have you encountered this error? I've tried running the code on both Linux and Windows, and the same issue occurs in both. Am I missing a command line argument?

What Python and PyTorch version do you use?
Hi, I think this problem comes from the parameters. Would you like to share the runnable commands you used with this repo?
Yes, I tried the following parameters:
- --dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1
- --dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 2 --num_clients 2 --num_results_train 22
- --dataset_name CIFAR10 --num_results_train 2 --train_dataloader_workers 4 --val_dataloader_workers 4 --num_devices 2 --error_type virtual --lr_scale 0.3 --num_workers 4 --num_clients 10 --local_momentum 0
All of which show the above error.
I believe passing --lr_scale 0.4 saves me from this error, but then I run into

```
CommEfficient/CommEfficient/fed_worker.py", line 228, in local_step
    assert args.mode != "sketch"
AssertionError
```

with the first two commands, and

```
CommEfficient/fed_aggregator.py", line 230, in _call_train
    per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
ZeroDivisionError: integer division or modulo by zero
```

with the third command.
If you manage to run this project, please share how you did it.
Hello, let me try to address some of these issues.
The first issue is that you have not passed an argument for the learning rate. As mentioned, you can fix this either by setting a default_lr in cv_train.py or by passing the --lr_scale argument to the training script.
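To illustrate the failure mode, here is a minimal, hypothetical sketch (not the repo's actual code; the exact error the original poster hit is not shown above): an optional learning-rate flag with no default stays None, and the first arithmetic that touches it fails.

```python
import argparse

# Hypothetical sketch: an optional LR flag with no default stays None,
# and the first arithmetic that touches it raises a TypeError.
parser = argparse.ArgumentParser()
parser.add_argument("--lr_scale", type=float, default=None)

args = parser.parse_args([])            # user forgot --lr_scale
# lr = args.lr_scale * 0.5             # TypeError: NoneType * float

args = parser.parse_args(["--lr_scale", "0.4"])
lr = args.lr_scale * 0.5                # fine once the flag is supplied
print(lr)
```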
The error on line 228 occurs when you try to use local momentum (which is enabled by default) together with mode == "sketch" (and sketch is the default mode). So you will need to modify the arguments: either pass --local_momentum 0.0 or choose a mode other than sketch.
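For reference, here is a minimal sketch of the guard as reconstructed from the traceback above; the real local_step in fed_worker.py does more, and everything beyond args.mode and the quoted assertion is an assumption.

```python
# Minimal sketch of the incompatibility check, reconstructed from the
# traceback above; the real local_step in fed_worker.py does more.
def local_step(args):
    if args.local_momentum > 0:
        # A worker cannot keep an exact momentum buffer on top of
        # sketched (compressed) gradients, so the combination is refused.
        assert args.mode != "sketch"
    # ... rest of the local update would go here ...
```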
The third error is caused by a multiprocessing failure. You passed --num_devices 2 and did not specify --share_ps_gpu, so n_worker_gpus is set to 1. We expect this one worker process to start, but if there is any error or slowdown in initializing it, self.update_forward_grad_ps will still be an empty list by the time we get around to processing it on line 230. You can use cProfile to inspect why the worker is not starting.
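To make the failure mode concrete, here is a hypothetical sketch; the variable names are taken from the traceback above, and everything else is assumed.

```python
# Hypothetical sketch: if no worker process initialized in time, the
# process list is empty and the integer division on line 230 fails.
worker_batches = [0, 1, 2, 3]   # batches waiting to be dispatched
update_forward_grad_ps = []     # worker processes that actually started

if len(update_forward_grad_ps) == 0:
    # This is the state the aggregator ends up in when the worker fails
    # to initialize; the division below would raise ZeroDivisionError.
    print("no workers started; dividing would raise ZeroDivisionError")
else:
    per_proc = len(worker_batches) // len(update_forward_grad_ps)
```

For profiling, something like `python -m cProfile -o startup.prof cv_train.py ...` would show where worker startup stalls.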
Actually I'll leave this open in case you have any other issues.
Thanks, those actually solved my issues.