kiddyboots216 / CommEfficient

PyTorch for benchmarking communication-efficient distributed SGD optimization algorithms

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe' #7

Closed HernandezEduin closed 2 years ago

HernandezEduin commented 2 years ago

Hello,

I'm trying to run cv_train.py with the command-line arguments "--dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1" on PyTorch 1.8.0, but I am met with the following error:

File "D:\CommEfficient\CommEfficient\cv_train.py", line 405, in lr_scheduler = LambdaLR(opt, lr_lambda=lambda_step)

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 203, in init super(LambdaLR, self).init(optimizer, last_epoch, verbose)

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 77, in init self.step()

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 152, in step values = self.get_lr()

File "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 250, in get_lr return [base_lr * lmbda(self.last_epoch)

File` "D:\anaconda3\envs\test\lib\site-packages\torch\optim\lr_scheduler.py", line 250, in return [base_lr * lmbda(self.last_epoch)

File "D:\Dewen\GitHub\CommEfficient-master\CommEfficient\cv_train.py", line 404, in lambda_step = lambda step: lr_schedule(step / spe)

File "D:\Dewen\GitHub\CommEfficient-master\CommEfficient\utils.py", line 28, in call return np.interp([t], self.knots, self.vals)[0]

File "<__array_function__ internals>", line 180, in interp File "D:\anaconda3\envs\test\lib\site-packages\numpy\lib\function_base.py", line 1570, in interp return interp_func(x, xp, fp, left, right)

TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Have you encountered this error? I've tried running the code on both Linux and Windows, but I hit the same issue on both. Am I missing a command line argument?
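For what it's worth, the NumPy error itself is easy to reproduce in isolation whenever the interpolated values have object dtype, e.g. because one of them is None (a minimal sketch of my own, not code from this repo):

```python
import numpy as np

# np.interp needs float64 values; an object-dtype array (here caused by the
# None) produces the same "Cannot cast ... according to the rule 'safe'" error.
knots = [0, 5, 24]
vals = [0.0, None, 0.0]        # object dtype because of the None
np.interp([2], knots, vals)    # TypeError: Cannot cast array data from dtype('O') ...
```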

HernandezEduin commented 2 years ago

What Python and PyTorch versions do you use?

jiahuigeng commented 2 years ago

> What Python and PyTorch versions do you use?

Hi, I think this problem comes from the parameters. Would you like to share the runnable commands you use for this repo?

HernandezEduin commented 2 years ago

Yes, I tried the following parameters:

  1. --dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1
  2. --dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 2 --num_clients 2 --num_results_train 22
  3. --dataset_name CIFAR10 --num_results_train 2 --train_dataloader_workers 4 --val_dataloader_workers 4 --num_devices 2 --error_type virtual --lr_scale 0.3 --num_workers 4 --num_clients 10 --local_momentum 0

All of which show the above error.

jiahuigeng commented 2 years ago

> Yes, I tried the following parameters:
>
>   1. --dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 1 --num_clients 1
>   2. --dataset_name CIFAR10 --iid --share_ps_gpu --num_workers 2 --num_clients 2 --num_results_train 22
>   3. --dataset_name CIFAR10 --num_results_train 2 --train_dataloader_workers 4 --val_dataloader_workers 4 --num_devices 2 --error_type virtual --lr_scale 0.3 --num_workers 4 --num_clients 10 --local_momentum 0
>
> All of which show the above error.

I found that passing --lr_scale 0.4 gets me past this error.

But then I run into

```
CommEfficient/CommEfficient/fed_worker.py", line 228, in local_step
    assert args.mode != "sketch"
AssertionError
```

with the first two commands, and

```
CommEfficient/fed_aggregator.py", line 230, in _call_train
    per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
ZeroDivisionError: integer division or modulo by zero
```

with the third command.

If you manage to run this project, please share your working commands with me.

kiddyboots216 commented 2 years ago

Hello, let me try to address some of these issues.

The first issue you are having is that you have not passed an argument for the learning rate. As mentioned, you can fix this either by passing a default_lr argument in cv_train or by passing the --lr_scale argument to the training script.
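To make the cause concrete, here is a hedged sketch of the chain (the schedule shape, knot values, and class body are illustrative; only the np.interp call mirrors utils.py line 28 from the traceback above):

```python
import numpy as np

class PiecewiseLinear:
    """Simplified stand-in for the schedule in utils.py."""
    def __init__(self, knots, vals):
        self.knots, self.vals = knots, vals

    def __call__(self, t):
        # the call from utils.py line 28 in the traceback
        return np.interp([t], self.knots, self.vals)[0]

lr_scale = None                                    # no learning rate argument given
broken = PiecewiseLinear([0, 5, 24], [0.0, lr_scale, 0.0])
# broken(0) -> TypeError: Cannot cast array data from dtype('O') to dtype('float64')

lr_scale = 0.4                                     # e.g. --lr_scale 0.4
working = PiecewiseLinear([0, 5, 24], [0.0, lr_scale, 0.0])
working(0)                                         # returns 0.0 as expected
```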

The error on line 228 occurs when you try to use local momentum (which is enabled by default) together with mode == "sketch" (which is also the default). You will need to modify the arguments: either pass local_momentum=0.0 or choose a mode other than sketch.
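A hedged sketch of that incompatibility (the default values here are illustrative; only the assert itself comes from the fed_worker.py traceback above):

```python
from argparse import Namespace

# Illustrative stand-in for the parsed arguments; the real defaults live in
# the repo's argument parser, but both defaults hit this combination.
args = Namespace(mode="sketch", local_momentum=0.9)

# The guard from fed_worker.py line 228: local momentum is not allowed
# together with sketched aggregation, so the default combination fails here.
if args.local_momentum != 0:
    assert args.mode != "sketch"   # AssertionError with the values above

# Either change passes the check:
args = Namespace(mode="sketch", local_momentum=0.0)   # --local_momentum 0
# ...or keep local momentum and pick any mode other than "sketch".
```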

The third error occurs because of a multiprocessing problem. You passed num_devices=2 and did not specify share_ps_gpu, so n_worker_gpus is set to 1. We expect this one worker process to start, but if there is any error or slowdown while initializing it, self.update_forward_grad_ps will still be an empty list by the time we get to line 230. You can use cProfile to inspect why the worker is not starting.
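And a hedged sketch of the failure mode behind the third error (variable names follow the traceback quoted above; the surrounding code is illustrative):

```python
# If no worker process has registered itself by the time training starts,
# the list of worker handles is empty and the batch-splitting division in
# fed_aggregator.py line 230 divides by zero.
update_forward_grad_ps = []                  # no workers started successfully
worker_batches = list(range(8))              # batches waiting to be assigned

per_proc = len(worker_batches) // len(update_forward_grad_ps)
# ZeroDivisionError: integer division or modulo by zero
```

To see where the worker startup stalls, the script can be run under the standard-library profiler, e.g. `python -m cProfile -o train.prof cv_train.py <your arguments>`, and the resulting file inspected with pstats (output filename here is just an example).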

kiddyboots216 commented 2 years ago

Actually, I'll leave this open in case you have any other issues.

HernandezEduin commented 2 years ago

Thanks, those actually solved my issues.