kiddyboots216 / CommEfficient

PyTorch for benchmarking communication-efficient distributed SGD optimization algorithms

ZeroDivisionError: integer division or modulo by zero #11

Open underwhitee opened 1 year ago

underwhitee commented 1 year ago

per_proc = len(worker_batches) // len(self.update_forward_grad_ps) raises this error. How should I set the number of processes and clients so that update_forward_grad_ps does not end up as an empty array?

kiddyboots216 commented 1 year ago

Hi, are you currently getting the error where that's an empty array? Could you share your setup details? I have typically only had this problem very transiently, and it is fixed by increasing the number of workers; it can also happen if computing the gradient takes a really long time. To be honest, this code was written three years ago, when multiprocessing libraries were in a very different state. At this point, if I were going to write the code again, or even use it for another paper, I would use libraries that don't expose the user to as much churn from lower-level processes.

underwhitee commented 1 year ago

Sorry to take so long to reply to you. My parameter settings are as follows:

mode = "sketch"
num_clients = 20
num_workers = 20
num_devices = 1
share_ps_gpu (defined with action="store_true")

The problem is described as follows:

File "CommEfficient\fed_aggregator.py", line 232, in _call_train
    per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
ZeroDivisionError: integer division or modulo by zero

I encountered this problem when running cv_train.py. I think these parameters may be causing it. If you need the values of any other parameters, please let me know. Thank you very much!

kiddyboots216 commented 1 year ago

Alright, so I think the issue is that you've got 20 clients and 20 workers. This means you're trying to get through the entire dataset at each iteration. Can you try, say, 100 clients and 20 workers? You can also try increasing the timeout; try 900 s.

xiayuanj commented 1 year ago

Hello, my device only has one GPU, and this problem also occurs when executing the code. Have you solved it?

kiddyboots216 commented 1 year ago

Hi, this error occurs when the worker processes do not enqueue to update_forward_grad_ps in time. Can you try increasing the timeout or increasing the number of clients? If you use, for example, clients=1 and workers=1, then you're trying to process the entire dataset at each iteration, and the default timeout is (perhaps) not long enough to get through the entire dataset with only one DataLoader worker.
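To make the clients/workers ratio concrete, here is a tiny sketch of the reasoning above. The round semantics are assumed from this discussion rather than taken from the CommEfficient scheduling code: each round draws num_workers client shards out of num_clients, so when the two are equal every round has to touch the whole dataset before the queue timeout expires.

```python
# Assumed round semantics (illustration only, not the CommEfficient source):
# each round samples num_workers of the num_clients client shards.
def fraction_of_data_per_round(num_clients: int, num_workers: int) -> float:
    return min(num_workers, num_clients) / num_clients

print(fraction_of_data_per_round(num_clients=20, num_workers=20))   # 1.0 -> full dataset every round
print(fraction_of_data_per_round(num_clients=100, num_workers=20))  # 0.2 -> much lighter rounds
```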

xiayuanj commented 1 year ago

Thanks for your reply. I've tried increasing the number of clients and workers and I still get this problem. I think it is the number of devices that causes it, as shown below. My machine has only one GPU, so I set num_devices to 1. During execution, if num_devices=1 and share_ps_gpu=False, then n_worker_gpus=0. This means the for loop below never executes, so the update_forward_grad_ps list stays empty.

[screenshot of fed_aggregator.py: the loop over n_worker_gpus that spawns the worker processes appended to update_forward_grad_ps]
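For readers following along, here is a minimal sketch of the failure mode described above, paraphrased from this report rather than copied from fed_aggregator.py (the variable names and the exact n_worker_gpus formula are assumptions):

```python
# Sketch of the reported failure mode (assumed logic, not the exact source):
num_devices = 1
share_ps_gpu = False

# One GPU is reserved for the parameter server unless it is shared.
n_worker_gpus = num_devices if share_ps_gpu else num_devices - 1   # -> 0

update_forward_grad_ps = []
for _ in range(n_worker_gpus):                 # range(0): the body never runs
    update_forward_grad_ps.append(object())    # stand-in for a worker process handle

worker_batches = [None] * 8                    # placeholder batches
# Reproduces the reported crash: dividing by len([]) raises ZeroDivisionError.
per_proc = len(worker_batches) // len(update_forward_grad_ps)
```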

kiddyboots216 commented 1 year ago

Oh I see! Yeah, so you need to set share_ps_gpu=True when you run the code. That way, the workers can share a GPU with the parameter server. This will limit the size of the model you're able to run, since you have to hold two copies in memory at the same time, but it's necessary if you are running on one GPU.
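As a rough way to gauge whether a model fits under this constraint, the snippet below estimates the parameter memory for two resident copies (it ignores activations and optimizer state). The torch.nn.Linear model is only a stand-in for whatever model cv_train.py actually builds:

```python
import torch

def param_bytes(model: torch.nn.Module) -> int:
    # Total bytes occupied by the model's parameters.
    return sum(p.numel() * p.element_size() for p in model.parameters())

model = torch.nn.Linear(1024, 1024)   # stand-in for the real training model
one_copy_mb = param_bytes(model) / 1e6
print(f"one copy: {one_copy_mb:.1f} MB; PS + worker on one GPU: {2 * one_copy_mb:.1f} MB")
```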

xiayuanj commented 1 year ago

I tried that, but ran into a new problem, shown below. [screenshot of the NCCL error raised at torch.distributed.reduce(sum_g, 0)] So I modified torch.distributed.reduce(sum_g, 0), as shown below. [screenshot of the modified call] After the modification, the code seemed to hang, as if it were stuck in an endless loop.
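For reference, the call under discussion is the standard torch.distributed collective. The snippet below is a self-contained, single-process illustration of its usage (gloo backend, so it runs without NCCL or a GPU); that sum_g is the aggregated gradient and rank 0 the parameter server is an assumption drawn from this thread, not from the source:

```python
import os
import torch
import torch.distributed as dist

# Single-process illustration of torch.distributed.reduce (real API).
# gloo backend and world_size=1 so no NCCL or GPU is required.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

sum_g = torch.ones(4)        # stand-in for the aggregated gradient
dist.reduce(sum_g, dst=0)    # sum-reduce onto rank 0 (default op is SUM)
print(sum_g)

dist.destroy_process_group()
```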

kiddyboots216 commented 1 year ago

Could you revert the change to torch.distributed.reduce and add these lines? Could you also try export NCCL_DEBUG=INFO and export NCCL_DEBUG_SUBSYS=ALL?
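If setting them in the shell is inconvenient, the same variables can be set from Python instead. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; placing these lines at the top of the training script is just a suggestion, not something from the repo:

```python
import os

# Must run before torch.distributed / NCCL initializes, so that NCCL prints
# verbose diagnostics instead of the terse "invalid usage" error.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
```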

xiayuanj commented 1 year ago

I have tried these commands and they don't work.

With torch.distributed.reduce(sum_g, 0) reverted, I tried the export NCCL_DEBUG=INFO and export NCCL_DEBUG_SUBSYS=ALL that you suggested, but I still get the same problem.

When I modify torch.distributed.reduce(sum_g, 0), the command line outputs "batch queue was empty". [screenshot of the "batch queue was empty" output]

kiddyboots216 commented 1 year ago

The export commands just add some environment variables to make the error message more useful. The "NCCL error: invalid usage" message you were originally getting is not descriptive, because it could be a versioning error.