pytorch / pytorch

How to support single-process-multiple-devices in DistributedDataParallel for devices other than CUDA #35372

Open JohnLLLL opened 4 years ago

JohnLLLL commented 4 years ago

Hi,

I am investigating how to extend DistributedDataParallel to accelerator devices other than CUDA devices, not only to support single-process-single-device but also single-process-multiple-devices and multiple-processes-multiple-devices.

There are a lot of CUDA dependencies in DistributedDataParallel.

My question is:

  1. How can I override the CUDA-specific logic and dispatch gather and scatter (and the other APIs used) to the c10d backend without modifying distributed.py? https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py
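
For reference, this is roughly what the multi-device mode looks like from the user's side, and where the scatter/gather calls come into play (a minimal sketch; the backend, init_method, and device indices are placeholders, not a recommended setup):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder single-process setup; the backend and init_method are assumptions.
dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

model = torch.nn.Linear(8, 4).cuda(0)

# single-process-multiple-devices: one process drives two GPUs. In this
# mode DDP scatters each input batch across the local model replicas in
# forward() and gathers the outputs back onto output_device, which is
# where the CUDA-specific scatter/gather dependency comes from.
ddp_model = DDP(model, device_ids=[0, 1], output_device=0)

out = ddp_model(torch.randn(16, 8).cuda(0))
```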

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

pritamdamania87 commented 3 years ago

Note that we are actually deprecating single-process-multiple-devices mode in DDP and plan to support only single-process-single-device: https://github.com/pytorch/pytorch/issues/47012.
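
For anyone landing here after the deprecation, the supported pattern is one process per device, along these lines (a sketch; the address, port, and model are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int) -> None:
    # One process per device: each rank owns exactly one GPU, so DDP
    # never needs to scatter inputs or gather outputs within a process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = torch.nn.Linear(8, 4).to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # exactly one local device

    out = ddp_model(torch.randn(16, 8).to(rank))
    out.sum().backward()  # gradients are all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```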

I don't think gather and scatter are called anymore after the deprecation. @SciPioneer We probably should remove gather and scatter as well since they are unused?

Note, though, that a lot of the heavy lifting for DDP actually happens in the c10d reducer, where there are many CUDA dependencies (ex: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/reducer.cpp#L592). AMD ROCm got around this by essentially mimicking the CUDA APIs using hipify (see https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/c10/cuda/README.md and https://github.com/pytorch/pytorch/blob/ba694520e5004b74b575614f9d7f86a26436d61b/tools/amd_build/build_amd.py).
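
To make the hipify idea concrete: it is essentially a source-to-source rewrite of CUDA identifiers into their HIP equivalents. A toy illustration follows (the real tool in tools/amd_build/build_amd.py is far more thorough; the mapping here shows only a few sample entries):

```python
# Toy illustration of the hipify approach: rewrite CUDA identifiers in
# the source tree to their HIP equivalents, so the same reducer code
# paths compile against ROCm. Only a few sample mappings are shown.
CUDA_TO_HIP = {
    "cudaStream_t": "hipStream_t",
    "cudaMemcpyAsync": "hipMemcpyAsync",
    "cudaStreamSynchronize": "hipStreamSynchronize",
}

def hipify(source: str) -> str:
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source
```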

I think the only way to support this would be to make the c10d reducer device-agnostic. cc @zhaojuanmao @rohan-varma
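
In other words, the reducer's CUDA calls would need to dispatch on the device type rather than assume CUDA. A hypothetical sketch of that shape in Python (synchronize_device is made up for illustration; the real reducer lives in C++):

```python
import torch

def synchronize_device(device: torch.device) -> None:
    # Hypothetical helper: dispatch on device.type instead of calling
    # torch.cuda.synchronize() unconditionally, so a new accelerator
    # backend can plug in its own branch without patching the reducer.
    if device.type == "cuda":
        torch.cuda.synchronize(device)
    elif device.type == "cpu":
        pass  # CPU ops are synchronous; nothing to do
    else:
        raise NotImplementedError(
            f"no synchronization hook registered for device type {device.type!r}")
```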

zhaojuanmao commented 3 years ago

This may be something to consider in composable DDP. cc @mrshenli @SciPioneer