One related note is that I was not having any luck getting things to run when explicitly specifying the backend as `nccl` or `gloo`, but I found at least one workaround that may belong in the documentation. `nccl` was throwing exceptions and the runtime seemed to hang because no communication was happening (likely related to https://github.com/pytorch/pytorch/issues/18300), and `gloo` was raising errors related to hostname lookup that I still do not fully understand.
In my case, I found that setting `GLOO_SOCKET_IFNAME` helped me work around the latter issue, and I am now properly up and running. With a local machine, this value can be set universally to `lo`, although a general solution is obviously more complex.
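
For anyone who hits the same thing, this is a minimal sketch of the workaround; the master address/port, rank, and world size below are placeholders for whatever your launch commands actually use, and the variable can equally be exported in the shell before launching:

```python
import os

import torch.distributed as dist

# Pin gloo to a specific network interface to avoid the hostname-lookup
# failures; on a single local machine the loopback interface is sufficient.
os.environ["GLOO_SOCKET_IFNAME"] = "lo"

# Placeholder rendezvous settings -- substitute whatever your launch
# commands actually use for the master address, rank, and world size.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
```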
Updated the commands in the READMEs to use `--distributed_backend`. Sorry for the documentation issues -- some of the commands haven't been updated as we've updated various parts of the system!
Regarding needing to set `GLOO_SOCKET_IFNAME` -- this is a known issue that sometimes comes up when using the `gloo` backend with PyTorch. For example: https://discuss.pytorch.org/t/try-to-use-docker-cluster-without-gpu-to-run-distributed-training-but-connect-refused/52288/3.
The NCCL backend for hybrid setups is still a work in progress (as you saw in that PyTorch issue). However, you do want to use NCCL if you are running a pure data-parallelism setup.
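
For context, a pure data-parallel run with NCCL looks roughly like the sketch below; the model, rank, and rendezvous settings are placeholders rather than the actual runtime code:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings; one process is launched per GPU.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
dist.init_process_group(backend="nccl",
                        init_method="tcp://127.0.0.1:29500",
                        rank=rank, world_size=world_size)

# Each process drives one GPU and wraps the full model in DDP.
torch.cuda.set_device(rank)
model = torch.nn.Linear(1024, 1024).cuda(rank)  # stand-in for the real model
model = DDP(model, device_ids=[rank])
```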
Me again... this one is not urgent, and it may not even be an issue, but I want to capture it as I go just in case.
The top-level README and the runtime README both have examples of running main_with_runtime.py without setting the `--distributed_backend` parameter. When I try to run a single-machine, multi-GPU hybrid parallel scenario without specifying that parameter, I see the following error raised by torch:

I am running each command as follows:
Where ID is 0, 1, 2, and 3 for the four different processes I am trying to run. Does the documentation need updating, or am I doing things incorrectly?
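
In case it helps narrow things down, my guess is that the backend flag gets passed straight through to init_process_group, roughly as in the hypothetical sketch below (not the actual main_with_runtime.py code), so leaving --distributed_backend unset would hand torch backend=None:

```python
import argparse

import torch.distributed as dist

# Hypothetical illustration only -- not the actual main_with_runtime.py code.
parser = argparse.ArgumentParser()
parser.add_argument("--distributed_backend", type=str, default=None,
                    help="distributed backend to use (gloo|nccl)")
args = parser.parse_args()

# If --distributed_backend is omitted, backend ends up as None here, which
# would explain the error torch raises in my runs.
dist.init_process_group(backend=args.distributed_backend,
                        init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=4)
```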