One related note is that I was not having any luck getting things to run when explicitly specifying the backend as `nccl` or `gloo`, but I found at least one workaround that may belong in the documentation. `nccl` was throwing exceptions and the runtime seemed to hang because no communication was happening (likely related to https://github.com/pytorch/pytorch/issues/18300), and `gloo` was raising errors related to hostname lookup that I still do not fully understand.
In my case, I found that setting `GLOO_SOCKET_IFNAME` helped me work around the latter issue, and I am now properly up and running. With a local machine, this value can be set universally to `lo`, although a general solution is obviously more complex.
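
For anyone who hits the same thing, this is a minimal sketch of the workaround; the master address/port, rank, and world size below are placeholders for whatever your launch commands actually use, and the variable can equally be exported in the shell before launching:

```python
import os

import torch.distributed as dist

# Pin gloo to a specific network interface to avoid the hostname-lookup
# failures; on a single local machine the loopback interface is sufficient.
os.environ["GLOO_SOCKET_IFNAME"] = "lo"

# Placeholder rendezvous settings -- substitute whatever your launch
# commands actually use for the master address, rank, and world size.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
```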
Updated the commands in the READMEs to use `--distributed_backend`. Sorry for the documentation issues -- some of the commands haven't been updated as we've updated various parts of the system!
Regarding needing to set `GLOO_SOCKET_IFNAME` -- this is a known issue that sometimes comes up when using the `gloo` backend with PyTorch. For example: https://discuss.pytorch.org/t/try-to-use-docker-cluster-without-gpu-to-run-distributed-training-but-connect-refused/52288/3.
The NCCL backend for hybrid setups is still a work in progress (as you saw in that PyTorch issue). However, you do want to use NCCL if you are running a pure data-parallelism setup.
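
For context, a pure data-parallel run with NCCL looks roughly like the sketch below; the model, rank, and rendezvous settings are placeholders rather than the actual runtime code:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings; one process is launched per GPU.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
dist.init_process_group(backend="nccl",
                        init_method="tcp://127.0.0.1:29500",
                        rank=rank, world_size=world_size)

# Each process drives one GPU and wraps the full model in DDP.
torch.cuda.set_device(rank)
model = torch.nn.Linear(1024, 1024).cuda(rank)  # stand-in for the real model
model = DDP(model, device_ids=[rank])
```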
Me again... this one is not urgent, and it may not even be an issue, but I want to capture it as I go just in case.
The top-level README and the runtime README both have examples of running main_with_runtime.py without setting the `--distributed_backend` parameter. When I try to run a single-machine, multi-GPU hybrid parallel scenario without specifying that parameter, I see the following error raised by torch:

I am running each command as follows:
Where ID is 0, 1, 2, and 3 for the four different processes I am trying to run. Does the documentation need updating, or am I doing things incorrectly?
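
In case it helps narrow things down, my guess is that the backend flag gets passed straight through to init_process_group, roughly as in the hypothetical sketch below (not the actual main_with_runtime.py code), so leaving --distributed_backend unset would hand torch backend=None:

```python
import argparse

import torch.distributed as dist

# Hypothetical illustration only -- not the actual main_with_runtime.py code.
parser = argparse.ArgumentParser()
parser.add_argument("--distributed_backend", type=str, default=None,
                    help="distributed backend to use (gloo|nccl)")
args = parser.parse_args()

# If --distributed_backend is omitted, backend ends up as None here, which
# would explain the error torch raises in my runs.
dist.init_process_group(backend=args.distributed_backend,
                        init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=4)
```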