Open bradmiro opened 3 years ago
@gogasca any idea? guess we need to upgrade the PyTorch example script. I don't see TonY or GCP being the issue here.
Great observation, and I believe you are correct: here it shows the tcp backend being used. Adding --backend gloo or --backend nccl (on a GPU cluster) to --task_params changed the error message, so it looks like the example just needs a refresh.
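For anyone picking up the refresh, here is a minimal sketch of what the backend flag could look like in an updated example script. The parser layout, the gloo default, and the helper names are assumptions for illustration and are not taken from the current mnist_distributed.py; the idea is only that tcp is no longer offered, so a stale value fails at argument parsing instead of deep inside PyTorch.

```python
import argparse

# Backends named in this thread; "tcp" is deliberately absent.
SUPPORTED_BACKENDS = ("gloo", "nccl", "mpi")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser for a refreshed example script."""
    parser = argparse.ArgumentParser(description="Distributed MNIST (sketch)")
    parser.add_argument(
        "--backend",
        choices=SUPPORTED_BACKENDS,
        default="gloo",  # assumption: a default that also works on CPU-only nodes
        help="torch.distributed backend to pass to init_process_group",
    )
    return parser


def parse_backend(argv):
    """Return the backend selected on the command line."""
    return build_parser().parse_args(argv).backend
```

With this in place, passing --backend nccl through --task_params selects NCCL, and an unsupported value such as tcp is rejected by argparse before the job tries to initialise.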
@bradmiro would you mind contributing a patch to fix that?
Sure, I can look into this.
@oliverhu are there special considerations re: TonY for use with PyTorch? The error seems to be in properly configuring init_process_group.
The current code is this: https://github.com/linkedin/TonY/blob/master/tony-examples/mnist-pytorch/mnist_distributed.py#L184-L189
Changing the backend to gloo throws "connection refused" errors at runtime.
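Since "connection refused" generally means nothing was listening at the address a process tried to reach, it may help to print exactly what each task passes to init_process_group. The sketch below collects those arguments in one place. It assumes the launcher exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE (the names PyTorch's env:// method reads); TonY may export different names, so treat these as placeholders. Note that the tcp:// URL here is a rendezvous init method, which is separate from the tcp backend discussed above.

```python
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def build_init_kwargs(backend, env=None):
    """Collect init_process_group arguments from environment variables.

    Assumption: the launcher exports the variables in REQUIRED_VARS.
    Adjust the names if the launcher uses different ones.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "backend": backend,
        "init_method": "tcp://{}:{}".format(env["MASTER_ADDR"], env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def init_distributed(backend):
    """Initialise the process group; requires PyTorch to be installed."""
    # Imported lazily so build_init_kwargs can be checked without PyTorch.
    import torch.distributed as dist

    kwargs = build_init_kwargs(backend)
    print("init_process_group arguments:", kwargs)
    dist.init_process_group(**kwargs)
```

If the printed address or port differs between tasks, or points at a host where rank 0 is not running, that would be consistent with the refused connections.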
That should not matter, all those backends should work 🤔 Have you tried other backends?
The mpi runtime does not work without an installation, and we don't include this by default in the Dataproc image.
The nccl backend does not seem to work, but I am also testing on a cluster that only has GPUs allocated to workers, not the master. The TensorFlow job seemed to work with GPUs just attached to master, but I am creating a fresh cluster with a GPU attached to the master node as well.
nccl error with GPUs attached to all machines: RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
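The "invalid usage" message on its own says little. One way to narrow it down is to turn on NCCL's own logging and make sure each process is pinned to a distinct GPU before the process group is created. The sketch below does both. NCCL_DEBUG is a standard NCCL environment variable; the helper names and the rank-to-GPU mapping are assumptions, and whether GPU sharing is actually the cause here is unconfirmed.

```python
import os


def enable_nccl_debug(env=None):
    """Ask NCCL for verbose logs; call before the process group is created."""
    env = os.environ if env is None else env
    # setdefault keeps any value the user has already chosen.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def device_for_rank(rank, gpus_per_node):
    """Map a global rank to a local GPU index.

    Assumption: ranks are assigned to nodes in contiguous blocks, so
    rank % gpus_per_node gives each process on a node its own GPU.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return rank % gpus_per_node
```

In a training script this would be followed by torch.cuda.set_device(device_for_rank(rank, gpus_per_node)) ahead of init_process_group, with the extra NCCL output showing up in the task logs.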
This might be a PyTorch thing; I can look into it more, probably early next week. Unsure about gloo as well.
mpi won't work because it requires SSH across workers, which is not supported by default in Hadoop distributions.
nccl and gloo should work though, at a glance. We use TensorFlow, so not much insight there, but anything not using MPI should work.
Hi there, I'm working on an update for the TonY installation script for GCP Dataproc. While I have been able to (locally) successfully update TensorFlow, I cannot seem to get the PyTorch example working. It does not work on 0.4 (the most recent version you explicitly mention supporting) or 1.7.1, the most recent release. I get the following error:
Latest attempt: PyTorch 1.7.1 torchvision 0.8.2 TonY 0.4.0 Dataproc 2.0 (Hadoop 3.2.1)
Config:
Cluster has 1 master, 2 workers and 2 NVIDIA Tesla T4s. However, every configuration I have tried up to this point results in the same error. Any advice would be greatly appreciated!