tony-framework / TonY

TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
https://tony-project.ai

PyTorch Support #493

Open bradmiro opened 3 years ago

bradmiro commented 3 years ago

Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. While I have been able to successfully update the TensorFlow example locally, I cannot seem to get the PyTorch example working. It does not work on 0.4 (the most recent version you explicitly mention supporting) or on 1.7.1, the most recent release. I get the following error:

  File "mnist_distributed.py", line 230, in <module>
    main()
  File "mnist_distributed.py", line 225, in main
    init_process(args)
  File "mnist_distributed.py", line 185, in init_process
    distributed.init_process_group(
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
    backend = Backend(backend)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.

Latest attempt: PyTorch 1.7.1, torchvision 0.8.2, TonY 0.4.0, Dataproc 2.0 (Hadoop 3.2.1)

Config:

<configuration>
 <property>
  <name>tony.application.name</name>
  <value>PyTorch</value>
 </property>
 <property>
  <name>tony.application.security.enabled</name>
  <value>false</value>
 </property>
 <property>
  <name>tony.worker.instances</name>
  <value>2</value>
 </property>
 <property>
  <name>tony.worker.memory</name>
  <value>4g</value>
 </property>
 <property>
  <name>tony.ps.instances</name>
  <value>1</value>
 </property>
 <property>
  <name>tony.ps.memory</name>
  <value>2g</value>
 </property>
 <property>
  <name>tony.application.framework</name>
  <value>pytorch</value>
 </property>
 <property>
  <name>tony.worker.gpus</name>
  <value>1</value>
 </property>
</configuration>

The cluster has 1 master, 2 workers, and 2 NVIDIA Tesla T4s. However, every combination of configuration I have tried up to this point results in the same error. Any advice would be greatly appreciated!

oliverhu commented 3 years ago

@gogasca any idea? I guess we need to update the PyTorch example script. I don't see TonY or GCP being the issue here.

bradmiro commented 3 years ago

Great observation, and I believe you are correct: the example shows the tcp backend being used. Adding --backend gloo or --backend nccl (on a GPU cluster) to --task_params changed the error message, so it looks like the example just needs a refresh (rough sketch below).
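For reference, a minimal sketch of what the refreshed argument handling could look like. It assumes the script keeps its existing --backend flag (passing it through --task_params already changes the error, so the flag exists) and simply defaults it to gloo instead of tcp; the env:// rendezvous variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) are standard PyTorch names, and whether the launcher exports all of them for this example is an assumption here, not something the shipped script is confirmed to do.

# Sketch only: the default backend and the env-var names are assumptions,
# not the code currently in mnist_distributed.py.
import argparse
import os

import torch.distributed as distributed


def init_process(args):
    # gloo for CPU clusters, nccl when every worker has a GPU;
    # "tcp" is no longer a valid backend in recent PyTorch releases.
    distributed.init_process_group(
        backend=args.backend,
        init_method="env://",  # relies on MASTER_ADDR/MASTER_PORT being set for every process
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="gloo", choices=["gloo", "nccl", "mpi"])
    args = parser.parse_args()
    init_process(args)


if __name__ == "__main__":
    main()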

oliverhu commented 3 years ago

@bradmiro would you mind contributing a patch to fix that?

bradmiro commented 3 years ago

Sure, I can look into this.

bradmiro commented 3 years ago

@oliverhu are there any special considerations for TonY when using it with PyTorch? The issue seems to come down to properly configuring init_process_group.

The current code is this: https://github.com/linkedin/TonY/blob/master/tony-examples/mnist-pytorch/mnist_distributed.py#L184-L189

Changing the backend to gloo throws "connection refused" errors at runtime.
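If it helps isolate the failure: "connection refused" with gloo usually points at the rendezvous rather than the backend itself, i.e. every process has to agree on a reachable MASTER_ADDR/MASTER_PORT before init_process_group can complete. Here is a minimal local smoke test (port and world size are just placeholders) that exercises gloo entirely outside of TonY/YARN, to separate PyTorch problems from cluster problems:

# Local two-process smoke test for the gloo backend, independent of TonY.
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # must be reachable from every rank
    os.environ["MASTER_PORT"] = "29500"      # any free port agreed on by all ranks
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)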

oliverhu commented 3 years ago

That should not matter; all of those backends should work 🤔 Have you tried other backends?

bradmiro commented 3 years ago

The mpi backend does not work without an MPI installation, and we don't include one by default in the Dataproc image.

The nccl backend does not seem to work either, but I am also testing on a cluster that only has GPUs allocated to the workers, not the master. The TensorFlow job seemed to work with GPUs attached just to the master, but I am creating a fresh cluster with a GPU attached to the master node as well.

bradmiro commented 3 years ago

nccl error with GPUs attached to all machines:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

This might be a PyTorch thing; I can probably look into it more early next week. Unsure about gloo as well.
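One common cause of NCCL "invalid usage" on multi-process nodes is two ranks ending up on the same CUDA device. I'm not sure that's what is happening here, but pinning each rank to its local GPU before init is cheap to try. A sketch, where LOCAL_RANK is an assumed env var (substitute however the launcher communicates the per-node rank), and MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK are still expected in the environment:

# Sketch: pin each process to one GPU before initializing NCCL.
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # assumed env var, per-node rank
torch.cuda.set_device(local_rank)                   # each rank gets its own device

dist.init_process_group(backend="nccl", init_method="env://")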

oliverhu commented 3 years ago

mpi won't work because it requires SSH across workers, which is not supported by default in Hadoop distributions.

nccl and gloo should work, though, at a glance. We use TensorFlow, so I don't have much insight there, but anything not using MPI should work.