Closed — cjolivier01 closed this issue 4 years ago
I have tested multi-gpu (single host) with pytorch/xla in Google Cloud with 4 V100 and it is working. But it should work even in multi-host multi-gpu mode.
We are about to make pytorch/xla GPU more widely available by creating wheels and Docker images with GPU support enabled. There are 5-6 C++ tests failing, due to some 3D convolutions not being supported and a corner-case issue with scatter. But the PyTorch tests are passing, and that's big coverage.
Hmm, with a little fiddling I got the mesh service running locally, but then:
tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: session->session()->Run( session_work->feed_inputs, session_work->outputs_handles, &outputs) == ::tensorflow::Status::OK() (Invalid argument: The ClusterSpec names the job and task index to be the same names that were provided when the server booted. This is currently not allowed. Job: wse_worker_1, task index: 0 vs. OK
I dumped some random notes into a document (PDF attached) when I did this. As I said, we will be building official wheels and Docker images soon. But if your setup has been built correctly, you should be able to just run:
GPU_NUM_DEVICES=4 python test/test_train_mp_mnist.py
Thanks so much for this! This is fantastic!
This is mostly XLA design beauty. Single high level language, multiple devices.
we will be building official wheels and Dockers soon.
@dlibenzi will it be a single wheel for both TPU and GPU? And to run on TPU or GPU, do we just need to change the "G" to a "T" and vice versa?
GPU_NUM_DEVICES=8 python test/test_train_mp_mnist.py
# or
TPU_NUM_DEVICES=8 python test/test_train_mp_mnist.py
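The launch commands above differ only in the environment variable name. A minimal sketch of how you might wrap that in a launcher (note: `device_env` is a hypothetical helper written for this example, not part of torch_xla):

```python
import os
import subprocess

def device_env(device_type: str, num_devices: int) -> dict:
    """Build the environment for launching a training script on the
    given backend. Only the variable name changes between backends:
    GPU_NUM_DEVICES for GPU, TPU_NUM_DEVICES for TPU."""
    if device_type not in ("GPU", "TPU"):
        raise ValueError(f"unsupported device type: {device_type}")
    env = dict(os.environ)
    env[f"{device_type}_NUM_DEVICES"] = str(num_devices)
    return env

# Usage (equivalent to GPU_NUM_DEVICES=8 python test/test_train_mp_mnist.py):
# subprocess.run(["python", "test/test_train_mp_mnist.py"],
#                env=device_env("GPU", 8))
```

Switching to TPU is then just `device_env("TPU", 8)` with the same script, which is the point of shipping a single wheel that supports both.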
Yes, the idea is to have a single wheel and single Docker image, which supports both.
❓ Questions and Help
I was perusing the diffs between my 1.5 branch and master, and it seems that along with NCCL support, GPU is becoming more of a first-class citizen in the code base? Does it now, or will it, do multi-node (not just multi-process) master/slave training, with or without the mesh service? Maybe it always did and I just didn't know how to do it (admittedly, I spend most of my time working on a particularly simple "happy path").
by the way, the code is beautiful.