Closed — cjolivier01 closed this issue 4 years ago
I have tested multi-gpu (single host) with pytorch/xla in Google Cloud with 4 V100 and it is working. But it should work even in multi-host multi-gpu mode.
We are about to make pytorch/xla GPU more widely available by creating wheels and Docker images with GPU support enabled. There are 5-6 C++ tests failing, due to some 3D convolutions not being supported and a corner-case issue with scatter. But the PyTorch tests are passing, and that's big coverage.
Hmm, with a little fiddling I got the mesh service running locally, but then:
tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: session->session()->Run( session_work->feed_inputs, session_work->outputs_handles, &outputs) == ::tensorflow::Status::OK() (Invalid argument: The ClusterSpec names the job and task index to be the same names that were provided when the server booted. This is currently not allowed. Job: wse_worker_1, task index: 0 vs. OK
I dumped some random notes into a document (PDF attached) when I did this. As I said, we will be building official wheels and Docker images soon. But if your setup has been built correctly, you should be able to just run:
GPU_NUM_DEVICES=4 python test/test_train_mp_mnist.py
Thanks so much for this! This is fantastic!
This is mostly XLA design beauty. Single high level language, multiple devices.
we will be building official wheels and Dockers soon.
@dlibenzi will it be a single wheel for both TPU and GPU? And to run on TPU or GPU, do we just need to change the "G" to a "T" and vice versa?
GPU_NUM_DEVICES=8 python test/test_train_mp_mnist.py
# or
TPU_NUM_DEVICES=8 python test/test_train_mp_mnist.py
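The launch commands above differ only in the environment variable name. A minimal sketch of how you might wrap that in a launcher (note: `device_env` is a hypothetical helper written for this example, not part of torch_xla):

```python
import os
import subprocess

def device_env(device_type: str, num_devices: int) -> dict:
    """Build the environment for launching a training script on the
    given backend. Only the variable name changes between backends:
    GPU_NUM_DEVICES for GPU, TPU_NUM_DEVICES for TPU."""
    if device_type not in ("GPU", "TPU"):
        raise ValueError(f"unsupported device type: {device_type}")
    env = dict(os.environ)
    env[f"{device_type}_NUM_DEVICES"] = str(num_devices)
    return env

# Usage (equivalent to GPU_NUM_DEVICES=8 python test/test_train_mp_mnist.py):
# subprocess.run(["python", "test/test_train_mp_mnist.py"],
#                env=device_env("GPU", 8))
```

Switching to TPU is then just `device_env("TPU", 8)` with the same script, which is the point of shipping a single wheel that supports both.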
Yes, the idea is to have a single wheel and single Docker image, which supports both.
❓ Questions and Help
I was perusing the diffs between my 1.5 branch and master, and it seems that along with NCCL support, GPU is becoming more of a first-class citizen in the code base? Does it now, or will it, do multi-node (not just multi-process) master/slave training, with or without the mesh service? Maybe it always did and I just didn't know how to do it (admittedly, I spend most of my time working on a particularly simple "happy path").
by the way, the code is beautiful.