pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

XLA Multi-Node Multi GPU training #3162

Closed codeislife99 closed 2 years ago

codeislife99 commented 2 years ago

🐛 Bug

Running an XLA multi-GPU, multi-node configuration fails with an XRT OOM for all models/configurations (including MNIST).

To Reproduce

Steps to reproduce the behavior:

  1. Use the latest test_train_mp_mnist.py script in the test folder.
  2. Run multi-node, multi-GPU training on any V100 instances. The run fails with cuDNN/cuBLAS initialization errors and an XRT OOM, for example:
[1,2]<stderr>:Convolution performance may be suboptimal.
[1,7]<stderr>:2021-10-12 20:59:07.286142: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,2]<stderr>:2021-10-12 20:59:07.290330: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,2]<stderr>:2021-10-12 20:59:07.363398: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.31MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.363443: W tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:729] Failed to determine best cudnn convolution algorithm: Resource exhausted: Out of memory while trying to allocate 17104896 bytes.
[1,2]<stderr>:
[1,2]<stderr>:Convolution performance may be suboptimal.
[1,3]<stderr>:2021-10-12 20:59:07.367537: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,3]<stderr>:2021-10-12 20:59:07.382977: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,3]<stderr>:2021-10-12 20:59:07.410488: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,1]<stderr>:2021-10-12 20:59:07.449489: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,1]<stderr>:2021-10-12 20:59:07.449640: W tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:729] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for %custom-call.1 = (f16[128,20,8,8]{1,3,2,0}, u8[0]{0}) custom-call(f16[128,10,12,12]{1,3,2,0} %convert.55, f16[20,10,5,5]{1,3,2,0} %convert.83), window={size=5x5}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm. 
[1,1]<stderr>:
[1,1]<stderr>:Convolution performance may be suboptimal.
[1,2]<stderr>:2021-10-12 20:59:07.451044: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.35MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.451089: W tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:729] Failed to determine best cudnn convolution algorithm: Resource exhausted: Out of memory while trying to allocate 17145856 bytes.
[1,2]<stderr>:
[1,2]<stderr>:Convolution performance may be suboptimal.
[1,7]<stderr>:2021-10-12 20:59:07.453232: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,1]<stderr>:2021-10-12 20:59:07.477870: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,6]<stderr>:2021-10-12 20:59:07.488471: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,1]<stderr>:2021-10-12 20:59:07.518287: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,1]<stderr>:2021-10-12 20:59:07.518425: W tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:729] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for %custom-call.1 = (f16[128,20,8,8]{1,3,2,0}, u8[0]{0}) custom-call(f16[128,10,12,12]{1,3,2,0} %convert.55, f16[20,10,5,5]{1,3,2,0} %convert.83), window={size=5x5}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm. 
[1,1]<stderr>:
[1,1]<stderr>:Convolution performance may be suboptimal.
[1,0]<stderr>:2021-10-12 20:59:07.528364: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,2]<stderr>:2021-10-12 20:59:07.540070: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.01MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.540141: W tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:729] Failed to determine best cudnn convolution algorithm: Resource exhausted: Out of memory while trying to allocate 16787216 bytes.
[1,2]<stderr>:
[1,2]<stderr>:Convolution performance may be suboptimal.
[1,1]<stderr>:2021-10-12 20:59:07.541127: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,1]<stderr>:2021-10-12 20:59:07.560350: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,6]<stderr>:2021-10-12 20:59:07.594085: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,1]<stderr>:2021-10-12 20:59:07.607617: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
[1,0]<stderr>:2021-10-12 20:59:07.624820: E tensorflow/stream_executor/cuda/cuda_dnn.cc:374] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
[1,2]<stderr>:2021-10-12 20:59:07.627340: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.627384: W tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:729] Failed to determine best cudnn convolution algorithm: Resource exhausted: Out of memory while trying to allocate 16777716 bytes.
[1,2]<stderr>:
[1,2]<stderr>:Convolution performance may be suboptimal.
[1,2]<stderr>:2021-10-12 20:59:07.711738: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.01MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.712823: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at xrt_compile_ops.cc:215 : Resource exhausted: Out of memory while trying to allocate 16791552 bytes.
[1,2]<stderr>:2021-10-12 20:59:07.734175: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] StackTrace:
[1,2]<stderr>:2021-10-12 20:59:07.734201: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** Begin stack trace ***
[1,2]<stderr>:2021-10-12 20:59:07.734208: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011tensorflow::CurrentStackTrace()
[1,2]<stderr>:2021-10-12 20:59:07.734215: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_20210324::Span<xla::XlaComputation const* const>, absl::lts_20210324::Span<xla::Shape const* const>)
[1,2]<stderr>:2021-10-12 20:59:07.734222: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011xla::XrtComputationClient::CheckCompileStatus(tensorflow::Status const&, std::vector<xla::ComputationClient::CompileInstance, std::allocator<xla::ComputationClient::CompileInstance> > const&, xla::XrtComputationClient::SessionWork const&)
[1,2]<stderr>:2021-10-12 20:59:07.734228: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011
[1,2]<stderr>:2021-10-12 20:59:07.734234: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011xla::util::MultiWait::Complete(std::function<void ()> const&)
[1,2]<stderr>:2021-10-12 20:59:07.734240: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011
[1,2]<stderr>:2021-10-12 20:59:07.734246: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011
[1,2]<stderr>:2021-10-12 20:59:07.734251: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011
[1,2]<stderr>:2021-10-12 20:59:07.734257: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011clone
[1,2]<stderr>:2021-10-12 20:59:07.734263: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** End stack trace ***
[1,2]<stderr>:2021-10-12 20:59:07.734268: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
[1,2]<stderr>:2021-10-12 20:59:07.734274: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Status: Resource exhausted: From /job:localservice/replica:0/task:4:
[1,2]<stderr>:2021-10-12 20:59:07.734280: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 2 root error(s) found.
[1,2]<stderr>:2021-10-12 20:59:07.734286: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (0) Resource exhausted: Out of memory while trying to allocate 16791552 bytes.
[1,2]<stderr>:2021-10-12 20:59:07.734291: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011 [[{{node XRTCompile}}]]
[1,2]<stderr>:2021-10-12 20:59:07.734298: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[1,2]<stderr>:2021-10-12 20:59:07.734303: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
[1,2]<stderr>:2021-10-12 20:59:07.734309: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (1) Resource exhausted: Out of memory while trying to allocate 16791552 bytes.
[1,2]<stderr>:2021-10-12 20:59:07.734315: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011 [[{{node XRTCompile}}]]
[1,2]<stderr>:2021-10-12 20:59:07.734320: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[1,2]<stderr>:2021-10-12 20:59:07.734326: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
[1,2]<stderr>:2021-10-12 20:59:07.734331: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] #011 [[XRTCompile_G3]]
[1,2]<stderr>:2021-10-12 20:59:07.734337: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[1,2]<stderr>:2021-10-12 20:59:07.734343: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
[1,2]<stderr>:2021-10-12 20:59:07.734349: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 successful operations.
[1,2]<stderr>:2021-10-12 20:59:07.734354: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 derived errors ignored.
[1,2]<stderr>:2021-10-12 20:59:07.734359: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Recent warning and error logs:
[1,2]<stderr>:2021-10-12 20:59:07.734365: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   Failed to determine best cudnn convolution algorithm: Resource exhausted: Out of memory while trying to allocate 16787216 bytes.
[1,2]<stderr>:2021-10-12 20:59:07.734371: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
[1,2]<stderr>:2021-10-12 20:59:07.734376: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Convolution performance may be suboptimal.
[1,2]<stderr>:2021-10-12 20:59:07.734382: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.734388: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   Failed to determine best cudnn convolution algorithm: Resource exhausted: Out of memory while trying to allocate 16777716 bytes.
[1,2]<stderr>:2021-10-12 20:59:07.734393: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
[1,2]<stderr>:2021-10-12 20:59:07.734398: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Convolution performance may be suboptimal.
[1,2]<stderr>:2021-10-12 20:59:07.734404: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.01MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
[1,2]<stderr>:2021-10-12 20:59:07.734410: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   OP_REQUIRES failed at xrt_compile_ops.cc:215 : Resource exhausted: Out of memory while trying to allocate 16791552 bytes.
[1,2]<stderr>:Exception in device=GPU:4: Resource exhausted: From /job:localservice/replica:0/task:4:
[1,2]<stderr>:2 root error(s) found.
[1,2]<stderr>:  (0) Resource exhausted: Out of memory while trying to allocate 16791552 bytes.
[1,2]<stderr>:#011 [[{{node XRTCompile}}]]
[1,2]<stderr>:Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[1,2]<stderr>:
[1,2]<stderr>:  (1) Resource exhausted: Out of memory while trying to allocate 16791552 bytes.
[1,2]<stderr>:#011 [[{{node XRTCompile}}]]
[1,2]<stderr>:Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

Environment

Additional context

The BFC allocator allocates a large chunk of memory up front. Is it possible that multi-node training adds overhead on top of that which exceeds the GPU memory capacity? One way to verify this hypothesis would be to reduce the size of that initial allocation. Other thoughts and ideas are appreciated.

codeislife99 commented 2 years ago

@yaochengji - @JackCaoG mentioned that you had worked on this issue previously. Do you have any insights on this?

yaochengji commented 2 years ago

Hi @codeislife99, I didn't use xmp.spawn directly. What I did was set up the torch/xla environment variables manually, such as MP_DEVICE and LOCAL_WORKER, and call Python's multiprocess launch function myself.

For your problem, I guess it is because each process actually ran on the same GPU. To confirm, you could export TF_FORCE_GPU_ALLOW_GROWTH=1 and try the mnist example again.
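For illustration, a minimal sketch of that kind of manual setup (the worker count and helper names here are assumptions for the example, not the actual production script; the MP_DEVICE / LOCAL_WORKER values mirror what _setup_gpu_worker sets):

import os
import torch.multiprocessing as mp
import torch_xla.core.xla_env_vars as xenv

GPUS_PER_NODE = 4  # assumption for the sketch; match your node
HOST_ORDINAL = int(os.environ.get('XRT_HOST_ORDINAL', '0'))

def _worker(local_rank):
  # Set the per-process variables that xmp.spawn's _setup_gpu_worker would
  # normally set: the global device this process owns and its worker identity.
  gindex = HOST_ORDINAL * GPUS_PER_NODE + local_rank
  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
  os.environ[xenv.LOCAL_WORKER] = 'localservice:{}'.format(gindex)

  # Import torch_xla only after the env is set so the XRT client sees it.
  import torch_xla.core.xla_model as xm
  device = xm.xla_device()
  # ... build the model/loader and train on `device` here ...

if __name__ == '__main__':
  # XRT_WORKERS / XRT_DEVICE_MAP / XRT_SHARD_WORLD_SIZE / XRT_HOST_ORDINAL are
  # assumed to be exported already (see the env examples later in this thread).
  mp.spawn(_worker, nprocs=GPUS_PER_NODE)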

codeislife99 commented 2 years ago

Yeah, with that environment variable the MNIST example ran successfully. So this means all the processes were on the same GPU and weren't actually using multiple GPUs. It's interesting because when I use the same script without the env variable for single-node multi-GPU, it successfully uses multiple GPUs (I verified this via GPU memory usage) and works without problems; only when I go to the multi-node setup do the problems appear. Do you have an example script that I can follow to make my multi-node setup work?
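A quick per-process sanity check for the "stacked on one GPU" diagnosis (a sketch; it assumes xmp.spawn has set its usual per-process variables) is to have every process print its ordinal and assigned device and confirm they differ:

import os
import torch_xla.core.xla_env_vars as xenv
import torch_xla.core.xla_model as xm

# Each spawned process should report a distinct global ordinal and MP_DEVICE;
# if they all show the same GPU, the processes are stacked on one device.
print('ordinal:', xm.get_ordinal(),
      'mp_device:', os.environ.get(xenv.MP_DEVICE),
      'xla_device:', xm.xla_device())

nvidia-smi on each node should then also show one process per GPU.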

yaochengji commented 2 years ago

Oh, @codeislife99, sorry that I forgot to reply to you. The env setup script is integrated into our production code and cannot easily be shared externally. What I'd suggest is to read torch_xla/distributed/xla_multiprocessing.py, where xmp.spawn wraps the env setup and process-launching code, and adapt it for distributed GPU runs.

codeislife99 commented 2 years ago

@yaochengji - So I have been trying to do that over the last few days. I have been using torch.distributed.launch to run the distributed processes, and I added dist.init_process_group('nccl', init_method='env://', rank=gindex, world_size=args[0].total_num_gpus) to _start_fn. However, while the distributed setup now works, I am still not sure whether the gradients are synchronized across the two nodes. Do you mind giving me some pointers or helping me out in this direction? I would really appreciate it. Thanks.
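One rough way to check that synchronization (a sketch, not part of the original setup; it assumes the model already lives on the XLA device and that every replica calls it, since the all-reduce is a collective): after an optimizer step all replicas should hold identical weights, so a per-replica parameter checksum times the world size should match the all-reduced checksum.

import torch
import torch_xla.core.xla_model as xm

def params_in_sync(model):
  # Crude checksum: sum of all parameters on this replica.
  local = sum(p.detach().sum() for p in model.parameters())
  # Sum of that checksum over every replica (collective; all replicas must call).
  total = xm.all_reduce(xm.REDUCE_SUM, local.clone())
  xm.mark_step()
  # If the replicas agree, total == local * world_size (up to float noise).
  return torch.allclose(total.cpu(),
                        (local * xm.xrt_world_size()).cpu(), rtol=1e-3)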

yaochengji commented 2 years ago

@codeislife99 you could try the patch below, it should work.

---
 third_party/xla_client/xrt_computation_client.cc | 15 ---------------
 torch_xla/distributed/xla_multiprocessing.py     | 11 +++--------
 2 files changed, 3 insertions(+), 23 deletions(-)

diff --git a/third_party/xla_client/xrt_computation_client.cc b/third_party/xla_client/xrt_computation_client.cc
index 0d65f3f3..d9a28740 100644
--- a/third_party/xla_client/xrt_computation_client.cc
+++ b/third_party/xla_client/xrt_computation_client.cc
@@ -1831,21 +1831,6 @@ tensorflow::ConfigProto XrtComputationClient::CreateConfigProto(
     const Options& options) {
   static const std::string* const grpc_proto = new std::string("grpc://");
   tensorflow::ConfigProto config;
-  if (options.workers_map.size() > 1) {
-    tensorflow::ClusterDef* cluster_def = config.mutable_cluster_def();
-    std::map<std::string, tensorflow::JobDef*> jobs;
-    for (auto& worker_target : options.workers_map) {
-      auto it = jobs.find(worker_target.first.name);
-      if (it == jobs.end()) {
-        tensorflow::JobDef* job = cluster_def->add_job();
-        job->set_name(worker_target.first.name);
-        it = jobs.emplace(worker_target.first.name, job).first;
-      }
-      tensorflow::JobDef* job = it->second;
-      (*job->mutable_tasks())[worker_target.first.task_no] =
-          StripPrefix(worker_target.second, *grpc_proto);
-    }
-  }
   return config;
 }

diff --git a/torch_xla/distributed/xla_multiprocessing.py b/torch_xla/distributed/xla_multiprocessing.py
index e67d92d8..aae4e815 100644
--- a/torch_xla/distributed/xla_multiprocessing.py
+++ b/torch_xla/distributed/xla_multiprocessing.py
@@ -207,17 +207,12 @@ def _pre_fork_setup(num_devices):
         socket.getfqdn(),
         xu.get_free_tcp_ports()[0])
   if dev_kind == 'GPU':
-    _setup_workers(num_devices)
-    _create_gpu_devices(num_devices)
-  elif dev_kind == 'CPU':
-    _pre_fork_cpu_setup(num_devices)
-  _pre_fork_setup_torch_distributed()
+    pass
   return PreForkConfig(dev_kind=dev_kind, num_devices=num_devices)

-def _setup_gpu_worker(index, gindex):
-  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(
-      _get_mp_device_ordinal(index, gindex))
+def _setup_gpu_worker(index, gindex, pf_cfg):
+  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
   os.environ[xenv.LOCAL_WORKER] = '{}:{}'.format(_LOCAL_WORKER, gindex)
   # Every process is restricted to 1 GPU device, which in such process will be
   # named XLA_GPU:0.
--
codeislife99 commented 2 years ago

Hey @yaochengji, thanks for your help. I have a small question: are you using torch.distributed.launch or torchrun for multi-node training? I ask because whenever I use either of those, it forces me to call dist.init_process_group(...) or else throws an error saying the process group was not found. Another question is about the model wrapper model = nn.parallel.DistributedDataParallel(model, device_ids=[<device_id>]): when I wrap the model this way, it says Tensors must be CUDA or dense. Are you instead using XLA's data-parallel wrapper as outlined in the test https://github.com/pytorch/xla/blob/master/test/test_operations.py#L609? That wrapper seems similar to torch's DataParallel, which is only for single-node multi-GPU.

Do I have to create my own nn.parallel.DistributedDataParallel wrapper for multi-node multi-GPU? cc: @JackCaoG

codeislife99 commented 2 years ago

When I try to comment out the TORCH_CHECK for "Tensors must be CUDA or dense" and the other checks (since we are running on XLA), I get "XLA tensors do not have storage". So I am not sure the DistributedDataParallel module is even usable with torch_xla. If not, is there any replacement for it within XLA?
It seems much of this is because native PT tensors are assumed throughout reducer.cpp. These would need to be bridged to XLATensors, and the DDP module in native PT would potentially need quite a bit of rewriting to be compatible with XLA. Has anyone attempted this before, or am I wrong here?

yaochengji commented 2 years ago

> Hey @yaochengji, thanks for your help. I have a small question: are you using torch.distributed.launch or torchrun for multi-node training? I ask because whenever I use either of those, it forces me to call dist.init_process_group(...) or else throws an error saying the process group was not found. Another question is about the model wrapper model = nn.parallel.DistributedDataParallel(model, device_ids=[<device_id>]): when I wrap the model this way, it says Tensors must be CUDA or dense. Are you instead using XLA's data-parallel wrapper as outlined in the test master/test/test_operations.py#L609? That wrapper seems similar to torch's DataParallel, which is only for single-node multi-GPU.
>
> Do I have to create my own nn.parallel.DistributedDataParallel wrapper for multi-node multi-GPU? cc: @JackCaoG

Hi @codeislife99, I didn't use torch.distributed.launch or DistributedDataParallel, and I didn't use much of the pytorch/xla distributed wrappers.
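For reference, in a stock test_train_mp_mnist.py-style loop the cross-replica gradient sync comes from xm.optimizer_step(optimizer), which all-reduces the gradients before stepping, rather than from DistributedDataParallel. A minimal sketch of that pattern (the model/optimizer choices are illustrative; the loader is assumed to already place batches on the XLA device, e.g. via a parallel/MpDeviceLoader wrapper):

import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm

def train_loop_fn(model, loader):
  # `model` is assumed to already be on the XLA device and `loader` to yield
  # batches placed on that device.
  loss_fn = nn.NLLLoss()
  optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
  model.train()
  for data, target in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    # All-reduces gradients across replicas, then calls optimizer.step().
    xm.optimizer_step(optimizer)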

cicirori commented 2 years ago

> @codeislife99 you could try the patch below, it should work. [same xrt_computation_client.cc and xla_multiprocessing.py patch as posted above]

Hi, @yaochengji. Thanks for your code. I arranged the environment variables manually to launch the training processes, and it works well without xmp.spawn, but I did not make the C++ change above. Why did you delete that code? Is there something I didn't notice?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mika-gu commented 2 years ago

@codeislife99 Have you solved this problem? I am hitting the same issue. After setting XRT_DEVICE_MAP and XRT_LOCAL_WORKER and editing xla_client/xrt_computation_client.cc and xla_multiprocessing.py, I get the following:

error log: RuntimeError: tensorflow/compiler/xla/xla_client/meshservice.cc:329 : Check failed: impl->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))

codeislife99 commented 2 years ago

Yes, I am able to run multi-GPU and multi-node training successfully. There is no good documentation around it unfortunately, so I can understand your trouble. You need to make changes to certain environment variables. Right now I am a bit busy, but I will share the env variable changes here after that.

mika-gu commented 2 years ago

> Yes, I am able to run multi-GPU and multi-node training successfully. There is no good documentation around it unfortunately, so I can understand your trouble. You need to make changes to certain environment variables. Right now I am a bit busy, but I will share the env variable changes here after that.

@codeislife99 Thank you very much for sharing. Looking forward to your reply.

codeislife99 commented 2 years ago

Assuming you have two nodes with 4 GPUs each: I first apply the following patch to xla_multiprocessing.py and then configure a number of environment variables.

--- a/torch_xla/distributed/xla_multiprocessing.py
+++ b/torch_xla/distributed/xla_multiprocessing.py
@@ -154,13 +154,13 @@ def _setup_workers(num_devices):
         wcfg), 'World size ({}) must match the configured workers ({})'.format(
             world_size, len(wcfg))
     for h, worker in enumerate(wcfg):
-      m = re.match(r'(.*):(\d+)$', worker.host_port)
+      m = re.match(r'(.*):(\d+)$', wcfg[worker].host_port)
       if not m:
         raise RuntimeError('Bad worker HOST:PORT format: {}'.format(
             worker.host_port))
       for i in range(0, num_devices):
         gindex = h * num_devices + i
-        workers.append('{}:{};grpc://{}:{}'.format(worker.worker_name, gindex,
+        workers.append('{}:{};grpc://{}:{}'.format(wcfg[worker].worker_name, gindex,
                                                    m.group(1),
                                                    int(m.group(2)) + i))
   else:
@@ -216,8 +216,7 @@ def _pre_fork_setup(num_devices):

 def _setup_gpu_worker(index, gindex):
-  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(
-      _get_mp_device_ordinal(index, gindex))
+  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
   os.environ[xenv.LOCAL_WORKER] = '{}:{}'.format(_LOCAL_WORKER, gindex)
   # Every process is restricted to 1 GPU device, which in such process will be
   # named XLA_GPU:0.

On both nodes:

export GPU_NUM_DEVICES=4
export XRT_SHARD_WORLD_SIZE=2
export XRT_MESH_SERVICE_ADDRESS="ip-172-31-31-102.us-west-2.compute.internal:53957"
export XRT_WORKERS="localservice:0;34.219.116.56:56747|localservice:1;34.222.146.81:56748"

On the first node: export XRT_HOST_ORDINAL=0
On the second node: export XRT_HOST_ORDINAL=1

python3 test_train_mp_mnist.py --num_worker 0
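(Aside: XRT_WORKERS is just name:task;host:port entries joined with |, one per node, which the patched _setup_workers then expands to one worker per GPU with consecutive ports. A hypothetical helper that builds the export above from a host list, for readability:)

import os

def build_xrt_workers(hosts, base_port):
  # hosts: one reachable IP/hostname per node; the task number is the node index.
  return '|'.join('localservice:{};{}:{}'.format(i, host, base_port + i)
                  for i, host in enumerate(hosts))

os.environ['XRT_WORKERS'] = build_xrt_workers(
    ['34.219.116.56', '34.222.146.81'], 56747)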

mika-gu commented 2 years ago

@codeislife99 Got it!

mika-gu commented 2 years ago

Finally I solved the multi-node training problem. Thanks to @codeislife99 for the help.

code:

diff --git a/torch_xla/distributed/xla_multiprocessing.py b/torch_xla/distributed/xla_multiprocessing.py
index a9a3955..e2cdae2 100644
--- a/torch_xla/distributed/xla_multiprocessing.py
+++ b/torch_xla/distributed/xla_multiprocessing.py
@@ -208,17 +208,17 @@ def _pre_fork_setup(num_devices):
         socket.getfqdn(),
         xu.get_free_tcp_ports()[0])
   if dev_kind == 'GPU':
-    _setup_workers(num_devices)
-    _create_gpu_devices(num_devices)
-  elif dev_kind == 'CPU':
-    _pre_fork_cpu_setup(num_devices)
-  _pre_fork_setup_torch_distributed()
+#    _setup_workers(num_devices)
+#    _create_gpu_devices(num_devices)
+#  elif dev_kind == 'CPU':
+#    _pre_fork_cpu_setup(num_devices)
+#  _pre_fork_setup_torch_distributed()
+    pass
   return PreForkConfig(dev_kind=dev_kind, num_devices=num_devices)

 def _setup_gpu_worker(index, gindex):
-  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(
-      _get_mp_device_ordinal(index, gindex))
+  os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
   os.environ[xenv.LOCAL_WORKER] = '{}:{}'.format(_LOCAL_WORKER, gindex)
   # Every process is restricted to 1 GPU device, which in such process will be
   # named XLA_GPU:0.

env:

os.environ['XRT_WORKERS'] = "localservice:0;grpc://33.64.64.12:46761|localservice:1;grpc://33.64.64.113:36607" 
os.environ['GPU_NUM_DEVICES'] = '1'

os.environ['XRT_DEVICE_MAP'] = "GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|" \
                               "GPU:1;/job:localservice/replica:0/task:1/device:XLA_GPU:0"

hosts = os.environ["WORLD_INFO"].split(",")[0]
ip = hosts.split(":")[0]
port = int(hosts.split(":")[1]) + 1
os.environ["XRT_MESH_SERVICE_ADDRESS"] = f"{ip}:{port}"

os.environ['XRT_HOST_ORDINAL'] = os.environ['RANK']
os.environ['XRT_LOCAL_WORKER'] = 'localservice:' + os.environ['RANK']
os.environ['XRT_SHARD_WORLD_SIZE'] = os.environ['WORLD_SIZE']
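
launch (a minimal sketch of the per-node launch this env setup assumes: one process per node since GPU_NUM_DEVICES=1; the _mp_fn body and prints are illustrative, not part of the original setup):

import os
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
  # With the patch above, xmp.spawn no longer calls _setup_workers /
  # _create_gpu_devices, so all topology comes from the env vars set earlier.
  device = xm.xla_device()
  print('host', os.environ['XRT_HOST_ORDINAL'],
        'replica', xm.get_ordinal(), 'device', device)
  # ... normal training loop on `device` ...

if __name__ == '__main__':
  xmp.spawn(_mp_fn, nprocs=1)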