@yaochengji - @JackCaoG mentioned that you had worked on this issue previously. Do you have any insights on this?
Hi @codeislife99, I didn't use xmp.spawn directly. What I did was set up the torch/xla environment variables manually, such as MP_DEVICE and LOCAL_WORKER, and call Python's multi-process launch function.
For your problem, I guess it is because each process actually ran on the same GPU. To confirm, you could export TF_FORCE_GPU_ALLOW_GROWTH=1 and try the mnist example again.
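A bare-bones sketch of that manual setup (the worker name, device counts, and the train_fn entry point are placeholders, not my actual script; additional XRT_* variables such as XRT_WORKERS and XRT_DEVICE_MAP also need to be set so the XRT client can find its devices):

import os
import torch.multiprocessing as mp
import torch_xla.core.xla_env_vars as xenv

def _worker(index, num_devices):
    # Global ordinal of this process across all nodes.
    host_ordinal = int(os.environ.get('XRT_HOST_ORDINAL', '0'))
    gindex = host_ordinal * num_devices + index

    # Per-process env setup, mirroring what _setup_gpu_worker does in
    # torch_xla/distributed/xla_multiprocessing.py.
    os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
    os.environ[xenv.LOCAL_WORKER] = 'localservice:{}'.format(gindex)

    # Import torch_xla only after the environment is configured.
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
    train_fn(device)

def train_fn(device):
    print('training on', device)  # placeholder for the real training loop

if __name__ == '__main__':
    num_devices = 4  # GPUs per node, illustrative
    mp.spawn(_worker, args=(num_devices,), nprocs=num_devices)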
Yeah, with that environment variable the mnist example ran successfully. So this means that all the processes were on the same GPU and weren't actually using multiple GPUs. It's interesting because when I use the same script without the env variable for single-node multi-GPU, it successfully uses multiple GPUs (I verified this with GPU memory usage) and works without problems; only the multi-node setup causes problems. Do you have an example script that I can follow along to make my multi-node setup work?
Oh, @codeislife99, sorry that I forgot to reply to you. The env setup script is integrated into our product code and cannot easily be shared externally. What I suggest is that you learn from torch_xla/distributed/xla_multiprocessing.py, where xmp.spawn wraps the env setup and process-launching code, and adapt it for distributed GPU running.
@yaochengji - So I have been trying to do that over the last few days. I have been using torch.distributed.launch to run the distributed processes, and I added dist.init_process_group('nccl', init_method='env://', rank=gindex, world_size=args[0].total_num_gpus) to _start_fn. However, while the distributed setup now works, I am still not sure whether the gradients are synchronized across the two nodes. Do you mind giving me some pointers or helping me out in this direction? I would really appreciate it. Thanks.
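Concretely, the addition looks like this (pulled out into a standalone helper here for readability; gindex is the global process index computed by the launcher, and the world size is the total GPU count passed in via args):

import torch.distributed as dist

def init_torch_distributed(gindex, world_size):
    # Give torch.distributed a NCCL process group so its collectives can run
    # across both nodes.  With init_method='env://', MASTER_ADDR and MASTER_PORT
    # are read from the environment (torch.distributed.launch exports them);
    # rank and world_size are passed explicitly here.
    dist.init_process_group(
        'nccl',
        init_method='env://',
        rank=gindex,
        world_size=world_size)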
@codeislife99 you could try the patch below, it should work.
---
third_party/xla_client/xrt_computation_client.cc | 15 ---------------
torch_xla/distributed/xla_multiprocessing.py | 11 +++--------
2 files changed, 3 insertions(+), 23 deletions(-)
diff --git a/third_party/xla_client/xrt_computation_client.cc b/third_party/xla_client/xrt_computation_client.cc
index 0d65f3f3..d9a28740 100644
--- a/third_party/xla_client/xrt_computation_client.cc
+++ b/third_party/xla_client/xrt_computation_client.cc
@@ -1831,21 +1831,6 @@ tensorflow::ConfigProto XrtComputationClient::CreateConfigProto(
const Options& options) {
static const std::string* const grpc_proto = new std::string("grpc://");
tensorflow::ConfigProto config;
- if (options.workers_map.size() > 1) {
- tensorflow::ClusterDef* cluster_def = config.mutable_cluster_def();
- std::map<std::string, tensorflow::JobDef*> jobs;
- for (auto& worker_target : options.workers_map) {
- auto it = jobs.find(worker_target.first.name);
- if (it == jobs.end()) {
- tensorflow::JobDef* job = cluster_def->add_job();
- job->set_name(worker_target.first.name);
- it = jobs.emplace(worker_target.first.name, job).first;
- }
- tensorflow::JobDef* job = it->second;
- (*job->mutable_tasks())[worker_target.first.task_no] =
- StripPrefix(worker_target.second, *grpc_proto);
- }
- }
return config;
}
diff --git a/torch_xla/distributed/xla_multiprocessing.py b/torch_xla/distributed/xla_multiprocessing.py
index e67d92d8..aae4e815 100644
--- a/torch_xla/distributed/xla_multiprocessing.py
+++ b/torch_xla/distributed/xla_multiprocessing.py
@@ -207,17 +207,12 @@ def _pre_fork_setup(num_devices):
socket.getfqdn(),
xu.get_free_tcp_ports()[0])
if dev_kind == 'GPU':
- _setup_workers(num_devices)
- _create_gpu_devices(num_devices)
- elif dev_kind == 'CPU':
- _pre_fork_cpu_setup(num_devices)
- _pre_fork_setup_torch_distributed()
+ pass
return PreForkConfig(dev_kind=dev_kind, num_devices=num_devices)
-def _setup_gpu_worker(index, gindex):
- os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(
- _get_mp_device_ordinal(index, gindex))
+def _setup_gpu_worker(index, gindex, pf_cfg):
+ os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
os.environ[xenv.LOCAL_WORKER] = '{}:{}'.format(_LOCAL_WORKER, gindex)
# Every process is restricted to 1 GPU device, which in such process will be
# named XLA_GPU:0.
--
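With this patch, launching still goes through xmp.spawn on each node; a bare-bones driver (the _mp_fn body and the nprocs value are placeholders) looks something like:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    print('process', index, 'runs on', device)
    # model / optimizer setup and the training loop would go here

if __name__ == '__main__':
    # nprocs should match the number of GPUs on this node (GPU_NUM_DEVICES).
    xmp.spawn(_mp_fn, args=(), nprocs=4)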
Hey @yaochengji, thanks for your help. I have a small question: are you using torch.distributed.launch or torchrun for multi-node training? I ask because whenever I use one of them it forces me to call dist.init_process_group(...), or else it throws an error saying it was not found. Another question is about the model wrapper:
model = nn.parallel.DistributedDataParallel(model, device_ids=[<device_id>])
When I do this, it says Tensors must be CUDA or dense. Are you using xla's data parallel wrapper in the way outlined in the test https://github.com/pytorch/xla/blob/master/test/test_operations.py#L609? That data parallel wrapper seems to be similar to DataParallel from torch, which is only for single-node multi-GPU.
Do I have to create my own nn.parallel.DistributedDataParallel wrapper for multi-node multi-GPU? cc: @JackCaoG
When I try to comment out the TORCH_CHECK for Tensors must be CUDA or dense and the other checks (since we are running on XLA), I get XLA tensors do not have storage. So I am not sure the DistributedDataParallel module is even usable with torch xla. If not, is there any replacement for it within XLA?
It seems much of this is because reducer.cpp works with native PT tensors. These would need to be bridged to XLATensors, and the DDP module in native PT would potentially need quite a bit of rewriting to be compatible with XLA. Has anyone attempted this before, or am I wrong here?
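For context, the torch_xla test scripts (e.g. test_train_mp_mnist.py) don't wrap the model in DDP at all; gradient synchronization there comes from xm.optimizer_step, which all-reduces the gradients across replicas before the optimizer applies them. A rough sketch of that pattern, with the model, loss, and loader as placeholders:

import torch.optim as optim
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def train_one_epoch(model, loader, device):
    # `model` is a plain nn.Module already moved to the XLA device; no DDP wrapper.
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    device_loader = pl.ParallelLoader(loader, [device]).per_device_loader(device)
    for data, target in device_loader:
        optimizer.zero_grad()
        loss = model(data, target)  # placeholder: assume the model returns its loss
        loss.backward()
        # All-reduces the gradients across replicas, then calls optimizer.step().
        xm.optimizer_step(optimizer)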
Hi @codeislife99, I didn't use torch.distributed.launch or DistributedDataParallel. I didn't use much of the pytorch/xla distributed wrappers.
Hi, @yaochengji. Thanks for your code. I arranged the environment variables manually to launch the training processes, and it works well without xmp.spawn. But I didn't make the C++ code change above. I wonder why you deleted that code? Is there something I didn't notice?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@codeislife99 Have you solved this problem? I have the same question. After I set XRT_DEVICE_MAP & XRT_LOCAL_WORKER and edited the files xla_client/xrt_computation_client.cc and xla_multiprocessing.py, I get this error log:
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:329 : Check failed: impl->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
Yes, I am able to run multi-GPU and multi-node training successfully. There is no good documentation around it unfortunately, so I can understand your trouble. You need to make changes to certain environment variables. I am a bit busy right now, but I will share the env variable changes here once I get a chance.
@codeislife99 Thank you very much for sharing. Looking forward to your reply.
Assuming you have two nodes with 4 GPUs on each node: I first apply the following patch to xla_multiprocessing.py and then configure a number of environment variables.
--- a/torch_xla/distributed/xla_multiprocessing.py
+++ b/torch_xla/distributed/xla_multiprocessing.py
@@ -154,13 +154,13 @@ def _setup_workers(num_devices):
wcfg), 'World size ({}) must match the configured workers ({})'.format(
world_size, len(wcfg))
for h, worker in enumerate(wcfg):
- m = re.match(r'(.*):(\d+)$', worker.host_port)
+ m = re.match(r'(.*):(\d+)$', wcfg[worker].host_port)
if not m:
raise RuntimeError('Bad worker HOST:PORT format: {}'.format(
worker.host_port))
for i in range(0, num_devices):
gindex = h * num_devices + i
- workers.append('{}:{};grpc://{}:{}'.format(worker.worker_name, gindex,
+ workers.append('{}:{};grpc://{}:{}'.format(wcfg[worker].worker_name, gindex,
m.group(1),
int(m.group(2)) + i))
else:
@@ -216,8 +216,7 @@ def _pre_fork_setup(num_devices):
def _setup_gpu_worker(index, gindex):
- os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(
- _get_mp_device_ordinal(index, gindex))
+ os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
os.environ[xenv.LOCAL_WORKER] = '{}:{}'.format(_LOCAL_WORKER, gindex)
# Every process is restricted to 1 GPU device, which in such process will be
# named XLA_GPU:0.
On both nodes:
export GPU_NUM_DEVICES=4
export XRT_SHARD_WORLD_SIZE=2
export XRT_MESH_SERVICE_ADDRESS="ip-172-31-31-102.us-west-2.compute.internal:53957"
export XRT_WORKERS="localservice:0;34.219.116.56:56747|localservice:1;34.222.146.81:56748"
On First Node:
export XRT_HOST_ORDINAL=0
On Second Node:
export XRT_HOST_ORDINAL=1
python3 test_train_mp_mnist.py --num_worker 0
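If it is easier to drop into a launcher script, the same setup can be expressed as a small Python helper (the addresses, ports, and device counts are just the example values above):

import os
import sys

def configure_xrt_env(node_rank):
    # GPUs per node.
    os.environ['GPU_NUM_DEVICES'] = '4'
    # Number of participating hosts.
    os.environ['XRT_SHARD_WORLD_SIZE'] = '2'
    # Mesh service master address, reachable from both nodes.
    os.environ['XRT_MESH_SERVICE_ADDRESS'] = (
        'ip-172-31-31-102.us-west-2.compute.internal:53957')
    # One worker entry per host: <name>:<task>;<host>:<port>.
    os.environ['XRT_WORKERS'] = (
        'localservice:0;34.219.116.56:56747|localservice:1;34.222.146.81:56748')
    # 0 on the first node, 1 on the second.
    os.environ['XRT_HOST_ORDINAL'] = str(node_rank)

if __name__ == '__main__':
    configure_xrt_env(int(sys.argv[1]))
    # Start the training script only after the environment is configured,
    # e.g. by importing test_train_mp_mnist and calling its entry point.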
@codeislife99 Got it!
Finally I solved the multi-node training problem. Thanks to @codeislife99 for the help.
code:
diff --git a/torch_xla/distributed/xla_multiprocessing.py b/torch_xla/distributed/xla_multiprocessing.py
index a9a3955..e2cdae2 100644
--- a/torch_xla/distributed/xla_multiprocessing.py
+++ b/torch_xla/distributed/xla_multiprocessing.py
@@ -208,17 +208,17 @@ def _pre_fork_setup(num_devices):
socket.getfqdn(),
xu.get_free_tcp_ports()[0])
if dev_kind == 'GPU':
- _setup_workers(num_devices)
- _create_gpu_devices(num_devices)
- elif dev_kind == 'CPU':
- _pre_fork_cpu_setup(num_devices)
- _pre_fork_setup_torch_distributed()
+# _setup_workers(num_devices)
+# _create_gpu_devices(num_devices)
+# elif dev_kind == 'CPU':
+# _pre_fork_cpu_setup(num_devices)
+# _pre_fork_setup_torch_distributed()
+ pass
return PreForkConfig(dev_kind=dev_kind, num_devices=num_devices)
def _setup_gpu_worker(index, gindex):
- os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(
- _get_mp_device_ordinal(index, gindex))
+ os.environ[xenv.MP_DEVICE] = 'GPU:{}'.format(gindex)
os.environ[xenv.LOCAL_WORKER] = '{}:{}'.format(_LOCAL_WORKER, gindex)
# Every process is restricted to 1 GPU device, which in such process will be
# named XLA_GPU:0.
env:
os.environ['XRT_WORKERS'] = "localservice:0;grpc://33.64.64.12:46761|localservice:1;grpc://33.64.64.113:36607"
os.environ['GPU_NUM_DEVICES'] = '1'
os.environ['XRT_DEVICE_MAP'] = "GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|" \
"GPU:1;/job:localservice/replica:0/task:1/device:XLA_GPU:0"
hosts = os.environ["WORLD_INFO"].split(",")[0]
ip = hosts.split(":")[0]
port = int(hosts.split(":")[1]) + 1
os.environ["XRT_MESH_SERVICE_ADDRESS"] = f"{ip}:{port}"
os.environ['XRT_HOST_ORDINAL'] = os.environ['RANK']
os.environ['XRT_LOCAL_WORKER'] = 'localservice:' + os.environ['RANK']
os.environ['XRT_SHARD_WORLD_SIZE'] = os.environ['WORLD_SIZE']
🐛 Bug
Running XLA MultiGPU MultiNode configuration fails with XRT OOM for all models / configurations (including MNIST)
To Reproduce
Steps to reproduce the behavior:
Run the test_train_mp_mnist.py script in the test folder.
Environment
Additional context
The BFC Allocator allocates a large chunk of initial memory. Is it possible that multi-node training adds overhead on top of that which exceeds the GPU memory capacity? One way to verify this hypothesis is to reduce the size of that initial allocation. Other thoughts and ideas are appreciated.
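One quick way to probe that hypothesis is to switch the allocator to on-demand growth via TF_FORCE_GPU_ALLOW_GROWTH (the flag mentioned elsewhere in this thread), set before torch_xla initializes the GPU runtime; a minimal sketch (the value 1 was used above, TensorFlow documents true/false):

import os

# Must be set before the first torch_xla import so the GPU runtime picks it up.
os.environ.setdefault('TF_FORCE_GPU_ALLOW_GROWTH', 'true')

import torch_xla.core.xla_model as xm

device = xm.xla_device()
print('XLA device:', device)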