codeislife99 opened this issue 2 years ago
@codeislife99 FYI, there is a previous thread about the same issue: https://github.com/pytorch/xla/issues/3347. It looks like PyTorch recently requires CUDA_VISIBLE_DEVICES to be set prior to importing torch.
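For context, a minimal sketch of what "set prior to importing torch" means in a single process (the device index "0" is just an example):

```python
import os

# The mask must be in place before torch is imported; per the behavior
# described in this thread, a CUDA-built torch 1.12 inspects the visible
# devices at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # example: expose only GPU 0

import torch  # noqa: E402

print(torch.cuda.device_count())  # expected to print 1
```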
@JackCaoG Do you have any insights on this issue?
Hmm... can you set CUDA_VISIBLE_DEVICES before importing PyTorch? In our use case we always use the non-CUDA version of PyTorch, so we have not seen this issue before. Is it inconvenient because figuring out CUDA_VISIBLE_DEVICES per process is non-trivial?
How does PyTorch solve it though? Do they not have n processes for n GPUs?
PyTorch can do it because they are not doing distributed training using the trick XLA uses. For them it's native to torch itself: they use DDP modules instead of setting a different visible device for each process. Can you elaborate on how CUDA_VISIBLE_DEVICES can be set prior to importing torch? The code is structured so that the processes are spawned first and the CUDA device is attached to them afterwards.
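One way to reconcile this (a sketch only, not the PyTorch/XLA implementation; `worker.py` is a hypothetical training script) is to have the launcher set the variable in each child's environment before the child interpreter starts, so torch is never imported with the wrong mask:

```python
import os
import subprocess
import sys

NUM_GPUS = 8  # assumption: one worker per GPU on a single node

def launch_workers(script="worker.py"):
    """Spawn one process per GPU, each with its own CUDA_VISIBLE_DEVICES.

    Because the variable is set in the child's environment before its
    interpreter starts, `import torch` inside the worker only ever sees
    the single device assigned to it.
    """
    procs = []
    for rank in range(NUM_GPUS):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(rank))
        procs.append(subprocess.Popen([sys.executable, script, str(rank)], env=env))
    for proc in procs:
        proc.wait()

if __name__ == "__main__":
    launch_workers()
```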
Well, we also have a DDP module contributed by @hjm-aws in https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_backend.py, and I think we got it to work with GPU at some point. However, AFAIK this DDP module forwards the cc ops to the XLA Python version of the cc ops.
An alternative is to use the non-GPU version of PyTorch. We are trying to migrate to the new PJRT runtime; the CUDA_VISIBLE_DEVICES trick was mostly for XRT. We haven't planned the PJRT:GPU path too much yet. @cicirori FYI in case you run into the same issue.
@codeislife99 It looks like this is a bug in PyTorch 1.12 and has now been fixed: https://github.com/pytorch/pytorch/issues/80876. We need to figure out how to not rely on CUDA_VISIBLE_DEVICES to set the device in the future, as Jack suggested.
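For reference, the usual PyTorch-side pattern for not relying on CUDA_VISIBLE_DEVICES is to leave all GPUs visible and bind each worker to one device explicitly. This is only a sketch of that idea, not necessarily what PyTorch/XLA will adopt:

```python
import torch

def bind_worker_to_gpu(local_rank: int) -> torch.device:
    """Select a GPU by index instead of hiding the others via env vars."""
    torch.cuda.set_device(local_rank)        # all devices remain visible
    return torch.device("cuda", local_rank)  # e.g. cuda:3 for local_rank=3
```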
Excellent! That should solve the problem for now. I will try it out and update here, but it should work for sure. It is true that the CUDA_VISIBLE_DEVICES trick seems like a hack, and a better solution would improve the robustness of the code.
🐛 Bug
When PyTorch is compiled with CUDA enabled and torch_xla is compiled afterwards, running distributed training launches n×n processes and fails with the following OOM error.
https://logpaste.com/Bld77rSW
To Reproduce
GPU_NUM_DEVICES=8 python3 test_train_mp_mnist_amp.py --fake_data
from the xla/test folder.
Expected behavior
The script should launch 8 processes but launches 64 of them, 8 on each device, and crashes with an OOM error whose details can be seen in this log paste.
https://logpaste.com/Bld77rSW
Environment
PyTorch 1.12 compiled with CUDA
PT-XLA 1.12 compiled with CUDA
Additional context
Single-node distributed training fails with the 1.12 binaries: an n-GPU run launches n×n processes. The reason is that starting from 1.12, import torch directly initializes CUDA on all visible devices. Prior to this, you could set the environment variable within the code and trick each process into believing that there is only one GPU available; XLA uses this trick to do single-node distributed training, as seen here.

Essentially, import torch now exposes all the CUDA devices, which means setting CUDA_VISIBLE_DEVICES post facto is irrelevant. This implies that every process can now see every other device and launches n processes internally; since there are n of them, that makes n×n. Previously, each process only saw the CUDA device set by CUDA_VISIBLE_DEVICES for that particular process, so process 0 would see device 0, process 2 would see device 2, and so on. Now process 0 sees devices 0 through n-1, process 1 sees devices 0 through n-1, and so on. This happened because PyTorch upstream changed when CUDA devices are initialized, moving it to the top level (import torch) instead of when CUDA devices are actually used or tensors are allocated, which broke the assumption XLA was relying on for single-node distributed training.

This error does NOT exist in the publicly released images because they are built with PyTorch + CPU rather than PyTorch + CUDA, so their import torch does not check CUDA devices; hence this is a problem unique to CUDA builds of PyTorch. I think the solution here is to change how the single-node distributed training pathway works and how GPU IDs are initialized and set, but I wanted to have a broader discussion around that.
Special Thanks to @ymwangg for productive discussions on this issue.
Appendix
Consider this script, which when run on 8 GPUs outputs 8 instead of 4.
I think this is the reason why so many processes are being spawned.
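The referenced script is not reproduced above; a plausible minimal version consistent with the description (mask half the devices after importing torch, then count them) would look like this, where the "0,1,2,3" mask is an assumption for illustration:

```python
import os
import torch  # with a CUDA-built torch 1.12, CUDA is initialized here

# Masking devices *after* the import no longer has any effect, so on an
# 8-GPU machine this prints 8 rather than the expected 4.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

print(torch.cuda.device_count())
```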