I have also experienced DataParallel hanging in 0.4.
Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.1.85
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 390.48
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy (1.14.2)
[pip3] torch (0.4.0)
[pip3] torchvision (0.2.1)
[conda] Could not collect
Extra info: I'm not using conda, and it seemingly only hangs when using 4, not 2 GPUs.
cc: @teng-li who knows all NCCL2 caveats cc: @ngimel who knows everything about NVIDIA-things
Debugging a little more, I found that the function call ncclCommInitAll(comms.get(), devices.size(), devices.data())
located at https://github.com/pytorch/pytorch/blob/master/torch/csrc/cuda/nccl.cpp#L29 is the one getting stuck.
I am happy to work on a fix if I get some pointers. Thank you!
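(Not from the thread, just an isolation idea: a minimal sketch, assuming the torch.cuda.nccl wrapper available in 0.4, that exercises the same ncclCommInitAll path without going through DataParallel, to check whether the hang is in NCCL initialization itself.)

```python
import torch
import torch.cuda.nccl as nccl

# Sketch: trigger NCCL communicator creation (ncclCommInitAll) directly,
# bypassing DataParallel. One tensor per visible GPU; the first collective
# call creates the communicators, which is where the hang is suspected.
ngpus = torch.cuda.device_count()
tensors = [torch.ones(1024, device=torch.device('cuda', i)) for i in range(ngpus)]

nccl.all_reduce(tensors)                    # hangs here if ncclCommInitAll hangs
torch.cuda.synchronize()
print([int(t[0].item()) for t in tensors])  # expect ngpus on every GPU
```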
Could not repro on 2, 4, or 8 GPUs, neither in an environment with no CUDA libraries installed (only the pytorch conda package) nor with an existing toolkit installation.
[root@8017507a5cad playground]# python collect_env.py
Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 8.0.61
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
CMake version: version 2.8.12.2
Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: 8.0.61
GPU models and configuration:
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB
GPU 4: Tesla P100-SXM2-16GB
GPU 5: Tesla P100-SXM2-16GB
GPU 6: Tesla P100-SXM2-16GB
GPU 7: Tesla P100-SXM2-16GB
Nvidia driver version: 384.125
cuDNN version: Probably one of the following:
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.7.1.2
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn_static.a
/usr/local/cuda-9.0/lib64/libcudnn.so.7.1.2
/usr/local/cuda-9.0/lib64/libcudnn_static.a
/usr/local/cuda-9.1/lib64/libcudnn.so.7.1.2
/usr/local/cuda-9.1/lib64/libcudnn_static.a
Versions of relevant libraries:
[pip] numpy (1.14.2)
[pip] torch (0.4.0)
[pip] torchvision (0.2.1)
[conda] pytorch 0.4.0 py27_cuda8.0.61_cudnn7.1.2_1 pytorch
[conda] torchvision 0.2.1 py27_1 pytorch
I also can't repro with pytorch cuda 9.1 + 390.30 driver. Can people who can repro please run with export NCCL_DEBUG=INFO
and post the output?
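(A small sketch, not from the thread, showing one way to do this: NCCL reads NCCL_DEBUG when the communicators are created, so either export it in the shell before launching the script or set it in the process before the first multi-GPU call.)

```python
import os

# Sketch: set NCCL_DEBUG before the first NCCL communicator is created
# (equivalent to `export NCCL_DEBUG=INFO` in the shell).
os.environ["NCCL_DEBUG"] = "INFO"

import torch
import torch.nn as nn

# Illustrative model; the NCCL INFO lines appear during the first forward pass.
model = nn.DataParallel(nn.Linear(10, 10)).cuda()
out = model(torch.randn(8, 10).cuda())
print(out.shape)
```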
Hello @ngimel, sorry for the delay. Here is the output I see for the snippet I posted. NOTE: I have replaced the hostname, port, and IP address.
hostname:portnum [0] INFO Using internal Network Socket
hostname:portnum [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.1.15+cuda8.0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/tmp/6VmhvL/62:/tmp/6VmhvL/61:/tmp/6VmhvL/60:/tmp/6VmhvL/59:/tmp/6VmhvL/58:/tmp/6VmhvL/57:/tmp/6VmhvL/56:/tmp/6VmhvL/55:/tmp/6VmhvL/54:/tmp/6VmhvL/53:/tmp/6VmhvL\
/52:/tmp/6VmhvL/51:/tmp/6VmhvL/50:/tmp/6VmhvL/49:/tmp/6VmhvL/48:/tmp/6VmhvL/47:/tmp/6VmhvL/46:/tmp/6VmhvL/45:/tmp/6VmhvL/44:/tmp/6VmhvL/43:/tmp/6VmhvL/42:/tmp/6VmhvL/41:/tmp/6VmhvL/40:/tmp/6VmhvL/39:/tmp/6VmhvL/38:/tmp/6VmhvL/37:/tmp/6Vm\
hvL/36:/tmp/6VmhvL/35:/tmp/6VmhvL/34:/tmp/6VmhvL/33:/tmp/6VmhvL/32:/tmp/6V'
Unexpected end of /proc/mounts line `mhvL/31:/tmp/6VmhvL/30:/tmp/6VmhvL/29:/tmp/6VmhvL/28:/tmp/6VmhvL/27:/tmp/6VmhvL/26:/tmp/6VmhvL/25:/tmp/6VmhvL/24:/tmp/6VmhvL/23:/tmp/6VmhvL/22:/tmp/6VmhvL/21:/tmp/6VmhvL/20:/tmp/6VmhvL/19:/tmp/6VmhvL/\
18:/tmp/6VmhvL/17:/tmp/6VmhvL/16:/tmp/6VmhvL/15:/tmp/6VmhvL/14:/tmp/6VmhvL/13:/tmp/6VmhvL/12:/tmp/6VmhvL/11:/tmp/6VmhvL/10:/tmp/6VmhvL/9:/tmp/6VmhvL/8:/tmp/6VmhvL/7:/tmp/6VmhvL/6:/tmp/6VmhvL/5:/tmp/6VmhvL/4:/tmp/6VmhvL/3:/tmp/6VmhvL/2:/t\
mp/6VmhvL/1:/tmp/6VmhvL/0,upperdir=/mnt/01/mesos_work/provisioner/containe'
hostname:portnum [0] INFO NET : Using interface eth0:ipaddress<0>
hostname:portnum [0] INFO NET/Socket : 1 interfaces found
hostname:portnum [1] INFO Using 256 threads
hostname:portnum [1] INFO Min Comp Cap 6
hostname:portnum [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
hostname:portnum [1] INFO Ring 00 : 0 1
hostname:portnum [0] INFO 1 -> 0 via NET/Socket/0
hostname:portnum [1] INFO 0 -> 1 via NET/Socket/0
Hi @klshrinidhi -- there is indeed something wrong here. The two GPUs you are using are supposed to be on the same machine (and even managed within the same process), yet NCCL detects them as two different machines and tries to use NET/Socket (the trace should show P2P or SHM, not NET).
Can you confirm that the "hostname:portnum"(*) is the same on both GPUs, i.e. on lines with [0] and lines with [1]?
I also assume you don't have NCCL_SHM_DISABLE=1 or NCCL_P2P_DISABLE=1 set?
Anyway, feel free to open a bug on https://developer.nvidia.com/user (My bugs) and select "Deep Learning Toolkit / NCCL" as the relevant area so that we can investigate further. Thanks!
(*) it is actually hostname:PID
Thanks @sjeaugey,
Yes, I just double-checked that hostname:portnum is exactly the same for both GPUs, which are physically on the same machine. I also checked that the env vars you mention are not set; the only NCCL env var I have set is NCCL_DEBUG=INFO.
I have reported the bug --> https://developer.nvidia.com/nvidia_bug/2111134
@sjeaugey pointed out that in my case the two vars NCCL_SHM_DISABLE=1 and NCCL_P2P_DISABLE=1 were set in /etc/nccl.conf. Removing these two lines solved the problem. I will close this issue now. Thanks @sjeaugey !!
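(Side note for others landing here: a quick diagnostic sketch, not from the thread, to check both places NCCL reads these settings from, the process environment and /etc/nccl.conf.)

```python
import os

# Sketch: NCCL honours NCCL_SHM_DISABLE / NCCL_P2P_DISABLE whether they come
# from the environment or from /etc/nccl.conf, so inspect both.
for var in ("NCCL_SHM_DISABLE", "NCCL_P2P_DISABLE"):
    print(var, "=", os.environ.get(var, "<not set>"))

conf_path = "/etc/nccl.conf"
if os.path.exists(conf_path):
    print("--- %s ---" % conf_path)
    with open(conf_path) as f:
        print(f.read())
else:
    print("no", conf_path, "found")
```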
I have this same problem with NCCL: the code in the original post does not return. klshrinidhi's solution does not fix it for me. I do not have an /etc/nccl.conf on Ubuntu 18.04, but I did unset those environment variables, and I get this:
NCCL version 2.1.15+cuda9.0
rig:2637:2637 [0] INFO NET : Using interface enp0s31f6:192.168.85.32<0>
rig:2637:2637 [0] INFO NET/Socket : 1 interfaces found
rig:2637:2637 [3] INFO Using 256 threads
rig:2637:2637 [3] INFO Min Comp Cap 6
rig:2637:2637 [3] INFO NCCL_SINGLE_RING_THRESHOLD=131072
rig:2637:2637 [3] INFO Ring 00 : 0 1 2 3
rig:2637:2637 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
rig:2637:2637 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
rig:2637:2637 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
rig:2637:2637 [3] INFO Ring 00 : 3[3] -> 0[0] via P2P/direct pointer
rig:2637:2637 [0] INFO Launch mode Group/CGMD
^C gives:
File "/home/minimumnz/anaconda3/envs/tacotron/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
It seems stuck in some thread.
Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04 LTS
GCC version: (Ubuntu 7.3.0-16ubuntu3) 7.3.0
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti
Nvidia driver version: 390.67
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
Versions of relevant libraries:
[pip] numpy (1.14.5)
[pip] torch (0.4.0)
[pip] torchvision (0.2.1)
[conda] cuda90 1.0 h6433d27_0 pytorch
[conda] pytorch 0.4.0 py36_cuda9.0.176_cudnn7.1.2_1 [cuda90] pytorch
[conda] torch 0.4.0
Same problem here, Ubuntu 18, Python 3.5:
neuromancer:13802:13802 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
neuromancer:13802:13802 [1] INFO Using internal Network Socket
neuromancer:13802:13802 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.2.13+cuda9.2
neuromancer:13802:13802 [0] INFO comm 0xb25bf840 rank 0 nranks 2
neuromancer:13802:13802 [1] INFO comm 0xb3886c70 rank 1 nranks 2
neuromancer:13802:13802 [0] INFO NET : Using interface enp8s0:192.168.101.194<0>
neuromancer:13802:13802 [0] INFO NET : Using interface veesion:192.168.151.19<0>
neuromancer:13802:13802 [0] INFO NET : Using interface docker0:172.17.0.1<0>
neuromancer:13802:13802 [0] INFO NET : Using interface veth79b37a6:fe80::8ce8:bff:fe47:e52d%veth79b37a6<0>
neuromancer:13802:13802 [0] INFO NET/Socket : 4 interfaces found
neuromancer:13802:13802 [1] INFO Using 256 threads
neuromancer:13802:13802 [1] INFO Min Comp Cap 6
neuromancer:13802:13802 [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
neuromancer:13802:13802 [1] INFO Ring 00 : 0 1
neuromancer:13802:13802 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
neuromancer:13802:13802 [1] INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
neuromancer:13802:13802 [0] INFO Launch mode Group/CGMD
*hangs*
I see the issue is closed. Does that mean there is a workaround? Thanks for sharing :)
@jhagege
@sjeaugey pointed out that in my case the two vars NCCL_SHM_DISABLE=1 and NCCL_P2P_DISABLE=1 were set in /etc/nccl.conf. Removing these two lines solved the problem. I will close this issue now. Thanks @sjeaugey !!
This is why it's closed, although it's uncertain whether everyone here has the same problem as the OP. I am not working on the machine I had the issue on at the moment, so I haven't checked. Anyway, it is probably worth trying this solution and opening a new issue if it does not fix your problem.
For anyone facing this or a similar issue who landed on this page, check out this solution on a similar issue.
Tl;dr: disable the IOMMU by changing/adding the line GRUB_CMDLINE_LINUX="iommu=soft" in /etc/default/grub and rebooting. This solved an issue with NCCL that presented the same symptoms for me after upgrading to driver v396.
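(A quick sanity check, not from the original comment: after editing /etc/default/grub, running update-grub, and rebooting, the flag should show up on the kernel command line, which you can verify like this.)

```python
# Sketch: confirm the kernel actually booted with iommu=soft
# (the flag added to GRUB_CMDLINE_LINUX above).
with open("/proc/cmdline") as f:
    cmdline = f.read().split()

print(" ".join(cmdline))
print("iommu=soft active:", "iommu=soft" in cmdline)
```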
Issue description
The snippet below hangs with PyTorch 0.4 but successfully finishes with PyTorch 0.3.1. I found that removing
model = nn.DataParallel(model).cuda()
allows the snippet to pass.
Code example
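(The original snippet is not preserved in this copy of the issue; below is an illustrative sketch of the kind of repro described above. The model, sizes, and batch are made up, not the reporter's actual code.)

```python
import torch
import torch.nn as nn

# Illustrative repro sketch: wrapping the model in DataParallel makes the
# first forward pass set up NCCL communicators across all visible GPUs,
# which is where the reported hang occurs.
model = nn.Linear(10, 10)
model = nn.DataParallel(model).cuda()   # removing this line lets the snippet finish

x = torch.randn(8, 10).cuda()
y = model(x)                            # hangs here on affected setups
print(y.shape)
```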
Running the above inside a Docker container produces the following output.
System Info
OS: Debian GNU/Linux 8 (jessie)
GCC version: (Debian 4.9.2-10) 4.9.2
CMake version: version 3.0.2
Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: 8.0.61
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
Nvidia driver version: 390.25
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.2
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a
Versions of relevant libraries:
[pip] numpy (1.14.2)
[pip] torch (0.4.0)
[pip] torchvision (0.2.1)
[conda] magma-cuda80 2.3.0 1 soumith
[conda] pytorch 0.4.0 py27_cuda8.0.61_cudnn7.1.2_1 pytorch
[conda] torchvision 0.2.1 py27_1 pytorch