
PyTorch 0.4 hangs with nn.DataParallel but PyTorch 0.3.1 does not #7019

Closed: klshrinidhi closed this issue 6 years ago

klshrinidhi commented 6 years ago

Issue description

The snippet below hangs with PyTorch 0.4 but finishes successfully with PyTorch 0.3.1. I found that removing model = nn.DataParallel(model).cuda() allows the snippet to complete.

Code example

import torch
import torch.nn as nn
from torch.autograd import Variable

class NET(nn.Module):
    def __init__(self):
        super(NET, self).__init__()
        self.dense = nn.Linear(256, 512)

    def forward(self, input):
        return self.dense(input)

if __name__ == '__main__':
    model = NET()
    model = nn.DataParallel(model).cuda()
    x = Variable(torch.rand(128, 256))
    y = model(x) ##### <<<<--- GETS STUCK HERE FOREVER

Running the above inside a Docker container produces the following output.

NCCL version 2.1.15+cuda8.0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/tmp/k6WOhB/60:/tmp/k6WOhB/59:/tmp/k6WOhB/58:/tmp/k6WOhB/57:/tmp/k6WOhB/56:/tmp/k6WOhB/55:/tmp/k6WOhB/54:/tmp/k6WOhB/53:/tmp/k6WOhB/52:/tmp/k6WOhB/51:/tmp/k6WOhB/50:/tmp/k6WOhB/49:/tmp/k6WOhB/48:/tmp/k6WOhB/47:/tmp/k6WOhB/46:/tmp/k6WOhB/45:/tmp/k6WOhB/44:/tmp/k6WOhB/43:/tmp/k6WOhB/42:/tmp/k6WOhB/41:/tmp/k6WOhB/40:/tmp/k6WOhB/39:/tmp/k6WOhB/38:/tmp/k6WOhB/37:/tmp/k6WOhB/36:/tmp/k6WOhB/35:/tmp/k6WOhB/34:/tmp/k6WOhB/33:/tmp/k6WOhB/32:/tmp/k6WOhB/31:/tmp/k6WOhB/30:/tmp/k6'
Unexpected end of /proc/mounts line `WOhB/29:/tmp/k6WOhB/28:/tmp/k6WOhB/27:/tmp/k6WOhB/26:/tmp/k6WOhB/25:/tmp/k6WOhB/24:/tmp/k6WOhB/23:/tmp/k6WOhB/22:/tmp/k6WOhB/21:/tmp/k6WOhB/20:/tmp/k6WOhB/19:/tmp/k6WOhB/18:/tmp/k6WOhB/17:/tmp/k6WOhB/16:/tmp/k6WOhB/15:/tmp/k6WOhB/14:/tmp/k6WOhB/13:/tmp/k6WOhB/12:/tmp/k6WOhB/11:/tmp/k6WOhB/10:/tmp/k6WOhB/9:/tmp/k6WOhB/8:/tmp/k6WOhB/7:/tmp/k6WOhB/6:/tmp/k6WOhB/5:/tmp/k6WOhB/4:/tmp/k6WOhB/3:/tmp/k6WOhB/2:/tmp/k6WOhB/1:/tmp/k6WOhB/0,upperdir=/mnt/01/mesos_work/provisioner/containers/4436f199-bf49-453f-9b95-e84'
<<<< STUCK HERE>>>

System Info

OS: Debian GNU/Linux 8 (jessie)
GCC version: (Debian 4.9.2-10) 4.9.2
CMake version: version 3.0.2

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: 8.0.61
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 390.25
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.2
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] numpy (1.14.2)
[pip] torch (0.4.0)
[pip] torchvision (0.2.1)
[conda] magma-cuda80 2.3.0 1 soumith
[conda] pytorch 0.4.0 py27_cuda8.0.61_cudnn7.1.2_1 pytorch
[conda] torchvision 0.2.1 py27_1 pytorch



Happy to provide any more information. 
reynoldscem commented 6 years ago

I have also experienced DataParallel hanging in 0.4.

Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.1.85

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti

Nvidia driver version: 390.48
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy (1.14.2)
[pip3] torch (0.4.0)
[pip3] torchvision (0.2.1)
[conda] Could not collect

Extra info: I'm not using conda, and it seemingly only hangs when using 4 GPUs, not 2.
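
To narrow that down, one option is to pin DataParallel to an explicit subset of devices and compare the 2-GPU and 4-GPU cases directly. A minimal sketch along those lines, reusing the model from the original post (the device_ids values are just illustrative):

import torch
import torch.nn as nn

class NET(nn.Module):
    def __init__(self):
        super(NET, self).__init__()
        self.dense = nn.Linear(256, 512)

    def forward(self, input):
        return self.dense(input)

if __name__ == '__main__':
    model = NET().cuda(0)
    # Try device_ids=[0, 1] first, then [0, 1, 2, 3], to confirm whether
    # only the 4-GPU configuration hangs on this machine.
    model = nn.DataParallel(model, device_ids=[0, 1])
    x = torch.rand(128, 256).cuda(0)
    y = model(x)
    print(y.shape)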

apaszke commented 6 years ago

cc: @teng-li who knows all NCCL2 caveats
cc: @ngimel who knows everything about NVIDIA-things

klshrinidhi commented 6 years ago

Debugging a little bit more, I found that the function call ncclCommInitAll(comms.get(), devices.size(), devices.data()) located at https://github.com/pytorch/pytorch/blob/master/torch/csrc/cuda/nccl.cpp#L29 is the one getting stuck. I am happy to work on a fix if I get some pointers. Thank you !!
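
One way to check whether the hang is specific to DataParallel or already occurs for a bare NCCL collective is to call the internal torch.cuda.nccl helpers directly, which should go through the same ncclCommInitAll path. A rough sketch, assuming the 0.4-era torch.cuda.nccl interface (these helpers are internal, so treat the exact calls as an assumption):

import torch
import torch.cuda.nccl as nccl

# One small tensor per visible GPU; the first collective lazily creates
# the NCCL communicators (i.e. calls ncclCommInitAll under the hood).
tensors = [torch.ones(4).cuda(i) for i in range(torch.cuda.device_count())]
print("nccl available: %s" % nccl.is_available(tensors))

nccl.all_reduce(tensors)  # if this also hangs, DataParallel is not the culprit
print([t[0].item() for t in tensors])  # forces a sync; expect the GPU count on every device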

ngimel commented 6 years ago

I could not repro on 2, 4, or 8 GPUs, either in an environment with no CUDA libraries installed (only the pytorch conda package) or with an existing toolkit installation.

[root@8017507a5cad playground]# python collect_env.py 
Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 8.0.61

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
CMake version: version 2.8.12.2

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: 8.0.61
GPU models and configuration: 
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB
GPU 4: Tesla P100-SXM2-16GB
GPU 5: Tesla P100-SXM2-16GB
GPU 6: Tesla P100-SXM2-16GB
GPU 7: Tesla P100-SXM2-16GB

Nvidia driver version: 384.125
cuDNN version: Probably one of the following:
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.7.1.2
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn_static.a
/usr/local/cuda-9.0/lib64/libcudnn.so.7.1.2
/usr/local/cuda-9.0/lib64/libcudnn_static.a
/usr/local/cuda-9.1/lib64/libcudnn.so.7.1.2
/usr/local/cuda-9.1/lib64/libcudnn_static.a

Versions of relevant libraries:
[pip] numpy (1.14.2)
[pip] torch (0.4.0)
[pip] torchvision (0.2.1)
[conda] pytorch                   0.4.0           py27_cuda8.0.61_cudnn7.1.2_1    pytorch
[conda] torchvision               0.2.1                    py27_1    pytorch
ngimel commented 6 years ago

I also can't repro with pytorch cuda 9.1 + 390.30 driver. Can people who can repro please run with export NCCL_DEBUG=INFO and post the output?
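
For reference, NCCL_DEBUG only has to be visible before the first collective runs, so it can also be set from inside the repro script instead of via export. A minimal sketch assuming PyTorch 0.4 semantics (the inline one-layer model is just illustrative):

import os
os.environ["NCCL_DEBUG"] = "INFO"  # must be set before NCCL creates its communicators

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(256, 512)).cuda()
x = torch.rand(128, 256).cuda()
y = model(x)  # the NCCL INFO trace should appear around this call
print(y.shape)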

klshrinidhi commented 6 years ago

Hello @ngimel, sorry for the delay. Here is the output I see for the snippet I posted. NOTE: I have replaced the hostname, port, and IP address.

hostname:portnum [0] INFO Using internal Network Socket
hostname:portnum [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.1.15+cuda8.0
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/tmp/6VmhvL/62:/tmp/6VmhvL/61:/tmp/6VmhvL/60:/tmp/6VmhvL/59:/tmp/6VmhvL/58:/tmp/6VmhvL/57:/tmp/6VmhvL/56:/tmp/6VmhvL/55:/tmp/6VmhvL/54:/tmp/6VmhvL/53:/tmp/6VmhvL\
/52:/tmp/6VmhvL/51:/tmp/6VmhvL/50:/tmp/6VmhvL/49:/tmp/6VmhvL/48:/tmp/6VmhvL/47:/tmp/6VmhvL/46:/tmp/6VmhvL/45:/tmp/6VmhvL/44:/tmp/6VmhvL/43:/tmp/6VmhvL/42:/tmp/6VmhvL/41:/tmp/6VmhvL/40:/tmp/6VmhvL/39:/tmp/6VmhvL/38:/tmp/6VmhvL/37:/tmp/6Vm\
hvL/36:/tmp/6VmhvL/35:/tmp/6VmhvL/34:/tmp/6VmhvL/33:/tmp/6VmhvL/32:/tmp/6V'
Unexpected end of /proc/mounts line `mhvL/31:/tmp/6VmhvL/30:/tmp/6VmhvL/29:/tmp/6VmhvL/28:/tmp/6VmhvL/27:/tmp/6VmhvL/26:/tmp/6VmhvL/25:/tmp/6VmhvL/24:/tmp/6VmhvL/23:/tmp/6VmhvL/22:/tmp/6VmhvL/21:/tmp/6VmhvL/20:/tmp/6VmhvL/19:/tmp/6VmhvL/\
18:/tmp/6VmhvL/17:/tmp/6VmhvL/16:/tmp/6VmhvL/15:/tmp/6VmhvL/14:/tmp/6VmhvL/13:/tmp/6VmhvL/12:/tmp/6VmhvL/11:/tmp/6VmhvL/10:/tmp/6VmhvL/9:/tmp/6VmhvL/8:/tmp/6VmhvL/7:/tmp/6VmhvL/6:/tmp/6VmhvL/5:/tmp/6VmhvL/4:/tmp/6VmhvL/3:/tmp/6VmhvL/2:/t\
mp/6VmhvL/1:/tmp/6VmhvL/0,upperdir=/mnt/01/mesos_work/provisioner/containe'
hostname:portnum [0] INFO NET : Using interface eth0:ipaddress<0>
hostname:portnum [0] INFO NET/Socket : 1 interfaces found
hostname:portnum [1] INFO Using 256 threads
hostname:portnum [1] INFO Min Comp Cap 6
hostname:portnum [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
hostname:portnum [1] INFO Ring 00 :    0   1
hostname:portnum [0] INFO 1 -> 0 via NET/Socket/0
hostname:portnum [1] INFO 0 -> 1 via NET/Socket/0
sjeaugey commented 6 years ago

Hi @klshrinidhi -- there is indeed something wrong here: the 2 GPUs you are using are supposed to be on the same machine (and are even managed within the same process), yet NCCL detects them as two different machines and tries to use NET/Socket (the trace should show P2P or SHM, not NET).

Can you confirm that the "hostname:portnum"(*) is the same on both GPUs, i.e. on lines with [0] and lines with [1]?

I also assume you haven't set NCCL_SHM_DISABLE=1 or NCCL_P2P_DISABLE=1?

Anyway, feel free to open a bug on https://developer.nvidia.com/user (My bugs) and select "Deep Learning Toolkit / NCCL" as the relevant area so that we can investigate further. Thanks!

(*) it is actually hostname:PID

klshrinidhi commented 6 years ago

Thanks @sjeaugey. Yes, I just double-checked that hostname:portnum is exactly the same for both GPUs, which are physically on the same machine. I also checked that the env vars you mention are not set. The only NCCL env var I have set is NCCL_DEBUG=INFO.

I have reported the bug --> https://developer.nvidia.com/nvidia_bug/2111134

klshrinidhi commented 6 years ago

@sjeaugey pointed out that in my case the two vars NCCL_SHM_DISABLE=1 and NCCL_P2P_DISABLE=1 were set in /etc/nccl.conf. Removing these two lines solved the problem. I will close this issue now. Thanks @sjeaugey !!
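
For anyone else landing here, a quick way to look for the same overrides, both in the process environment and in /etc/nccl.conf (which NCCL reads if present), is a small check like this sketch:

import os

# The two settings that caused the hang in this thread: they push NCCL off
# the P2P/SHM paths and onto NET/Socket even for GPUs in the same machine.
suspects = ("NCCL_SHM_DISABLE", "NCCL_P2P_DISABLE")

for var in suspects:
    print("%s=%s" % (var, os.environ.get(var, "<unset>")))

conf_path = "/etc/nccl.conf"
if os.path.isfile(conf_path):
    with open(conf_path) as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(v) for v in suspects):
                print("%s sets: %s" % (conf_path, line))
else:
    print("%s not present" % conf_path)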

minimumnz commented 6 years ago

I have this same problem with NCCL. The code in the original post does not return, and klshrinidhi's solution does not fix it for me. I do not have an /etc/nccl.conf on Ubuntu 18.04, but I did unset those environment variables and I get this:

NCCL version 2.1.15+cuda9.0
rig:2637:2637 [0] INFO NET : Using interface enp0s31f6:192.168.85.32<0>
rig:2637:2637 [0] INFO NET/Socket : 1 interfaces found
rig:2637:2637 [3] INFO Using 256 threads
rig:2637:2637 [3] INFO Min Comp Cap 6
rig:2637:2637 [3] INFO NCCL_SINGLE_RING_THRESHOLD=131072
rig:2637:2637 [3] INFO Ring 00 :    0   1   2   3
rig:2637:2637 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
rig:2637:2637 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
rig:2637:2637 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
rig:2637:2637 [3] INFO Ring 00 : 3[3] -> 0[0] via P2P/direct pointer
rig:2637:2637 [0] INFO Launch mode Group/CGMD

^C gives:

  File "/home/minimumnz/anaconda3/envs/tacotron/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):

It seems stuck in some thread.

Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04 LTS
GCC version: (Ubuntu 7.3.0-16ubuntu3) 7.3.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti
GPU 3: GeForce GTX 1080 Ti

Nvidia driver version: 390.67
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] numpy (1.14.5)
[pip] torch (0.4.0)
[pip] torchvision (0.2.1)
[conda] cuda90 1.0 h6433d27_0 pytorch
[conda] pytorch 0.4.0 py36_cuda9.0.176_cudnn7.1.2_1 [cuda90] pytorch
[conda] torch 0.4.0
[conda] torchvision 0.2.1 py36_1 pytorch

dmenig commented 5 years ago

Same problem here: Ubuntu 18, Python 3.5.

neuromancer:13802:13802 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
neuromancer:13802:13802 [1] INFO Using internal Network Socket
neuromancer:13802:13802 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.2.13+cuda9.2
neuromancer:13802:13802 [0] INFO comm 0xb25bf840 rank 0 nranks 2
neuromancer:13802:13802 [1] INFO comm 0xb3886c70 rank 1 nranks 2
neuromancer:13802:13802 [0] INFO NET : Using interface enp8s0:192.168.101.194<0>
neuromancer:13802:13802 [0] INFO NET : Using interface veesion:192.168.151.19<0>
neuromancer:13802:13802 [0] INFO NET : Using interface docker0:172.17.0.1<0>
neuromancer:13802:13802 [0] INFO NET : Using interface veth79b37a6:fe80::8ce8:bff:fe47:e52d%veth79b37a6<0>
neuromancer:13802:13802 [0] INFO NET/Socket : 4 interfaces found
neuromancer:13802:13802 [1] INFO Using 256 threads
neuromancer:13802:13802 [1] INFO Min Comp Cap 6
neuromancer:13802:13802 [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
neuromancer:13802:13802 [1] INFO Ring 00 :    0   1
neuromancer:13802:13802 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
neuromancer:13802:13802 [1] INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
neuromancer:13802:13802 [0] INFO Launch mode Group/CGMD
*hangs*
jhagege commented 5 years ago

I see the issue is closed. Does that mean there is a workaround? Thanks for sharing :)

reynoldscem commented 5 years ago

@jhagege

@sjeaugey pointed out that in my case the two vars NCCL_SHM_DISABLE=1 and NCCL_P2P_DISABLE=1 were set in /etc/nccl.conf. Removing these two lines solved the problem. I will close this issue now. Thanks @sjeaugey !!

This is why it's closed, although it's uncertain whether everyone here has the same problem as the OP. I am not working on the machine where I had the issue at the moment, so I haven't checked. Anyway, it's probably worth trying this solution and opening a new issue if it does not fix your problem.

Shappenny commented 5 years ago

For anyone facing this or a similar issue who landed on this page, check out this solution on a similar issue. TL;DR: disable IOMMU by changing/adding the line GRUB_CMDLINE_LINUX="iommu=soft" in /etc/default/grub and rebooting. This solved an issue with NCCL that presented the same symptoms for me after upgrading to driver v396.