mihaela-stoian opened 2 years ago
Hello, I have been getting the same error. I set NCCL_DEBUG=INFO in the environment variables, and here is the output:
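For reference, enabling this logging amounts to something like the following in the launch environment (a sketch; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, but the launch command shown is a placeholder, not my actual script):

```shell
# Ask NCCL to print INFO-level diagnostics to stderr.
export NCCL_DEBUG=INFO
# Optionally restrict output to the init and network subsystems to reduce noise.
export NCCL_DEBUG_SUBSYS=INIT,NET

# ...then launch training as usual, e.g.:
# python train.py
```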
0%| | 0/70 [00:00<?, ?it/s]dgk307:30853:30853 [0] NCCL INFO Bootstrap : Using [0]ib0:172.31.116.107<0> [1]ib1:172.31.116.108<0>
dgk307:30853:30853 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dgk307:30853:30853 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB ; OOB ib0:172.31.116.107<0>
dgk307:30853:30853 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda10.1
dgk307:30853:34442 [7] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
dgk307:30853:34442 [7] NCCL INFO transport/net_ib.cc:448 -> 2
dgk307:30853:34442 [7] NCCL INFO include/net.h:21 -> 2
dgk307:30853:34442 [7] NCCL INFO include/net.h:51 -> 2
dgk307:30853:34442 [7] NCCL INFO init.cc:300 -> 2
dgk307:30853:34442 [7] NCCL INFO init.cc:566 -> 2
dgk307:30853:34442 [7] NCCL INFO init.cc:840 -> 2
dgk307:30853:34442 [7] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34438 [3] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
dgk307:30853:34438 [3] NCCL INFO transport/net_ib.cc:448 -> 2
dgk307:30853:34438 [3] NCCL INFO include/net.h:21 -> 2
dgk307:30853:34438 [3] NCCL INFO include/net.h:51 -> 2
dgk307:30853:34438 [3] NCCL INFO init.cc:300 -> 2
dgk307:30853:34438 [3] NCCL INFO init.cc:566 -> 2
dgk307:30853:34438 [3] NCCL INFO init.cc:840 -> 2
dgk307:30853:34438 [3] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34435 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
dgk307:30853:34435 [0] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34435 [0] NCCL INFO include/socket.h:450 -> 2
dgk307:30853:34435 [0] NCCL INFO bootstrap.cc:134 -> 2
dgk307:30853:34435 [0] NCCL INFO bootstrap.cc:353 -> 2
dgk307:30853:34435 [0] NCCL INFO init.cc:567 -> 2
dgk307:30853:34435 [0] NCCL INFO init.cc:840 -> 2
dgk307:30853:34435 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34437 [2] include/socket.h:421 NCCL WARN Call to recv failed : Broken pipe
dgk307:30853:34437 [2] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34437 [2] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34437 [2] NCCL INFO bootstrap.cc:128 -> 2
dgk307:30853:34437 [2] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34437 [2] NCCL INFO init.cc:567 -> 2
dgk307:30853:34437 [2] NCCL INFO init.cc:840 -> 2
dgk307:30853:34437 [2] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34436 [1] include/socket.h:421 NCCL WARN Call to recv failed : Broken pipe
dgk307:30853:34436 [1] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34436 [1] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34436 [1] NCCL INFO bootstrap.cc:128 -> 2
dgk307:30853:34436 [1] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34436 [1] NCCL INFO init.cc:567 -> 2
dgk307:30853:34436 [1] NCCL INFO init.cc:840 -> 2
dgk307:30853:34436 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34441 [6] include/socket.h:421 NCCL WARN Call to recv failed : Broken pipe
dgk307:30853:34441 [6] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34441 [6] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34441 [6] NCCL INFO bootstrap.cc:128 -> 2
dgk307:30853:34441 [6] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34441 [6] NCCL INFO init.cc:567 -> 2
dgk307:30853:34441 [6] NCCL INFO init.cc:840 -> 2
dgk307:30853:34441 [6] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34439 [4] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
dgk307:30853:34439 [4] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34439 [4] NCCL INFO include/socket.h:450 -> 2
dgk307:30853:34439 [4] NCCL INFO bootstrap.cc:134 -> 2
dgk307:30853:34439 [4] NCCL INFO bootstrap.cc:353 -> 2
dgk307:30853:34439 [4] NCCL INFO init.cc:567 -> 2
dgk307:30853:34439 [4] NCCL INFO init.cc:840 -> 2
dgk307:30853:34439 [4] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:34440 [5] include/socket.h:421 NCCL WARN Call to recv failed : Connection reset by peer
dgk307:30853:34440 [5] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34440 [5] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34440 [5] NCCL INFO bootstrap.cc:127 -> 2
dgk307:30853:34440 [5] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34440 [5] NCCL INFO init.cc:567 -> 2
dgk307:30853:34440 [5] NCCL INFO init.cc:840 -> 2
dgk307:30853:34440 [5] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:30853 [0] NCCL INFO init.cc:906 -> 2
0%| | 0/70 [00:23<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 80, in <module>
train()
File "train.py", line 51, in train
flow_gt, conf_gt = flowNet(data_list, epoch)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/project/few-shot-vid2vid/models/models.py", line 95, in forward
outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 160, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/project/few-shot-vid2vid/models/networks/sync_batchnorm/replicate.py", line 72, in replicate
modules = super(DataParallelWithCallback, self).replicate(module, device_ids)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/project/few-shot-vid2vid/models/networks/sync_batchnorm/replicate.py", line 26, in replicate
replicas = super(DataParallel, self).replicate(module, device_ids)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
I have a question regarding the behaviour of some jobs submitted to JADE2. After a Slurm job has started, I sometimes get a "RuntimeError: NCCL Error 2: unhandled system error". This happens only on some runs: for example, with the code unchanged, I submitted two identical jobs; one ran without issues, while the other terminated with this error.
Could the issue be related to the specific nodes the job gets allocated to? The nodes on which this happened are: dgk319, dgk407, dgk204, dgk504, dgk210. I submitted most Slurm jobs with the following specifications: partition=big, nodes=1, time=1-00:00:00, gres=gpu:4.
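Concretely, those specifications correspond to an sbatch header along these lines (a sketch for clarity; the script body and file names are placeholders, not my actual job script):

```shell
#!/bin/bash
# Illustrative Slurm header matching the specs above.
#SBATCH --partition=big
#SBATCH --nodes=1
#SBATCH --time=1-00:00:00
#SBATCH --gres=gpu:4

# Placeholder launch command with NCCL debugging enabled.
NCCL_DEBUG=INFO python train.py
```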
Is there something I could do to avoid this error? Thank you for your consideration!