jade-hpc-gpu / jade-hpc-gpu.github.io

Joint Academic Data Science Endeavour (JADE) is the largest GPU facility in the UK supporting world-leading research in machine learning (and this is the repo that powers its website)
http://www.jade.ac.uk/

RuntimeError: NCCL Error 2 on JADE2 #168

Open mihaela-stoian opened 2 years ago

mihaela-stoian commented 2 years ago

I have a question regarding the behaviour of some jobs submitted to JADE2. After a Slurm job starts, I sometimes get a "RuntimeError: NCCL Error 2: unhandled system error". This happens only on some runs, not all of them. For example, with the code unchanged, I submitted two jobs: one ran without issues, whereas the other terminated with this error.

Traceback (most recent call last):
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/pycharm_sync/LogicGuidedSSL/main.py", line 362, in <module>
    main()
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/pycharm_sync/LogicGuidedSSL/main.py", line 328, in main
    train_SSL_fixed(args, net, train_dataset, ulb_train_dataset, val_dataset)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/pycharm_sync/LogicGuidedSSL/train_SSL_fixed.py", line 411, in train_SSL_fixed
    return train(args, net, mixed_train_dataset, val_dataset)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/pycharm_sync/LogicGuidedSSL/train.py", line 86, in train
    iteration = run_train(args, train_data_loader, net, optimizer, epoch, iteration)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/pycharm_sync/LogicGuidedSSL/train.py", line 201, in run_train
    loss.backward()
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 34, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 45, in forward
    return comm.reduce_add_coalesced(grads_, destination)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/nn/parallel/comm.py", line 143, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/nn/parallel/comm.py", line 96, in reduce_add
    nccl.reduce(inputs, output=result, root=root_index)
  File "/jmain02/home/J2AD009/ttl04/mxs63-ttl04/software/miniconda3/envs/slowfast_road/lib/python3.9/site-packages/torch/cuda/nccl.py", line 90, in reduce
    torch._C._nccl_reduce(inputs, _output, root, op, streams, comms)
RuntimeError: NCCL Error 2: unhandled system error

Could the issue be related to the specific nodes the job is allocated to? The nodes on which this happened are: dgk319, dgk407, dgk204, dgk504, dgk210. I submitted most Slurm jobs with the following specifications: partition=big, nodes=1, time=1-00:00:00, gres=gpu:4.

Is there something I could do to avoid getting this error? Thank you for your consideration!
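For reference, a minimal submission script matching the specifications above, with NCCL's diagnostics enabled so a failing run leaves more context in the job log (the training command itself is a placeholder, not part of the original report):

```shell
#!/bin/bash
#SBATCH --partition=big
#SBATCH --nodes=1
#SBATCH --time=1-00:00:00
#SBATCH --gres=gpu:4

# Surface NCCL's internal warnings/errors in the job's stdout/stderr,
# so an "unhandled system error" comes with the underlying cause.
export NCCL_DEBUG=INFO

# Actual training command goes here, e.g.:
# python main.py
```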

digvijayad commented 2 years ago

Hello, I have been getting the same error. I added NCCL_DEBUG="INFO" to the environment variables, and here is the output.

  0%|          | 0/70 [00:00<?, ?it/s]dgk307:30853:30853 [0] NCCL INFO Bootstrap : Using [0]ib0:172.31.116.107<0> [1]ib1:172.31.116.108<0>
dgk307:30853:30853 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dgk307:30853:30853 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB ; OOB ib0:172.31.116.107<0>
dgk307:30853:30853 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda10.1

dgk307:30853:34442 [7] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
dgk307:30853:34442 [7] NCCL INFO transport/net_ib.cc:448 -> 2
dgk307:30853:34442 [7] NCCL INFO include/net.h:21 -> 2
dgk307:30853:34442 [7] NCCL INFO include/net.h:51 -> 2
dgk307:30853:34442 [7] NCCL INFO init.cc:300 -> 2
dgk307:30853:34442 [7] NCCL INFO init.cc:566 -> 2
dgk307:30853:34442 [7] NCCL INFO init.cc:840 -> 2
dgk307:30853:34442 [7] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34438 [3] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
dgk307:30853:34438 [3] NCCL INFO transport/net_ib.cc:448 -> 2
dgk307:30853:34438 [3] NCCL INFO include/net.h:21 -> 2
dgk307:30853:34438 [3] NCCL INFO include/net.h:51 -> 2
dgk307:30853:34438 [3] NCCL INFO init.cc:300 -> 2
dgk307:30853:34438 [3] NCCL INFO init.cc:566 -> 2
dgk307:30853:34438 [3] NCCL INFO init.cc:840 -> 2
dgk307:30853:34438 [3] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34435 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
dgk307:30853:34435 [0] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34435 [0] NCCL INFO include/socket.h:450 -> 2
dgk307:30853:34435 [0] NCCL INFO bootstrap.cc:134 -> 2
dgk307:30853:34435 [0] NCCL INFO bootstrap.cc:353 -> 2
dgk307:30853:34435 [0] NCCL INFO init.cc:567 -> 2
dgk307:30853:34435 [0] NCCL INFO init.cc:840 -> 2
dgk307:30853:34435 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34437 [2] include/socket.h:421 NCCL WARN Call to recv failed : Broken pipe
dgk307:30853:34437 [2] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34437 [2] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34437 [2] NCCL INFO bootstrap.cc:128 -> 2
dgk307:30853:34437 [2] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34437 [2] NCCL INFO init.cc:567 -> 2
dgk307:30853:34437 [2] NCCL INFO init.cc:840 -> 2
dgk307:30853:34437 [2] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34436 [1] include/socket.h:421 NCCL WARN Call to recv failed : Broken pipe
dgk307:30853:34436 [1] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34436 [1] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34436 [1] NCCL INFO bootstrap.cc:128 -> 2
dgk307:30853:34436 [1] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34436 [1] NCCL INFO init.cc:567 -> 2
dgk307:30853:34436 [1] NCCL INFO init.cc:840 -> 2
dgk307:30853:34436 [1] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34441 [6] include/socket.h:421 NCCL WARN Call to recv failed : Broken pipe
dgk307:30853:34441 [6] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34441 [6] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34441 [6] NCCL INFO bootstrap.cc:128 -> 2
dgk307:30853:34441 [6] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34441 [6] NCCL INFO init.cc:567 -> 2
dgk307:30853:34441 [6] NCCL INFO init.cc:840 -> 2
dgk307:30853:34441 [6] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34439 [4] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
dgk307:30853:34439 [4] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34439 [4] NCCL INFO include/socket.h:450 -> 2
dgk307:30853:34439 [4] NCCL INFO bootstrap.cc:134 -> 2
dgk307:30853:34439 [4] NCCL INFO bootstrap.cc:353 -> 2
dgk307:30853:34439 [4] NCCL INFO init.cc:567 -> 2
dgk307:30853:34439 [4] NCCL INFO init.cc:840 -> 2
dgk307:30853:34439 [4] NCCL INFO group.cc:73 -> 2 [Async thread]

dgk307:30853:34440 [5] include/socket.h:421 NCCL WARN Call to recv failed : Connection reset by peer
dgk307:30853:34440 [5] NCCL INFO include/socket.h:438 -> 2
dgk307:30853:34440 [5] NCCL INFO include/socket.h:444 -> 2
dgk307:30853:34440 [5] NCCL INFO bootstrap.cc:127 -> 2
dgk307:30853:34440 [5] NCCL INFO bootstrap.cc:351 -> 2
dgk307:30853:34440 [5] NCCL INFO init.cc:567 -> 2
dgk307:30853:34440 [5] NCCL INFO init.cc:840 -> 2
dgk307:30853:34440 [5] NCCL INFO group.cc:73 -> 2 [Async thread]
dgk307:30853:30853 [0] NCCL INFO init.cc:906 -> 2

  0%|          | 0/70 [00:23<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 80, in <module>
    train()
  File "train.py", line 51, in train
    flow_gt, conf_gt = flowNet(data_list, epoch)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/project/few-shot-vid2vid/models/models.py", line 95, in forward
    outputs = self.model(*inputs, **kwargs, dummy_bs=self.pad_bs)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 160, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/project/few-shot-vid2vid/models/networks/sync_batchnorm/replicate.py", line 72, in replicate
    modules = super(DataParallelWithCallback, self).replicate(module, device_ids)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/project/few-shot-vid2vid/models/networks/sync_batchnorm/replicate.py", line 26, in replicate
    replicas = super(DataParallel, self).replicate(module, device_ids)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/jmain02/home/J2AD020/qxm07/ddn91-qxm07/mambaforge/envs/fsvid/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
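The `Call to ibv_reg_mr failed` warnings in the log above point at InfiniBand memory registration, which commonly fails when the process's locked-memory limit (`ulimit -l`) is too low. A common workaround (a sketch, not a fix confirmed by the JADE admins) is to disable NCCL's InfiniBand transport so it falls back to plain TCP sockets; for a single-node `gres=gpu:4` job this sidesteps `ibv_reg_mr` entirely at some bandwidth cost:

```python
import os

# Must be set before the first CUDA/NCCL call in the process
# (e.g. at the very top of the training script).
os.environ["NCCL_DEBUG"] = "INFO"    # keep diagnostics on while testing
os.environ["NCCL_IB_DISABLE"] = "1"  # skip the failing InfiniBand path, use sockets
```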