RuntimeError: NCCL Error 2: unhandled system error

waduhekx commented 3 years ago

When I use two GPUs to run main.py to train a model on the sthv2 dataset, I get the error below:

Traceback (most recent call last):
  File "main.py", line 378, in <module>
    main()
  File "main.py", line 194, in main
    train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
  File "main.py", line 244, in train
    output = model(input_var)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, tensors)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error

waduhekx commented 3 years ago

How can I solve this problem, please?

Luffy03 commented 2 years ago

Have you solved the problem? Would you please share your solution? thx
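
Not a fix, but one way to get more detail than "unhandled system error": NCCL reads the documented NCCL_DEBUG environment variable, and setting it to INFO makes it log what it was doing when the failure occurred. Below is a minimal sketch, assuming Python 3. The helper names nccl_debug_env and rerun_with_nccl_debug are made up for illustration and are not part of this repo or of PyTorch.

```python
import os
import subprocess
import sys


def nccl_debug_env(base=None):
    """Return a copy of the environment with verbose NCCL logging enabled.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # NCCL_DEBUG is a documented NCCL variable; INFO prints setup and error details.
    env["NCCL_DEBUG"] = "INFO"
    return env


def rerun_with_nccl_debug(argv):
    """Run the command `argv` (a list of strings) with NCCL logging turned on.

    Returns the exit code of the command. This is a hypothetical helper,
    equivalent to prefixing the command with NCCL_DEBUG=INFO in a shell.
    """
    if not argv:
        raise ValueError("argv must contain the command to run")
    completed = subprocess.run(argv, env=nccl_debug_env())
    return completed.returncode
```

Calling rerun_with_nccl_debug with the same command line as the failing run (the Python interpreter, main.py, and the usual training arguments) should print extra NCCL lines before the RuntimeError, and those lines may show which underlying call failed.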

mit-han-lab / temporal-shift-module

RuntimeError: NCCL Error 2: unhandled system error #198