When I use two GPUs to run main.py to train a model on the sthv2 dataset, I get the error below:
Traceback (most recent call last):
File "main.py", line 378, in <module>
main()
File "main.py", line 194, in main
train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
File "main.py", line 244, in train
output = model(input_var)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, tensors)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error