Closed — AndongDeng closed this issue 1 year ago.
Our current codebase does not support multiple GPU training. Please refer to #38 for more details.
Got it. Thanks.
When I set the config as follows to perform multi-GPU training: `"devices": ['cuda:0', 'cuda:1', 'cuda:2', 'cuda:3']`
I got the following error:
```
Traceback (most recent call last):
  File "train.py", line 178, in <module>
    main(args)
  File "train.py", line 124, in main
    train_one_epoch(
  File "/home/dengandong/Research/actionformer_release/libs/utils/train_utils.py", line 277, in train_one_epoch
    losses = model(video_list)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengandong/Research/actionformer_release/libs/modeling/meta_archs.py", line 339, in forward
    feats, masks = self.backbone(batched_inputs, batched_masks)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengandong/Research/actionformer_release/libs/modeling/backbones.py", line 130, in forward
    x, mask = self.embd[idx](x, mask)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengandong/Research/actionformer_release/libs/modeling/blocks.py", line 46, in forward
    out_conv = self.conv(x)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/dengandong/anaconda3/envs/gtad_pt110/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 294, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [512, 2048, 3], expected input[2, 512, 2304] to have 2048 channels, but got 512 channels instead
```
This is odd: the feature dimension appears to be divided by the number of GPUs (2048 / 4 = 512).
Similarly, the dimension becomes 1024 (2048 / 2) when I use 2 GPUs.
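For what it's worth, the shape arithmetic matches how `nn.DataParallel` scatters inputs: each input tensor is chunked along dim 0, one chunk per device, so a tensor whose leading dimension is channels rather than batch gets sheared. A minimal CPU-only sketch (not the repo's code; `torch.chunk` follows the same splitting rule as the scatter step) reproducing the numbers:

```python
import torch

# Hypothetical illustration: DataParallel splits each input tensor along
# dim 0 into one chunk per GPU. If dim 0 is the channel dimension (2048)
# instead of the batch dimension, each replica sees 2048 // n_gpus channels.
n_gpus = 4
feats = torch.randn(2048, 2304)              # (channels, time) -- no leading batch dim
per_gpu = torch.chunk(feats, n_gpus, dim=0)  # same splitting rule as scatter

print(per_gpu[0].shape)  # torch.Size([512, 2304]) -> the 512 channels in the error
print(torch.chunk(feats, 2, dim=0)[0].shape)  # torch.Size([1024, 2304]) with 2 GPUs
```

This would explain why the conv layer, whose weight expects 2048 input channels, receives 512 channels on 4 GPUs and 1024 on 2 GPUs.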