graspnet / graspnet-baseline

Baseline model for "GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping" (CVPR 2020)
https://graspnet.net/

Is multi-GPU operation possible? #109

Closed huamo555 closed 1 month ago

huamo555 commented 1 month ago

I am trying to run this code on multiple GPUs, but I get the following error:

Traceback (most recent call last):
  File "new_train.py", line 188, in <module>
    train(start_epoch)
  File "new_train.py", line 180, in train
    train_one_epoch()
  File "new_train.py", line 146, in train_one_epoch
    end_points = net(batch_data_label)
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data2/gaoyuming/.cache/graspness_implementation-main/graspnet.py", line 73, in forward
    seed_features = self.backbone(mink_input).F # mink_input [BNs(C+3)--> BNs512] feed into the backbone model and obtain the output feature data seed_features
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data2/gaoyuming/.cache/graspness_implementation-main/backbone_resunet14.py", line 94, in forward
    out = self.conv0p1s1(x)
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/MinkowskiEngine/MinkowskiConvolution.py", line 302, in forward
    assert input.D == self.dimension
  File "/data2/gaoyuming/anaconda3/envs/env_n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 948, in __getattr__
    type(self).__name__, name))
AttributeError: 'MinkowskiConvolution' object has no attribute 'dimension'

chenxi-wang commented 1 month ago

Hi, I have not tested DDP training for the GraspNet baseline or GSNet. This error seems to come from MinkowskiEngine as used in the unofficial GSNet implementation. In my experience with other projects, though, training MinkowskiEngine models in DDP mode works fine.
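
For reference, the traceback above passes through torch/nn/parallel/data_parallel.py, which is the nn.DataParallel wrapper, not DDP (DistributedDataParallel). Below is a minimal sketch of what a DDP setup generally looks like in plain PyTorch. It has not been tested with this repository or with MinkowskiEngine; setup_ddp_model, train_one_step, the toy loss, and the default address and port are hypothetical names and values chosen for illustration, and the real network would take the place of the `net` argument.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp_model(net: nn.Module) -> DDP:
    """Join the process group and wrap `net` in DDP (one process per device).

    Hypothetical helper for illustration; not part of graspnet-baseline.
    """
    # A distributed launcher normally sets these variables. The defaults
    # (assumed values) let the sketch also run as a single CPU process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    use_cuda = torch.cuda.is_available()
    if not dist.is_initialized():
        dist.init_process_group(
            backend="nccl" if use_cuda else "gloo",
            rank=rank,
            world_size=world_size,
        )

    if use_cuda:
        # Each process drives exactly one GPU.
        torch.cuda.set_device(local_rank)
        net = net.cuda(local_rank)
        return DDP(net, device_ids=[local_rank])
    return DDP(net)


def train_one_step(model, optimizer, inputs, targets) -> float:
    """Run one optimization step; DDP averages gradients across processes."""
    device = next(model.parameters()).device
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    # Toy mean-squared-error loss, standing in for the real grasp loss.
    loss = (model(inputs) - targets).pow(2).mean()
    loss.backward()  # gradient all-reduce happens during backward
    optimizer.step()
    return loss.item()
```

In this pattern each GPU gets its own process, started by a distributed launcher that sets RANK, WORLD_SIZE and LOCAL_RANK, and each process loads its own shard of the data (for example through torch.utils.data.distributed.DistributedSampler). Whether this avoids the missing 'dimension' attribute reported above is not confirmed anywhere in this thread.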