tasveerahmad closed this issue 10 months ago
Based on this error, I believe the problem is SyncBN. Set HEADS.NORM to 'GN', 'BN', or ''.
Sorry, setting HEADS.NORM to 'GN', 'BN', or '' in the config file did not fix it. The problem remains the same: the init_process_group error.
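One thing that may be worth checking: the traceback in this thread fails at `self.bn1(x)` in `fcsgg/modeling/backbone/hrnet.py`, inside `torch/nn/modules/batchnorm.py`. That suggests the SyncBN layer is in the backbone rather than the heads, which could explain why changing HEADS.NORM alone had no effect. Below is a minimal sketch, assuming the offending layers are plain `nn.SyncBatchNorm` instances that receive 4D image features. The helper name `revert_sync_batchnorm` is hypothetical and is not part of fcsgg or Detectron2; it swaps every such layer for `nn.BatchNorm2d` after the model is built.

```python
import torch
from torch import nn


def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace nn.SyncBatchNorm layers with nn.BatchNorm2d.

    Hypothetical helper for single-GPU runs; assumes 4D (N, C, H, W) inputs.
    """
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        with torch.no_grad():
            # Carry over learned scale/shift and running statistics, if present.
            if module.affine:
                converted.weight.copy_(module.weight)
                converted.bias.copy_(module.bias)
            if module.track_running_stats:
                converted.running_mean.copy_(module.running_mean)
                converted.running_var.copy_(module.running_var)
                converted.num_batches_tracked.copy_(module.num_batches_tracked)
        converted.train(module.training)
    # Snapshot the children first, then replace each one in place.
    for name, child in list(module.named_children()):
        converted.add_module(name, revert_sync_batchnorm(child))
    return converted
```

If this applies, it would be called once on the built model before training starts, for example `model = revert_sync_batchnorm(model)`. Setting the backbone's norm option in the config, if the repository exposes one, would be the cleaner fix.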
Can you try running it locally? I am not familiar with Google Colab. You can check the Detectron2 documentation and community for more help.
I am having trouble training fcsgg on Google Colab Pro+ with a single GPU. The error report is below. The main error is "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group." When I searched for this error online, the explanation I found was that distributed GPU processing in PyTorch has not been initialized. Could you please guide me on how to run the fcsgg code on Google Colab Pro+?
[05/30 06:00:01 d2.data.common]: Serializing 8 elements to byte tensors and concatenating them all ...
[05/30 06:00:01 d2.data.common]: Serialized dataset takes 0.01 MiB
[05/30 06:00:01 d2.data.build]: Using training sampler TrainingSampler
[05/30 06:00:03 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
[05/30 06:00:03 d2.engine.train_loop]: Starting training from iteration 0
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/image_list.py:99: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  max_size = torch.cat([max_size[:-2], (max_size[-2:] + (stride - 1)) // stride * stride])
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:171: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  starts, ends = -diameters // 2, (diameters + 1) // 2
/usr/local/lib/python3.7/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:266: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  subject_centers = subject_centers // output_stride
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:267: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  object_centers = object_centers // output_stride
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:335: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  valid = (dist_along_rel <= s2o_vector_norms[..., None] / 2 + sigma // 2) \
ERROR [05/30 06:00:07 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 142, in train
    self.run_step()
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 235, in run_step
    loss_dict = self.model(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/meta_arch/onestage_detector.py", line 301, in forward
    features = self.backbone(images.tensor)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/backbone/hrnet.py", line 463, in forward
    x = self.bn1(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[05/30 06:00:08 d2.engine.hooks]: Total training time: 0:00:04 (0:00:00 on hooks)
[05/30 06:00:08 d2.utils.events]: iter: 0 lr: N/A max_mem: 253M
Traceback (most recent call last):
File "tools/train_net.py", line 160, in
args=(args,),
File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/launch.py", line 62, in launch
main_func(args)
File "tools/train_net.py", line 148, in main
return trainer.train()
File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/defaults.py", line 412, in train
super().train(self.start_iter, self.max_iter)
File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 142, in train
self.run_step()
File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 235, in run_step
loss_dict = self.model(data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/meta_arch/onestage_detector.py", line 301, in forward
features = self.backbone(images.tensor)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/backbone/hrnet.py", line 463, in forward
x = self.bn1(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
return _get_group_size(group)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
default_pg = _get_default_group()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
"Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
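For context, the last frames show the batch norm layer calling `torch.distributed.get_world_size`, which needs a default process group, while the `launch` frame calls `main_func(args)` directly, so no group appears to have been created in this single-GPU run. One possible workaround, sketched under the assumption that a one-process group is acceptable, is to create a group with `world_size=1` before training starts. `ensure_default_process_group` is a hypothetical helper, and the `gloo` backend and loopback TCP address are assumptions chosen for illustration (`nccl` would be the usual choice for real multi-GPU training).

```python
import socket

import torch.distributed as dist


def _find_free_port() -> int:
    """Ask the OS for an unused local TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def ensure_default_process_group(backend: str = "gloo") -> bool:
    """Create a single-process default group if none exists.

    Returns True if a group was created, False if distributed support is
    unavailable or a group was already initialized.
    """
    if not dist.is_available() or dist.is_initialized():
        return False
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://127.0.0.1:{_find_free_port()}",
        rank=0,
        world_size=1,  # one process, so there is nothing to synchronize
    )
    return True
```

A call such as `ensure_default_process_group()` would go near the top of the training entry point, before the trainer is built. With a world size of 1 the sync layer should have nothing to synchronize, but replacing SyncBN with an ordinary norm layer is likely the simpler option on a single GPU.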