liuhengyue / fcsgg

A PyTorch implementation for the paper: Fully Convolutional Scene Graph Generation, CVPR 2021
MIT License

Issue for training Fully convolutional SceneGraph on GoogleColab Pro+ using single GPU #6

Closed tasveerahmad closed 10 months ago

tasveerahmad commented 2 years ago

I am having trouble training fcsgg on Google Colab Pro+ using a single GPU. The error report is below. The main error is "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group." When I searched for this error online, the explanation was that distributed GPU processing in PyTorch has not been initialized. Could you please guide me on how to run the fcsgg code on Google Colab Pro+?

[05/30 06:00:01 d2.data.common]: Serializing 8 elements to byte tensors and concatenating them all ...
[05/30 06:00:01 d2.data.common]: Serialized dataset takes 0.01 MiB
[05/30 06:00:01 d2.data.build]: Using training sampler TrainingSampler
[05/30 06:00:03 fvcore.common.checkpoint]: No checkpoint found. Initializing model from scratch
[05/30 06:00:03 d2.engine.train_loop]: Starting training from iteration 0
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/boxes.py:151: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:210.)
  tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/structures/image_list.py:99: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  max_size = torch.cat([max_size[:-2], (max_size[-2:] + (stride - 1)) // stride * stride])
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:171: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  starts, ends = -diameters // 2, (diameters + 1) // 2
/usr/local/lib/python3.7/dist-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:266: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  subject_centers = subject_centers // output_stride
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:267: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  object_centers = object_centers // output_stride
/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/data/detection_utils.py:335: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  valid = (dist_along_rel <= s2o_vector_norms[..., None] / 2 + sigma // 2) \
ERROR [05/30 06:00:07 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 142, in train
    self.run_step()
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 235, in run_step
    loss_dict = self.model(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/meta_arch/onestage_detector.py", line 301, in forward
    features = self.backbone(images.tensor)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/backbone/hrnet.py", line 463, in forward
    x = self.bn1(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[05/30 06:00:08 d2.engine.hooks]: Total training time: 0:00:04 (0:00:00 on hooks)
[05/30 06:00:08 d2.utils.events]: iter: 0 lr: N/A max_mem: 253M
Traceback (most recent call last):
  File "tools/train_net.py", line 160, in <module>
    args=(args,),
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "tools/train_net.py", line 148, in main
    return trainer.train()
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/defaults.py", line 412, in train
    super().train(self.start_iter, self.max_iter)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 142, in train
    self.run_step()
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/detectron2/detectron2/engine/train_loop.py", line 235, in run_step
    loss_dict = self.model(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/meta_arch/onestage_detector.py", line 301, in forward
    features = self.backbone(images.tensor)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/tasveer_exp/SceneGraph/Fully_Conv_SceneGraph/fcsgg/fcsgg/modeling/backbone/hrnet.py", line 463, in forward
    x = self.bn1(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

liuhengyue commented 2 years ago

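Based on this error, I believe the problem is SyncBN: the traceback shows the batch norm layer querying the distributed world size, which fails when no process group has been initialized. Set HEADS.NORM to 'GN', 'BN', or ''.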
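One way to check where SyncBN layers actually sit in a model is to list every submodule of type `nn.SyncBatchNorm`. The sketch below is only an illustration: `find_sync_bn`, `ToyBackbone`, `ToyHead`, and `ToyDetector` are hypothetical names that stand in for the real fcsgg model, which would be passed to the helper instead.

```python
import torch.nn as nn


def find_sync_bn(model: nn.Module) -> list:
    """Return the dotted names of all nn.SyncBatchNorm submodules (hypothetical helper)."""
    return [name for name, m in model.named_modules() if isinstance(m, nn.SyncBatchNorm)]


# Hypothetical toy modules; they only mimic a backbone/head split.
class ToyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3)
        self.bn1 = nn.SyncBatchNorm(8)  # SyncBN inside the backbone


class ToyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(8, 8, 3)
        self.norm = nn.GroupNorm(2, 8)  # head already uses GroupNorm


class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = ToyBackbone()
        self.head = ToyHead()


toy_detector = ToyDetector()
sync_bn_names = find_sync_bn(toy_detector)
print(sync_bn_names)  # ['backbone.bn1']
```

If the list for the real model still contains backbone entries after changing the head setting, the backbone normalization would need to be changed as well.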

tasveerahmad commented 2 years ago

Sorry, setting HEADS.NORM to 'GN', 'BN', or '' in the config file did not fix it. The problem remains the same init_process_group error.
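The traceback above fails in `hrnet.py` at `self.bn1`, which is called from `self.backbone`, so changing the head normalization alone may not reach that layer. A possible workaround for single-GPU runs, not confirmed by the repository author, is to swap every `nn.SyncBatchNorm` for a plain `nn.BatchNorm2d` after the model is built. The helper name `revert_sync_batchnorm` and the toy model are assumptions made for this sketch.

```python
import torch
import torch.nn as nn


def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace nn.SyncBatchNorm layers with nn.BatchNorm2d (hypothetical helper)."""
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        # Carry over learned parameters and running statistics.
        if module.affine:
            with torch.no_grad():
                converted.weight.copy_(module.weight)
                converted.bias.copy_(module.bias)
        if module.track_running_stats:
            converted.running_mean.copy_(module.running_mean)
            converted.running_var.copy_(module.running_var)
            converted.num_batches_tracked.copy_(module.num_batches_tracked)
    # Convert children and re-attach them under the same attribute names.
    for name, child in module.named_children():
        setattr(converted, name, revert_sync_batchnorm(child))
    return converted


# Toy stand-in for the real model: conv -> SyncBN -> ReLU.
toy_model = nn.Sequential(nn.Conv2d(3, 4, 3), nn.SyncBatchNorm(4), nn.ReLU())
toy_model = revert_sync_batchnorm(toy_model)
toy_out = toy_model(torch.randn(2, 3, 8, 8))  # forward pass with no process group
```

With the real model, the call would go right after model construction and before training starts. Plain batch norm computes statistics per process, which matches SyncBN only when a single GPU is used.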

liuhengyue commented 2 years ago

Can you try running it locally? I am not familiar with Google Colab. You can check the Detectron2 documentation and community for more help.