It seems the problem is due to instability in the early stages of training. I have noticed similar errors pop up when using smaller batch sizes, though they were not very common.
Increasing the batch size might resolve the problem, but if you’re short on GPU resources, you could try using gradient accumulation to simulate a larger batch size.
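For reference, here is a minimal sketch of gradient accumulation in plain PyTorch. The model, optimizer, and data loader below are toy stand-ins (not the actual SpeaQ/detectron2 trainer); the same logic would need to go inside the trainer's `run_step`:

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs on its own.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data_loader = [torch.randn(2, 4) for _ in range(100)]  # micro-batches of size 2

ACCUM_STEPS = 10  # 10 micro-batches of size 2 ~ effective batch size of 20

optimizer.zero_grad()
for step, batch in enumerate(data_loader):
    loss = model(batch).pow(2).mean()   # placeholder loss
    (loss / ACCUM_STEPS).backward()     # scale so the summed gradients match one large batch
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                # update weights once per effective batch
        optimizer.zero_grad()
```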
OK, thanks a lot for the quick advice. I will try it and see whether the error happens again.
Hi, thanks for your great work, I really appreciate it. However, when I tried to reproduce it, I found that I only have a single RTX 2080 Ti GPU rather than the 4 NVIDIA RTX 3090 GPUs used in the paper. Is it still possible to reproduce the reported performance? I ran a baseline training, but an error sometimes occurred during training or testing. I followed all the instructions, except that I multiplied all iteration numbers by 10 to compensate for my batch size being only 2 instead of the 20 described in the paper (see the scaling sketch after the log below):
```
[10/20 13:45:13 d2.utils.events]: eta: 3 days, 22:34:55 iter: 762999 total_loss: 235.6 loss_ce_subject: 0.5631 loss_bbox_subject: 0.5984 loss_giou_subject: 0.6384 loss_ce_object: 0.5968 loss_bbox_object: 0.9082 loss_giou_object: 0.7517 loss_relation: 0.88 loss_bbox_relation: 0.5787 loss_giou_relation: 1.399 loss_ce_subject_0: 0.6388 loss_bbox_subject_0: 0.7614 loss_giou_subject_0: 0.7747 loss_ce_object_0: 0.6303 loss_bbox_object_0: 0.9505 loss_giou_object_0: 0.8287 loss_relation_0: 0.9154 loss_bbox_relation_0: 0.7139 loss_giou_relation_0: 1.476 loss_ce_subject_1: 0.6042 loss_bbox_subject_1: 0.6771 loss_giou_subject_1: 0.6987 loss_ce_object_1: 0.6119 loss_bbox_object_1: 0.9434 loss_giou_object_1: 0.7924 loss_relation_1: 0.9057 loss_bbox_relation_1: 0.6494 loss_giou_relation_1: 1.443 loss_ce_subject_2: 0.6079 loss_bbox_subject_2: 0.6239 loss_giou_subject_2: 0.6446 loss_ce_object_2: 0.6446 loss_bbox_object_2: 0.9493 loss_giou_object_2: 0.8187 loss_relation_2: 0.8978 loss_bbox_relation_2: 0.6308 loss_giou_relation_2: 1.448 loss_ce_subject_3: 0.5889 loss_bbox_subject_3: 0.5996 loss_giou_subject_3: 0.6473 loss_ce_object_3: 0.6157 loss_bbox_object_3: 0.9193 loss_giou_object_3: 0.7686 loss_relation_3: 0.9056 loss_bbox_relation_3: 0.6214 loss_giou_relation_3: 1.439 loss_ce_subject_4: 0.5663 loss_bbox_subject_4: 0.6001 loss_giou_subject_4: 0.6327 loss_ce_object_4: 0.6133 loss_bbox_object_4: 0.8831 loss_giou_object_4: 0.7438 loss_relation_4: 0.9045 loss_bbox_relation_4: 0.5826 loss_giou_relation_4: 1.412 time: 0.4729 data_time: 0.0044 lr: 0.0001 max_mem: 6428M
Box1 tensor([[0.4292, 0.2617, 0.5044, 0.6411], [0.8760, 0.3628, 0.9570, 0.7251], [0.4785, 0.3682, 0.9717, 0.6631], ..., [ nan, nan, nan, nan], [ nan, nan, nan, nan], [ nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16) tensor([], device='cuda:0', dtype=torch.float16) tensor([], device='cuda:0', dtype=torch.float16) tensor([], device='cuda:0', size=(0, 2), dtype=torch.int64) tensor(nan, device='cuda:0', dtype=torch.float16) tensor(nan, device='cuda:0', dtype=torch.float16) tensor([[300, 0], [300, 1], [300, 2], ..., [599, 1], [599, 2], [599, 3]], device='cuda:0')
ERROR [10/20 13:46:46 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/root/autodl-tmp/conda/envs/SpeaQ/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/root/autodl-tmp/conda/envs/SpeaQ/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/root/autodl-tmp/conda/envs/SpeaQ/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
    loss_dict = self.model(data)
  File "/root/autodl-tmp/conda/envs/SpeaQ/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/SpeaQ/../SpeaQ/modeling/meta_arch/detr.py", line 274, in forward
    loss_dict = self.criterion(output, targets)
  File "/root/autodl-tmp/conda/envs/SpeaQ/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/SpeaQ/../SpeaQ/modeling/transformer/criterion.py", line 567, in forward
    combined_indices, k_mean_log, augmented_targets = self.matcher.forward_relation(outputs, targets, layer_num=n_layers)
  File "/root/autodl-tmp/conda/envs/SpeaQ/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/root/autodl-tmp/SpeaQ/../SpeaQ/modeling/transformer/matcher.py", line 565, in forward_relation
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_boxes))
  File "/root/autodl-tmp/SpeaQ/../SpeaQ/modeling/transformer/util/box_ops.py", line 52, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
```
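For reference, this is roughly the iteration scaling I applied, written against plain detectron2 solver keys. It is only a sketch; the actual SpeaQ config files and any project-specific keys are not shown, and the default values below come from detectron2, not from the paper:

```python
from detectron2.config import get_cfg

REFERENCE_BATCH = 20   # batch size used in the paper (4x RTX 3090)
MY_BATCH = 2           # what fits on a single 11 GB GPU
SCALE = REFERENCE_BATCH // MY_BATCH  # = 10

cfg = get_cfg()
# cfg.merge_from_file(...)  # the real run loads the SpeaQ config first
cfg.SOLVER.IMS_PER_BATCH = MY_BATCH
cfg.SOLVER.MAX_ITER = cfg.SOLVER.MAX_ITER * SCALE              # 10x iterations to see the same number of images
cfg.SOLVER.STEPS = tuple(s * SCALE for s in cfg.SOLVER.STEPS)  # shift LR-decay milestones by the same factor
# The learning rate is left untouched here; only the iteration counts were changed.
```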
As for the error itself, it seems that some of the bbox predictions contain many 'nan' values. Have you encountered the same error?
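For context, the assertion that fails is the validity check at the top of `generalized_box_iou`, and any NaN in the predicted boxes makes it fail, since comparisons against NaN evaluate to False. Below is a minimal sketch that reproduces this, plus one possible guard; the guard is only a suggestion of mine, not something taken from the SpeaQ code:

```python
import torch

def box_cxcywh_to_xyxy(b):
    # same conversion applied before generalized_box_iou: (cx, cy, w, h) -> (x0, y0, x1, y1)
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1)

out_bbox = torch.tensor([[0.5, 0.5, 0.2, 0.3],   # a healthy prediction
                         [float('nan')] * 4])    # a NaN prediction, as in the log above

boxes1 = box_cxcywh_to_xyxy(out_bbox)
print((boxes1[:, 2:] >= boxes1[:, :2]).all())    # tensor(False): any NaN row trips the assert

# One possible (hypothetical) guard before computing the GIoU cost:
finite = torch.isfinite(boxes1).all(dim=-1)
print(boxes1[finite])                            # keep only finite boxes, or skip/clamp the step
```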