An error while training

jonghakim35 commented 1 year ago

Hi, thanks for sharing your great work and the code.

While trying to reproduce the results from the paper, the error below occurred during training.

Traceback (most recent call last):
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/data3/jongha/ICCV2023/IterativeSG/train_iterative_model.py", line 51, in main
    return trainer.train()
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
    loss_dict = self.model(data)
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data3/jongha/ICCV2023/IterativeSG/../IterativeSG/modeling/meta_arch/detr.py", line 267, in forward
    loss_dict = self.criterion(output, targets)
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data3/jongha/ICCV2023/IterativeSG/../IterativeSG/modeling/transformer/criterion.py", line 552, in forward
    combined_indices = self.matcher.forward_relation(outputs, targets)
  File "/home/jongha/.conda/envs/sg/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data3/jongha/ICCV2023/IterativeSG/../IterativeSG/modeling/transformer/matcher.py", line 207, in forward_relation
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_boxes))
  File "/data3/jongha/ICCV2023/IterativeSG/../IterativeSG/modeling/transformer/util/box_ops.py", line 52, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Is there any way to fix this error? Thanks.

ShunchiZhang commented 1 year ago

Changing the random seed seems to be a workaround. On my machine (4 x 16 GB V100), training worked with seeds 3 and 4.

But the underlying issue still exists.
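For anyone trying to narrow this down: the failing assertion in generalized_box_iou requires x1 >= x0 and y1 >= y0 for every predicted box after the cxcywh-to-xyxy conversion. Below is a rough, dependency-free sketch of that check, not the repository's code. It assumes (my guess, not confirmed in this thread) that the usual culprits are NaN values in the predicted boxes, since any comparison with NaN is false, or a negative width or height. The helper name diagnose_boxes and the sample boxes are made up for illustration.

```python
import math


def box_cxcywh_to_xyxy(box):
    """Convert one (cx, cy, w, h) box to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def diagnose_boxes(boxes_cxcywh):
    """Hypothetical helper: list (index, reason) for every box that
    would fail the `x1 >= x0 and y1 >= y0` check."""
    problems = []
    for i, box in enumerate(boxes_cxcywh):
        x0, y0, x1, y1 = box_cxcywh_to_xyxy(box)
        if x1 >= x0 and y1 >= y0:
            continue  # this box satisfies the assertion
        # Comparisons involving NaN are always False, so a NaN box
        # fails the check even though it has no "negative" size.
        if any(math.isnan(v) for v in box):
            problems.append((i, "nan"))
        else:
            problems.append((i, "negative size"))
    return problems


# Made-up sample boxes: one valid, one with NaN width, one with negative width.
boxes = [
    (0.5, 0.5, 0.2, 0.2),
    (0.5, 0.5, float("nan"), 0.2),
    (0.5, 0.5, -0.1, 0.2),
]
print(diagnose_boxes(boxes))
```

Running the same kind of check on out_bbox just before the matcher computes the GIoU cost should show whether the predictions have already turned into NaN by the time the assertion fires, which would point to diverging training rather than a bug in box_ops.py.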

siddheshk commented 1 year ago

The instability might be due to training the model from scratch. For faster convergence, please use the DETR weights pretrained on Visual Genome Object Detection. I've updated the README with links to the pretrained weights. I've also provided a link to the final model with α=0.07, β=0.75. Thanks.

ubc-vision / IterativeSG

An error while training #2