shuheikurita / RefEgo


Can't Reproduce MDETR Results on RefEgo Dataset #5

Open liujunzhuo opened 8 months ago

liujunzhuo commented 8 months ago

Description: I have encountered an issue while training the MDETR model on the RefEgo dataset, using the provided training script and checkpoints from the MDETR repository. The training process exhibits abnormal behavior, as evidenced by unusual losses and accuracy metrics. After investigating thoroughly, I think the problem may be related to the RefEgo dataset itself, as documented in issue #3.

Reproducible Steps:

  1. Clone the RefEgo repository.

  2. Download the datasets, annotations, and checkpoints.

  3. Execute the following training command (a quick check that the checkpoint loads correctly is sketched after these steps):

    python -m torch.distributed.launch --nproc_per_node=2 --use_env \
    main.py --dataset_config configs/refego.json --batch_size 4 \
    --backbone timm_tf_efficientnet_b3_ns \
    --output-dir ./logs/refego_mdetr_conventional \
    --ema \
    --num_workers 8 \
    --load ./data/models/refcoco_EB3_checkpoint.pth
  4. Observe abnormal training logs, including high losses and poor accuracy metrics over multiple epochs.
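
Before step 4, it can help to confirm that the downloaded checkpoint actually loads and contains the expected weights. The sketch below is only a sanity check, not part of the original setup; the key names ("model", "model_ema") follow MDETR's checkpoint format as far as I can tell, and the path is the one used in the command above.

# Sanity check (a sketch, not part of the original steps): confirm the RefCOCO
# EB3 checkpoint loads and contains the expected state dicts before training.
import torch

ckpt = torch.load("./data/models/refcoco_EB3_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))                        # typically includes "model" (and "model_ema")
state = ckpt.get("model_ema") or ckpt["model"]  # fall back to "model" if EMA weights are absent
n_params = sum(v.numel() for v in state.values())
print(f"parameters in checkpoint: {n_params}")  # should be roughly the ~152.6M n_parameters reported in the logs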

Training Log Example (Last Epoch):

{"train_lr": 1.000000000000675e-05, "train_lr_backbone": 1.000000000000523e-06, "train_lr_text_encoder": 5.050627595539173e-06, "train_loss": 78.30173288709449, "train_loss_bbox": 0.27182210808759033, "train_loss_bbox_0": 0.29860074482406446, "train_loss_bbox_1": 0.28780467327516807, "train_loss_bbox_2": 0.28531533927775127, "train_loss_bbox_3": 0.2793179191932046, "train_loss_bbox_4": 0.27675076348995115, "train_loss_ce": 7.304420197837443, "train_loss_ce_0": 8.011982987034141, "train_loss_ce_1": 7.906185426281401, "train_loss_ce_2": 7.6297045538092005, "train_loss_ce_3": 7.441227416940423, "train_loss_ce_4": 7.3022214620186885, "train_loss_contrastive_align": 3.972990080636786, "train_loss_contrastive_align_0": 4.982065125579495, "train_loss_contrastive_align_1": 4.748782848893309, "train_loss_contrastive_align_2": 4.473072875582713, "train_loss_contrastive_align_3": 4.227163594424938, "train_loss_contrastive_align_4": 4.020980517791813, "train_loss_giou": 0.7350071375131029, "train_loss_giou_0": 0.8038724736468541, "train_loss_giou_1": 0.7747753153828608, "train_loss_giou_2": 0.7655554380203755, "train_loss_giou_3": 0.7550705310399104, "train_loss_giou_4": 0.7470433832667334, "train_cardinality_error_unscaled": 1.442574200254071, "train_cardinality_error_0_unscaled": 2.717037186742118, "train_cardinality_error_1_unscaled": 2.3767553990068135, "train_cardinality_error_2_unscaled": 2.1197251414713016, "train_cardinality_error_3_unscaled": 1.862859452592678, "train_cardinality_error_4_unscaled": 1.5889392539554221, "train_loss_bbox_unscaled": 0.054364421616115544, "train_loss_bbox_0_unscaled": 0.05972014896959696, "train_loss_bbox_1_unscaled": 0.05756093464827913, "train_loss_bbox_2_unscaled": 0.05706306784735884, "train_loss_bbox_3_unscaled": 0.05586358382912872, "train_loss_bbox_4_unscaled": 0.05535015270117387, "train_loss_ce_unscaled": 7.304420197837443, "train_loss_ce_0_unscaled": 8.011982987034141, "train_loss_ce_1_unscaled": 7.906185426281401, "train_loss_ce_2_unscaled": 7.6297045538092005, "train_loss_ce_3_unscaled": 7.441227416940423, "train_loss_ce_4_unscaled": 7.3022214620186885, "train_loss_contrastive_align_unscaled": 3.972990080636786, "train_loss_contrastive_align_0_unscaled": 4.982065125579495, "train_loss_contrastive_align_1_unscaled": 4.748782848893309, "train_loss_contrastive_align_2_unscaled": 4.473072875582713, "train_loss_contrastive_align_3_unscaled": 4.227163594424938, "train_loss_contrastive_align_4_unscaled": 4.020980517791813, "train_loss_giou_unscaled": 0.36750356875655144, "train_loss_giou_0_unscaled": 0.40193623682342705, "train_loss_giou_1_unscaled": 0.3873876576914304, "train_loss_giou_2_unscaled": 0.38277771901018776, "train_loss_giou_3_unscaled": 0.3775352655199552, "train_loss_giou_4_unscaled": 0.3735216916333667, "test_refego_loss": 139.12848654223927, "test_refego_loss_bbox": 0.5160502417770455, "test_refego_loss_bbox_0": 0.5127308963977238, "test_refego_loss_bbox_1": 0.5578339469033827, "test_refego_loss_bbox_2": 0.5505417347118988, "test_refego_loss_bbox_3": 0.528432327205659, "test_refego_loss_bbox_4": 0.5215435248158456, "test_refego_loss_ce": 13.302537113310224, "test_refego_loss_ce_0": 13.786419380119302, "test_refego_loss_ce_1": 13.363976683123605, "test_refego_loss_ce_2": 13.191718044279055, "test_refego_loss_ce_3": 13.210718351323678, "test_refego_loss_ce_4": 13.264219968080065, "test_refego_loss_contrastive_align": 7.968821568031165, "test_refego_loss_contrastive_align_0": 8.623145409209913, "test_refego_loss_contrastive_align_1": 
8.331567162284259, "test_refego_loss_contrastive_align_2": 8.147967129307512, "test_refego_loss_contrastive_align_3": 8.088242396523269, "test_refego_loss_contrastive_align_4": 7.98730907007562, "test_refego_loss_giou": 1.0855601225376064, "test_refego_loss_giou_0": 1.1131093147453548, "test_refego_loss_giou_1": 1.1419892626579649, "test_refego_loss_giou_2": 1.1276014636307161, "test_refego_loss_giou_3": 1.1115133605014251, "test_refego_loss_giou_4": 1.0949381509824496, "test_refego_cardinality_error_unscaled": 1.2291638248533625, "test_refego_cardinality_error_0_unscaled": 2.6737655163006413, "test_refego_cardinality_error_1_unscaled": 2.1714806984040376, "test_refego_cardinality_error_2_unscaled": 1.9240553812576728, "test_refego_cardinality_error_3_unscaled": 1.7368196698949665, "test_refego_cardinality_error_4_unscaled": 1.3628256718046652, "test_refego_loss_bbox_unscaled": 0.10321004827097902, "test_refego_loss_bbox_0_unscaled": 0.10254617936222173, "test_refego_loss_bbox_1_unscaled": 0.1115667893844877, "test_refego_loss_bbox_2_unscaled": 0.110108346940525, "test_refego_loss_bbox_3_unscaled": 0.10568646537939087, "test_refego_loss_bbox_4_unscaled": 0.1043087050028815, "test_refego_loss_ce_unscaled": 13.302537113310224, "test_refego_loss_ce_0_unscaled": 13.786419380119302, "test_refego_loss_ce_1_unscaled": 13.363976683123605, "test_refego_loss_ce_2_unscaled": 13.191718044279055, "test_refego_loss_ce_3_unscaled": 13.210718351323678, "test_refego_loss_ce_4_unscaled": 13.264219968080065, "test_refego_loss_contrastive_align_unscaled": 7.968821568031165, "test_refego_loss_contrastive_align_0_unscaled": 8.623145409209913, "test_refego_loss_contrastive_align_1_unscaled": 8.331567162284259, "test_refego_loss_contrastive_align_2_unscaled": 8.147967129307512, "test_refego_loss_contrastive_align_3_unscaled": 8.088242396523269, "test_refego_loss_contrastive_align_4_unscaled": 7.98730907007562, "test_refego_loss_giou_unscaled": 0.5427800612688032, "test_refego_loss_giou_0_unscaled": 0.5565546573726774, "test_refego_loss_giou_1_unscaled": 0.5709946313289824, "test_refego_loss_giou_2_unscaled": 0.5638007318153581, "test_refego_loss_giou_3_unscaled": 0.5557566802507126, "test_refego_loss_giou_4_unscaled": 0.5474690754912248, "test_refego_coco_eval_bbox": [0.07502982787742234, 0.1751281739735873, 0.053993559869210596, 0.004996312768375743, 0.06685121168626103, 0.14618921694905845, 0.0904576768296842, 0.23530454948502827, 0.3136808539663051, 0.11416459884201821, 0.2918827322979732, 0.4532020234635669], "epoch": 4, "n_parameters": 152581036}

Custom RefEgoEvaluator: I have implemented a custom RefEgoEvaluator to measure the model's performance on the RefEgo dataset. This evaluator has been tested with the RefCOCO checkpoint and produces the expected results.

# Imports added for completeness; the module paths (util.dist, util.box_ops)
# follow MDETR's repository layout, on which this evaluator is based
# (cf. RefExpEvaluator in datasets/refexp.py).
import copy

import torch
import torchvision
from tqdm import tqdm

import util.dist as dist
from util.box_ops import generalized_box_iou


class RefEgoEvaluator(object):
    def __init__(self, refexp_gt, iou_types, k=(1, 5, 10), thresh_iou=0.5):
        assert isinstance(k, (list, tuple))
        refexp_gt = copy.deepcopy(refexp_gt)
        self.refexp_gt = refexp_gt
        self.iou_types = iou_types
        self.img_ids = self.refexp_gt.imgs.keys()
        self.predictions = {}
        self.k = k
        self.thresh_iou = thresh_iou

    def accumulate(self):
        pass

    def update(self, predictions):
        self.predictions.update(predictions)

    def synchronize_between_processes(self):
        all_predictions = dist.all_gather(self.predictions)
        merged_predictions = {}
        for p in all_predictions:
            merged_predictions.update(p)
        self.predictions = merged_predictions

    def summarize(self):
        if dist.is_main_process():
            scores = {k: 0.0 for k in self.k}
            dataset2count = 0.0
            miou = 0
            self.img_ids = self.predictions.keys()
            for image_id in tqdm(self.img_ids):
                ann_ids = self.refexp_gt.getAnnIds(imgIds=image_id)
                # assert len(ann_ids) == 1
                img_info = self.refexp_gt.loadImgs(image_id)[0]

                target = self.refexp_gt.loadAnns(ann_ids[0])
                prediction = self.predictions[image_id]
                assert prediction is not None
                # Rank predicted boxes by confidence, highest first.
                sorted_scores_boxes = sorted(
                    zip(prediction["scores"].tolist(), prediction["boxes"].tolist()), reverse=True
                )
                sorted_scores, sorted_boxes = zip(*sorted_scores_boxes)
                sorted_boxes = torch.cat([torch.as_tensor(x).view(1, 4) for x in sorted_boxes])
                # Convert the COCO-format GT box [x, y, w, h] to [x1, y1, x2, y2].
                target_bbox = target[0]["bbox"]
                converted_bbox = [
                    target_bbox[0],
                    target_bbox[1],
                    target_bbox[2] + target_bbox[0],
                    target_bbox[3] + target_bbox[1],
                ]
                # mIoU is computed from the top-1 box only.
                miou += torchvision.ops.box_iou(sorted_boxes[0].view(-1, 4), torch.as_tensor(converted_bbox).view(-1, 4))
                giou = generalized_box_iou(sorted_boxes, torch.as_tensor(converted_bbox).view(-1, 4))
                # Precision@k: count a hit if any of the top-k boxes reaches the threshold.
                for k in self.k:
                    if max(giou[:k]) >= self.thresh_iou:
                        scores[k] += 1.0
                dataset2count += 1.0

            for k in self.k:
                try:
                    scores[k] /= dataset2count
                except ZeroDivisionError:
                    pass
            miou = miou / dataset2count
            results = {}
            results["precision"] = scores.values()
            results["miou"] = float(miou)
            print(f"Precision @ 1, 5, 10: {scores.values()}, mIoU: {miou} \n")

            return results
        return None
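
For reference, a minimal self-contained smoke test of the evaluator on synthetic data (a sketch of my own, not RefEgo data): one image, one ground-truth box in COCO [x, y, w, h] format, and two predicted boxes, so that Precision@k and mIoU can be verified by hand. It runs outside torch.distributed, where dist.is_main_process() is true.

# Sketch only: synthetic one-image ground truth, used to smoke-test RefEgoEvaluator.
import torch
from pycocotools.coco import COCO

gt = {
    "images": [{"id": 1, "width": 640, "height": 480, "file_name": "dummy.jpg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [100, 100, 50, 50], "area": 2500, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "object"}],
}
coco_gt = COCO()            # build an in-memory COCO index from the dict above
coco_gt.dataset = gt
coco_gt.createIndex()

evaluator = RefEgoEvaluator(coco_gt, iou_types=("bbox",))
evaluator.update({
    1: {  # boxes are [x1, y1, x2, y2] in absolute pixels, as MDETR's "bbox" postprocessor outputs
        "scores": torch.tensor([0.9, 0.1]),
        "boxes": torch.tensor([[100.0, 100.0, 150.0, 150.0],   # exact match with the GT box
                               [300.0, 300.0, 350.0, 350.0]]), # clear miss
    }
})
results = evaluator.summarize()  # expected: precision 1.0 at k = 1, 5, 10 and mIoU = 1.0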

Issue with RefEgo Dataset: While inspecting the training process, I observed that the model trained on RefEgo does not perform as expected, unlike the same setup on RefCOCO. I suspect a problem with the RefEgo dataset itself.

Validation and Training Set Evaluation Results: The following are the results on the validation set (some entries omitted):

{"test_refego_loss": 122.1585885467779, "test_refego_loss_ce": 10.903867366661222,"test_refego_precision": "dict_values([0.20349907918968693, 0.321601527863038, 0.3740536116226724])", "test_refego_miou": 0.2171, "n_parameters": 152581036}

I even conducted an evaluation on the training set:

{"test_refego_loss": 94.43644996643066, "test_refego_loss_ce": 7.773279542922974, "test_refego_loss_bbox": 0.5113364785909653, "test_refego_loss_giou": 1.248047850728035, "test_refego_loss_contrastive_align": 5.031788594722748, "test_refego_precision": "dict_values([0.20125, 0.3425, 0.395])", "test_refego_miou": 0.24258069694042206, "n_parameters": 152581036}

Conclusion: The problem may lie in the training code or in the RefEgo dataset. Since the same code yields normal results on the RefCOCOg dataset, an anomaly in the RefEgo dataset seems the more likely cause. I have raised a corresponding issue (#3), titled "Extracted images do not match provided bounding box annotations." I kindly request your expertise and assistance in resolving this matter. Thank you for your attention and support.
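
For reference, here is a minimal sketch of how the suspected image/annotation misalignment can be spot-checked visually. The annotation and image paths below are placeholders for illustration and must be adjusted to the actual RefEgo download; the only assumption is the COCO-style annotation layout with [x, y, w, h] boxes used above.

# Sketch: overlay the annotated box on the extracted frame to check alignment.
from pathlib import Path
from PIL import Image, ImageDraw
from pycocotools.coco import COCO

ann_file = Path("data/refego/annotations/refego_val.json")  # placeholder path
image_dir = Path("data/refego/images")                      # placeholder path

coco = COCO(str(ann_file))
img_id = next(iter(coco.imgs))                 # inspect the first image; loop over more as needed
img_info = coco.loadImgs(img_id)[0]            # assumes COCO-style "file_name" entries
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))

image = Image.open(image_dir / img_info["file_name"]).convert("RGB")
draw = ImageDraw.Draw(image)
for ann in anns:
    x, y, w, h = ann["bbox"]                   # COCO bbox is [x, y, w, h]
    draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
image.save("refego_bbox_check.png")            # the drawn box should cover the referred object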

Xuchen-Li commented 1 month ago

Hello, I encountered the same error, have you solved the problem?

liujunzhuo commented 1 month ago

> Hello, I encountered the same error, have you solved the problem?

No, I have not been able to solve the dataset issue.

Xuchen-Li commented 1 month ago

> > Hello, I encountered the same error, have you solved the problem?
>
> No, I have not been able to solve the dataset issue.

Thanks!