Region Semantic Models Do Not Work Well

First of all, thanks for the great work.

Image caption and dense caption modules all work fine here, however, the region caption module does not seem work well. I tested both edit_anything and ssa models.

For edit_anything model, it returns obviously wrong object descriptions. The following the the test image I input. And the Region Segment module returns

a dog is walking on the floor in a room: [0, 50, 383, 165]; a person riding a skateboard down a street: [234, 49, 149, 166]; a piece of paper with a black background: [0, 0, 64, 110]; a white light switch with a black light: [312, 0, 53, 80]; the moon is seen over the city skyline: [116, 0, 56, 38];

There are clearly no dogs or skateboard in the picture.

For the ssa model, when I add --region_classify_model ssa option and change region_semantic method to use ssa, the method errors out with

│ /share/data/ripl/fjd/Image2Paragraph/models/segment_models/semantic_segment_anything_model.py:14 │
│ 7 in semantic_class_w_mask                                                                       │
│                                                                                                  │
│   144 │   │   │                                                                                  │
│   145 │   │   │   valid_mask_large_crop = mmcv.imcrop(valid_mask.numpy(), np.array([bbox[0], b   │
│   146 │   │   │   scale_large)                                                                   │
│ ❱ 147 │   │   │   top_1_patch_large = torch.bincount(class_ids_patch_large[torch.tensor(valid_   │
│   148 │   │   │   top_1_mask_category = mask_categories[top_1_patch_large.item()]                │
│   149 │   │   │                                                                                  │
│   150 │   │   │   ann['class_name'] = str(top_1_mask_category)                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: The shape of the mask [3, 23] at index 0 does not match the shape of the indexed tensor [23, 3] at index 0

I wonder you have a good way to use region segment methods.

showlab / Image2Paragraph

Region Semantic Models Do Not Work Well #25