showlab / Image2Paragraph

[A toolbox for fun.] Transform Image into Unique Paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, ControlNet.
Apache License 2.0
789 stars 53 forks

Region Semantic Models Do Not Work Well #25

Open jiadingfang opened 1 year ago

jiadingfang commented 1 year ago

First of all, thanks for the great work.

The image caption and dense caption modules both work fine here; however, the region caption module does not seem to work well. I tested both the edit_anything and ssa models.

For the edit_anything model, it returns obviously wrong object descriptions. The following is the test image I input: [image]. The Region Segment module returns:

a dog is walking on the floor in a room: [0, 50, 383, 165]; a person riding a skateboard down a street: [234, 49, 149, 166]; a piece of paper with a black background: [0, 0, 64, 110]; a white light switch with a black light: [312, 0, 53, 80]; the moon is seen over the city skyline: [116, 0, 56, 38]; 

There are clearly no dogs or skateboards in the picture.

For the ssa model, when I add the --region_classify_model ssa option and change the region_semantic method to use ssa, the method errors out with:

│ /share/data/ripl/fjd/Image2Paragraph/models/segment_models/semantic_segment_anything_model.py:147 in semantic_class_w_mask │
│                                                                                                  │
│   144 │   │   │                                                                                  │
│   145 │   │   │   valid_mask_large_crop = mmcv.imcrop(valid_mask.numpy(), np.array([bbox[0], b   │
│   146 │   │   │   scale_large)                                                                   │
│ ❱ 147 │   │   │   top_1_patch_large = torch.bincount(class_ids_patch_large[torch.tensor(valid_   │
│   148 │   │   │   top_1_mask_category = mask_categories[top_1_patch_large.item()]                │
│   149 │   │   │                                                                                  │
│   150 │   │   │   ann['class_name'] = str(top_1_mask_category)                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: The shape of the mask [3, 23] at index 0 does not match the shape of the indexed tensor [23, 3] at index 0
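
For anyone trying to narrow this down, the error itself can be reproduced in isolation: PyTorch boolean-mask indexing requires the mask to have the same shape as the tensor being indexed. The sketch below is only an illustration under assumptions. The helper name `top_class_in_mask` and the toy tensors are hypothetical and do not come from the repository, and the transpose is used purely to make the toy shapes line up; whether a transpose (or a different crop) is the right fix in `semantic_class_w_mask` depends on how the two crops are produced there.

```python
import torch


def top_class_in_mask(class_ids: torch.Tensor, mask: torch.Tensor) -> int:
    """Return the most frequent class id among the pixels selected by a boolean mask.

    Hypothetical helper: it checks shapes up front so a mismatch is reported
    with both shapes instead of failing inside the indexing expression.
    """
    if mask.shape != class_ids.shape:
        raise ValueError(
            f"mask shape {tuple(mask.shape)} does not match "
            f"class map shape {tuple(class_ids.shape)}"
        )
    selected = class_ids[mask.bool()].flatten()
    if selected.numel() == 0:
        raise ValueError("mask selects no pixels")
    # bincount counts occurrences of each non-negative integer class id
    return int(torch.bincount(selected).argmax())


# Toy data with the shapes from the error message: a 23x3 class map ...
class_ids = torch.zeros(23, 3, dtype=torch.long)
class_ids[:, 1] = 5  # middle column belongs to class 5

# ... and a 3x23 mask, i.e. transposed relative to the class map.
mask = torch.zeros(3, 23, dtype=torch.bool)
mask[1, :] = True

# Indexing with the mismatched mask raises the same kind of IndexError.
try:
    class_ids[mask]
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc

# Once the shapes agree (here by transposing the toy mask), the lookup works.
aligned_class = top_class_in_mask(class_ids, mask.T)
```

Printing `class_ids_patch_large.shape` and `valid_mask_large_crop.shape` just before line 147 would show which of the two crops ends up with its height and width swapped.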

I wonder if you have a good way to use the region segment methods.