microsoft / RegionCLIP

[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"

Reproduction of Region classification in Fig.1 #73

Closed HatakeKiki closed 1 year ago

HatakeKiki commented 1 year ago

Thanks for your inspiring work! I'm reproducing the region classification results from your paper. I tried several CLIP models, but the results on LVIS lag behind.

The models used: vanilla CLIP models RN50, RN50x4, ViT-B-32

The prompt templates used (ensembled as in the sketch below):

```python
templates = [
    'itap of a {}.',
    'a bad photo of the {}.',
    'a origami {}.',
    'a photo of the large {}.',
    'a {} in a video game.',
    'art of the {}.',
    'a photo of the small {}.',
]
```

The metric used: top-1 accuracy

The dataset split: results obtained on the official validation set

The results:

- ImageNet: 53.34, 59.71 (seems pretty close to the 59.6 reported in Fig. 1(b)), 56.34
- LVIS: 7.58, 9.68, 11.93, all far worse than the 19.1 reported in Fig. 1(b)
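For reference, the templates are ensembled with the standard CLIP zero-shot recipe, i.e. per-class averaging of the normalized text embeddings (a minimal sketch using the openai `clip` package; the exact details may differ):

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

@torch.no_grad()
def build_text_classifier(classnames, templates):
    """Return a (D, C) matrix of L2-normalized class embeddings."""
    weights = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)             # (T, D), one row per template
        emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize each template embedding
        mean = emb.mean(dim=0)                      # ensemble by averaging
        weights.append(mean / mean.norm())          # re-normalize the average
    return torch.stack(weights, dim=1)
```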

For LVIS, I used load_lvis_json for data loading and cropped the images with the GT 2D bboxes. I also tried a single prompt of "a photo of a {class_name}" and the text embeddings downloaded from this repo, but the results are slightly worse. Could you provide more details about the region classification experiments?
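Concretely, the crop-and-classify step looks roughly like this (a sketch; `records` is a hypothetical list of `(image_path, box_xywh, label)` tuples taken from the GT annotations, and `text_weights` comes from the snippet above):

```python
from PIL import Image

@torch.no_grad()
def region_top1_accuracy(records, text_weights):
    correct = total = 0
    for path, (x, y, w, h), label in records:
        img = Image.open(path).convert("RGB")
        crop = img.crop((x, y, x + w, y + h))  # LVIS GT boxes are XYWH
        feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        pred = (feat @ text_weights).argmax(dim=-1).item()
        correct += int(pred == label)
        total += 1
    return correct / total
```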

Specifically:

1. Which model did you use?
2. What text prompt did you use?
3. Is there anything wrong with my steps?

Again, thanks for your patience.

YiwuZhong commented 1 year ago

Hi @HatakeKiki, thanks for your interest in our work. I used ResNet-50 (see Table 11 of the CLIP paper). The text prompts include ~80 templates, which can be found in this codebase. To further improve region classification, I applied some augmentation (e.g., cropping a larger region at different scales such as 1.2/1.5/2.0). The accuracy is highly sensitive to the augmentation used, but it will always be low (below 20).
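One way to implement that enlargement (a sketch, assuming each XYWH box is expanded around its center and clipped to the image bounds, with predictions averaged over scales):

```python
def enlarge_box(x, y, w, h, img_w, img_h, scale):
    """Expand an XYWH box around its center by `scale`, clipped to the image."""
    cx, cy = x + w / 2.0, y + h / 2.0
    nw, nh = w * scale, h * scale
    x0, y0 = max(0.0, cx - nw / 2.0), max(0.0, cy - nh / 2.0)
    x1, y1 = min(img_w, cx + nw / 2.0), min(img_h, cy + nh / 2.0)
    return x0, y0, x1 - x0, y1 - y0

# e.g., crop at scales (1.0, 1.2, 1.5, 2.0), encode each crop with CLIP,
# and average the normalized image embeddings before classification.
```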