Reproduction of Region classification in Fig.1

Thanks for you inspiring work! I'm reproducing the region classification results of your paper. I tried several CLIP models but the results on LVIS lags.

The models used: vanilla CLIP model RN50, RN50x4, ViT-B-32 The prompt templates used: templates = [ 'itap of a {}.', 'a bad photo of the {}.', 'a origami {}.', 'a photo of the large {}.', 'a {} in a video game.', 'art of the {}.', 'a photo of the small {}.', ] The metric used: top-1 accuracy The dataset split: results obtained on official validation set The results: ImageNet: 53.34, 59.71(seems pretty close to 59.6 as reported in the Fig.1 (b)), 56.34 LVIS: 7.58, 9.68, 11.93, which all seem to be far worse than 19.1 as reported in Fig. 1(b).

For LVIS, I used load_lvis_json for data loading and I cropped the images with gt 2d bboxes. I also tried with a single prompt of "a photo of a {class_name}" and the text embedding downloaded in this repo but the results are slightly worse. Could you provided more details about the region classification experiments?

Specifically, 1. which model did you use? 2. What text prompt did you use? 3. Is there anything wrong in my steps?

Again, thanks for your patience.

microsoft / RegionCLIP

Reproduction of Region classification in Fig.1 #73