Sorry, we do not have time to organize the code right now. Based on the figure and the code we have published, you should be able to see how CLIPViC is designed. We think two factors can affect the performance: the input preprocessing and the device.
Input Preprocessing:
self.cliptrans = T.Compose([
    T.IResize([224, 224]),  # resize to the CLIP input resolution
    # T.IResize([336, 336]),  # alternative size, e.g. for ViT-L/14@336px
    T.ToTensor(),
    # OpenAI CLIP image normalization statistics (mean, std)
    T.Normalize([0.48145466, 0.4578275, 0.40821073],
                [0.26862954, 0.26130258, 0.27577711]),
])
########################################################
# In the dataset's __getitem__:
image0, target0 = self.transforms(image, target)    # detector-side augmentation
image1, target1 = self.normalize(image0, target0)   # detector-side normalization
clipimg, _ = self.cliptrans(image0, None)            # CLIP-side preprocessing of the same augmented image
# return image, target
return image1, target1, clipimg                      # detector input, targets, and CLIP input
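Note that T.IResize, T.ToTensor, and T.Normalize here come from the repository's own transforms module, not from torchvision: each step takes and returns an (image, target) pair, which is why self.cliptrans(image0, None) returns two values. Below is a minimal sketch of such a pair-based resize, written purely as an assumption about the interface shown above (the class name is kept, the body is only illustrative):

import torchvision.transforms.functional as F

class IResize:
    """Sketch of a detection-style resize that operates on (image, target) pairs."""
    def __init__(self, size):
        self.size = size  # target (height, width), e.g. [224, 224] for CLIP ViT-B/16

    def __call__(self, image, target=None):
        image = F.resize(image, self.size)
        # Box coordinates in `target` would normally be rescaled as well;
        # the CLIP branch passes target=None, so nothing is done here.
        return image, target

With this convention, the same augmented image0 is fed once through the detector pipeline (normalize) and once through cliptrans for the CLIP image encoder.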
Different devices can lead to small differences in results, which we consider normal. In addition, previous work has already demonstrated the capability of the CLIP model: the CLIP branch alone (ViT-B/16, 35.84 mAP) outperforms PViC (34.69 mAP).
For zero-shot inference, we referred to ADA-CM, GEN-VLKT, and HOICLIP and adapted the code step by step. The dataset is split following GEN-VLKT, and the rare/non-rare evaluation is then replaced with unseen/seen evaluation.
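To make the last step concrete, here is a minimal sketch (an assumption, not our evaluation code) of regrouping a per-class AP array over the 600 HICO-DET HOI categories into seen/unseen mAP, given the unseen class indices from the GEN-VLKT split; ap and unseen_hoi_ids are hypothetical names:

import numpy as np

def zero_shot_map(ap, unseen_hoi_ids, num_classes=600):
    # ap: per-class average precision over all HOI categories
    ap = np.asarray(ap)
    unseen = np.zeros(num_classes, dtype=bool)
    unseen[list(unseen_hoi_ids)] = True
    return {
        "full mAP": ap.mean(),
        "unseen mAP": ap[unseen].mean(),   # takes the place of the "rare" column
        "seen mAP": ap[~unseen].mean(),    # takes the place of the "non-rare" column
    }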
I also introduced CLIP into PViC, but it did not work as well as yours. I also do not know how to run the experiments under the zero-shot setting; if you have the code, could you provide it? I would really appreciate it.