Sorry, we do not have time to organize the code right now. Based on the figure and the code we have published, you should be able to see how CLIPViC is designed. We think two factors can affect the performance: the input preprocessing and the device.
Input Preprocessing:
self.cliptrans = T.Compose([
    T.IResize([224, 224]),  # resize to the CLIP input resolution
    # T.IResize([336, 336]),  # alternative size, e.g. for ViT-L/14@336px
    T.ToTensor(),
    # OpenAI CLIP image normalization statistics (mean, std)
    T.Normalize([0.48145466, 0.4578275, 0.40821073],
                [0.26862954, 0.26130258, 0.27577711]),
])
########################################################
# In the dataset's __getitem__:
image0, target0 = self.transforms(image, target)    # detector-side augmentation
image1, target1 = self.normalize(image0, target0)   # detector-side normalization
clipimg, _ = self.cliptrans(image0, None)            # CLIP-side preprocessing of the same augmented image
# return image, target
return image1, target1, clipimg                      # detector input, targets, and CLIP input
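Note that T.IResize, T.ToTensor, and T.Normalize here come from the repository's own transforms module, not from torchvision: each step takes and returns an (image, target) pair, which is why self.cliptrans(image0, None) returns two values. Below is a minimal sketch of such a pair-based resize, written purely as an assumption about the interface shown above (the class name is kept, the body is only illustrative):

import torchvision.transforms.functional as F

class IResize:
    """Sketch of a detection-style resize that operates on (image, target) pairs."""
    def __init__(self, size):
        self.size = size  # target (height, width), e.g. [224, 224] for CLIP ViT-B/16

    def __call__(self, image, target=None):
        image = F.resize(image, self.size)
        # Box coordinates in `target` would normally be rescaled as well;
        # the CLIP branch passes target=None, so nothing is done here.
        return image, target

With this convention, the same augmented image0 is fed once through the detector pipeline (normalize) and once through cliptrans for the CLIP image encoder.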
Different devices can lead to small differences in results, which we consider normal. In addition, previous work has already demonstrated the capability of the CLIP model: the CLIP branch alone (ViT-B/16, 35.84 mAP) outperforms PViC (34.69 mAP).
For zero-shot inference, we referred to ADA-CM, GEN-VLKT, and HOICLIP and adapted the code step by step. The dataset is split following GEN-VLKT, and the rare/non-rare evaluation is then replaced with unseen/seen evaluation.
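To make the last step concrete, here is a minimal sketch (an assumption, not our evaluation code) of regrouping a per-class AP array over the 600 HICO-DET HOI categories into seen/unseen mAP, given the unseen class indices from the GEN-VLKT split; ap and unseen_hoi_ids are hypothetical names:

import numpy as np

def zero_shot_map(ap, unseen_hoi_ids, num_classes=600):
    # ap: per-class average precision over all HOI categories
    ap = np.asarray(ap)
    unseen = np.zeros(num_classes, dtype=bool)
    unseen[list(unseen_hoi_ids)] = True
    return {
        "full mAP": ap.mean(),
        "unseen mAP": ap[unseen].mean(),   # takes the place of the "rare" column
        "seen mAP": ap[~unseen].mean(),    # takes the place of the "non-rare" column
    }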
I also introduced CLIP into PViC, but it did not work as well as yours. I also do not know how to run the experiments under the zero-shot setting; if you have the code, could you provide it? I would really appreciate it.