Closed: fake-warrior8 closed this issue 2 years ago

Hi, I evaluated the CLIP zero-shot result on the MSRVTT dataset using your evaluation code: the dot product between text_embeds and vid_embeds (mean-pooled with torch.mean over dim=1, [N, T, H] -> [N, H]), with your similarity function. However, the result is R@1=21.7, R@5=42.7, while the result you give in Figure 3 of your paper is R@1=31.5, R@5=52.8. Is there something wrong with my mean-pooling method? By the way, the fine-tuned result on MSRVTT-9k is comparable to your paper.
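Concretely, the evaluation I'm describing amounts to roughly the following (a minimal sketch with made-up shapes, not the actual evaluation code):

import torch

# Illustrative shapes: N text-video pairs, T frames per video, H embedding dims.
N, T, H = 1000, 12, 512
text_embeds = torch.randn(N, H)    # CLIP text embeddings
vid_embeds = torch.randn(N, T, H)  # per-frame CLIP image embeddings

# Mean-pool over frames: [N, T, H] -> [N, H], then a plain dot-product similarity.
vid_pooled = vid_embeds.mean(dim=1)
sims = torch.mm(text_embeds, vid_pooled.T)  # [N, N] text-to-video similarity matrix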
You need to make sure to L2-normalize the text and video embeddings before taking the dot product.
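For example, something like this (a sketch assuming plain PyTorch tensors, not the repo's exact code):

import torch

def l2_norm(x):
    # Normalize along the last (embedding) dimension.
    return x / x.norm(dim=-1, keepdim=True)

# Hypothetical inputs: text_embeds [N, H], vid_embeds [N, T, H].
text_embeds = torch.randn(1000, 512)
vid_embeds = torch.randn(1000, 12, 512)

text_normed = l2_norm(text_embeds)            # [N, H], unit norm
vid_normed = l2_norm(vid_embeds.mean(dim=1))  # pool frames, then normalize
sims = torch.mm(text_normed, vid_normed.T)    # cosine similarities in [-1, 1]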
That's right, thank you! After L2 normalization I get R@1=29.4, R@5=51.7, which is not too different from the paper's result of R@1=31.5, R@5=52.8.
Doing the L2 norm before mean pooling can also help.
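That is, in the sketch above, the normalization moves inside the pooling (a hypothetical one-line change):

vid_normed = l2_norm(vid_embeds).mean(dim=1)  # normalize each frame first, then mean-pool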
Actually, I did do the L2 norm before mean pooling; here is the relevant code:
# Zero-shot retrieval requires L2-normalized embeddings.
text_embeds_normed = self.l2_norm(text_embeds)  # input_x / input_x.norm(dim=-1, keepdim=True)
vid_embeds_normed = self.l2_norm(vid_embeds)    # normalizes each frame embedding

print('\nraw text embed normed')
print(text_embeds_normed.shape)  # [N, H]

print('\nraw vid embed normed')
print(vid_embeds_normed.shape)   # [N, T, H]

# Mean-pool the normalized frame embeddings over time, then take the dot product.
raw_sims = torch.mm(text_embeds_normed, vid_embeds_normed.mean(1).T)
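One subtlety with this snippet: the mean of unit-norm frame embeddings is generally shorter than unit length, so raw_sims is a scaled dot product rather than an exact cosine similarity. Re-normalizing the pooled vector (an optional variant, not necessarily what the paper does) makes it exact:

vid_pooled = vid_embeds_normed.mean(1)                           # [N, H], norm < 1 in general
vid_pooled = vid_pooled / vid_pooled.norm(dim=-1, keepdim=True)  # back to unit norm
cos_sims = torch.mm(text_embeds_normed, vid_pooled.T)            # exact cosine similarities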