Closed: fake-warrior8 closed this issue 2 years ago

Hi, I evaluated the CLIP zero-shot result on the MSRVTT dataset using your evaluation code: the dot product between text_embeds and vid_embeds (mean-pooled with torch.mean over dim=1, [N, T, H] -> [N, H]), with your similarity function. However, the result is R@1=21.7, R@5=42.7, while the result you give in Figure 3 of your paper is R@1=31.5, R@5=52.8. Is there something wrong with my mean-pooling method? By the way, the fine-tuned result on MSRVTT-9k is comparable to your paper.
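Concretely, the evaluation I'm describing amounts to roughly the following (a minimal sketch with made-up shapes, not the actual evaluation code):

import torch

# Illustrative shapes: N text-video pairs, T frames per video, H embedding dims.
N, T, H = 1000, 12, 512
text_embeds = torch.randn(N, H)    # CLIP text embeddings
vid_embeds = torch.randn(N, T, H)  # per-frame CLIP image embeddings

# Mean-pool over frames: [N, T, H] -> [N, H], then a plain dot-product similarity.
vid_pooled = vid_embeds.mean(dim=1)
sims = torch.mm(text_embeds, vid_pooled.T)  # [N, N] text-to-video similarity matrix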
You need to make sure to L2-normalize the text and video embeddings before taking the dot product.
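For example, something like this (a sketch assuming plain PyTorch tensors, not the repo's exact code):

import torch

def l2_norm(x):
    # Normalize along the last (embedding) dimension.
    return x / x.norm(dim=-1, keepdim=True)

# Hypothetical inputs: text_embeds [N, H], vid_embeds [N, T, H].
text_embeds = torch.randn(1000, 512)
vid_embeds = torch.randn(1000, 12, 512)

text_normed = l2_norm(text_embeds)            # [N, H], unit norm
vid_normed = l2_norm(vid_embeds.mean(dim=1))  # pool frames, then normalize
sims = torch.mm(text_normed, vid_normed.T)    # cosine similarities in [-1, 1]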
That's right, thank you! After L2 normalization I get R@1=29.4, R@5=51.7, which is not too different from the paper's result of R@1=31.5, R@5=52.8.
Doing the L2 norm before mean pooling can also help.
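That is, in the sketch above, the normalization moves inside the pooling (a hypothetical one-line change):

vid_normed = l2_norm(vid_embeds).mean(dim=1)  # normalize each frame first, then mean-pool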
Actually, I did do the L2 norm before mean pooling; here is the relevant code:
# Zero-shot retrieval requires L2-normalized embeddings.
text_embeds_normed = self.l2_norm(text_embeds)  # input_x / input_x.norm(dim=-1, keepdim=True)
vid_embeds_normed = self.l2_norm(vid_embeds)    # normalizes each frame embedding

print('\nraw text embed normed')
print(text_embeds_normed.shape)  # [N, H]

print('\nraw vid embed normed')
print(vid_embeds_normed.shape)   # [N, T, H]

# Mean-pool the normalized frame embeddings over time, then take the dot product.
raw_sims = torch.mm(text_embeds_normed, vid_embeds_normed.mean(1).T)
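One subtlety with this snippet: the mean of unit-norm frame embeddings is generally shorter than unit length, so raw_sims is a scaled dot product rather than an exact cosine similarity. Re-normalizing the pooled vector (an optional variant, not necessarily what the paper does) makes it exact:

vid_pooled = vid_embeds_normed.mean(1)                           # [N, H], norm < 1 in general
vid_pooled = vid_pooled / vid_pooled.norm(dim=-1, keepdim=True)  # back to unit norm
cos_sims = torch.mm(text_embeds_normed, vid_pooled.T)            # exact cosine similarities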