Details of zero-shot performance on SSv2

Dear authors,

Great work!

I was wondering whether you could share the script used to reproduce the zero-shot numbers on SSv2 (Table 7).

Based on my experiments, and also on other papers [1, 2], I get 2.7% accuracy on the 174 SSv2 classes with a frozen CLIP and mean pooling over per-frame features. Could you please elaborate on the discrepancy with the numbers in Table 7, or on what I may be missing?
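For concreteness, here is a minimal sketch of the mean-pooling baseline I have in mind. It assumes the per-frame image embeddings and the class-prompt text embeddings have already been extracted with a frozen CLIP; the function name `zero_shot_predict` and the array shapes are illustrative only, not taken from this repository or from [1, 2].

```python
import numpy as np


def zero_shot_predict(frame_feats: np.ndarray, text_feats: np.ndarray) -> int:
    """Predict a class index for one video.

    frame_feats: (T, D) per-frame image embeddings (assumed precomputed by CLIP).
    text_feats:  (C, D) one text embedding per class prompt (assumed precomputed).
    """
    # L2-normalize each frame and each class embedding.
    frame_feats = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)

    # Mean-pool over the temporal axis to get one video-level embedding,
    # then re-normalize so the dot product below is a cosine similarity.
    video_feat = frame_feats.mean(axis=0)
    video_feat = video_feat / np.linalg.norm(video_feat)

    # Cosine similarity against every class prompt; pick the best match.
    sims = text_feats @ video_feat
    return int(np.argmax(sims))
```

Top-1 accuracy is then the fraction of validation videos whose predicted index matches the ground-truth label among the 174 classes.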

[1] Videoprompter: an ensemble of foundational models for zero-shot video understanding. https://arxiv.org/pdf/2310.15324

[2] GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? https://arxiv.org/pdf/2311.15732

