Dear authors,
Great work!
I was wondering if you have the script to reproduce zero-shot numbers on SSv2 (Table 7).
In my experiments, and in line with other papers [1, 2], a frozen CLIP with mean pooling over per-frame features gets 2.7% accuracy on the 174 SSv2 classes. Could you please explain this discrepancy, or point out what I may be missing?
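To make the baseline concrete, here is a minimal sketch of what "mean pooling on per-frame features" refers to. It assumes the per-frame image embeddings and the 174 class-prompt text embeddings have already been extracted with the frozen CLIP encoders. The function names are illustrative, and plain Python lists stand in for tensors, so this shows the general technique rather than an exact evaluation script.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def mean_pool(frame_feats):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_feats)
    return [sum(col) / n for col in zip(*frame_feats)]

def zero_shot_predict(frame_feats, text_feats):
    """Return the index of the class whose text embedding is most similar."""
    # Normalize each frame, mean-pool, then renormalize the pooled vector.
    video = l2_normalize(mean_pool([l2_normalize(f) for f in frame_feats]))
    # Cosine similarity between the video vector and every class embedding.
    sims = [sum(a * b for a, b in zip(video, l2_normalize(t))) for t in text_feats]
    return max(range(len(sims)), key=sims.__getitem__)

def top1_accuracy(videos, labels, text_feats):
    """Fraction of videos whose predicted class matches the label."""
    correct = sum(zero_shot_predict(v, text_feats) == y
                  for v, y in zip(videos, labels))
    return correct / len(labels)
```

Here `videos` would be a list of per-video frame-feature lists and `text_feats` one embedding per SSv2 class. If your Table 7 setup differs from this (for example in prompt templates, frame sampling, or pooling), that might account for the gap.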
[1] Videoprompter: an ensemble of foundational models for zero-shot video understanding. https://arxiv.org/pdf/2310.15324
[2] GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? https://arxiv.org/pdf/2311.15732