microsoft / XPretrain

Multi-modality pre-training

Details of zero-shot performance on SSv2 #44

Open bpiyush opened 1 month ago

bpiyush commented 1 month ago

Dear authors,

Great work!

Could you share the script used to reproduce the zero-shot numbers on SSv2 (Table 7)?

In my experiments, and in line with other papers [1, 2], a frozen CLIP with mean pooling over per-frame features reaches 2.7% accuracy on the 174 classes in SSv2. Could you please explain the discrepancy with the numbers in Table 7, or point out what I may be missing?
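For concreteness, here is a minimal sketch of the baseline described above: per-frame embeddings are mean-pooled into a single video embedding, which is then matched against class-name text embeddings by cosine similarity. It assumes the embeddings have already been extracted with a frozen CLIP image and text encoder; the function names and the toy vectors are hypothetical and only illustrate the evaluation logic, not the authors' script.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_feats):
    """Average per-frame embeddings into one video embedding."""
    n = len(frame_feats)
    return [sum(column) / n for column in zip(*frame_feats)]


def zero_shot_predict(frame_feats, text_feats):
    """Return the index of the class whose text embedding is most similar."""
    # Normalize each frame, pool over time, then normalize the pooled vector.
    video = l2_normalize(mean_pool([l2_normalize(f) for f in frame_feats]))
    scores = [
        sum(a * b for a, b in zip(video, l2_normalize(t))) for t in text_feats
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(videos, labels, text_feats):
    """Fraction of videos whose top-1 prediction matches the label."""
    correct = sum(
        zero_shot_predict(v, text_feats) == y for v, y in zip(videos, labels)
    )
    return correct / len(labels)


# Toy stand-ins for CLIP features (assumption: real ones come from the encoders).
toy_text_feats = [[1.0, 0.0], [0.0, 1.0]]
toy_videos = [
    [[0.9, 0.1], [0.8, 0.3]],  # two frames, closer to class 0
    [[0.2, 1.0], [0.1, 0.7]],  # two frames, closer to class 1
]
toy_labels = [0, 1]
accuracy = top1_accuracy(toy_videos, toy_labels, toy_text_feats)
```

For SSv2, `text_feats` would hold one embedding per class (174 in total), so details such as the prompt template and the number of sampled frames are the parts of the setup most likely to differ between implementations.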

[1] Videoprompter: an ensemble of foundational models for zero-shot video understanding. https://arxiv.org/pdf/2310.15324

[2] GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? https://arxiv.org/pdf/2311.15732