tsujuifu / pytorch_violet

A PyTorch implementation of VIOLET

zero-shot evaluation (video retrieval) #15

Closed avinashsai closed 1 year ago

avinashsai commented 1 year ago

Hello,

Congratulations on the amazing work. I have a few questions about the zero-shot evaluation in Table 1.

  1. Which checkpoint is used for zero-shot evaluation?
  2. Does the retrieval model have fully connected layers on top of the VIOLET base model? If so, are these layers randomly initialized in the zero-shot evaluation?
  3. If I need separate video and text features, which layer outputs are the most suitable (EncImg / EncTxt / the cross-modal transformer)?

Thank you.

tsujuifu commented 1 year ago

Hi Avinash,

  1. The pre-trained checkpoint is used for zero-shot evaluation.
  2. We use the FC layer for video-text matching (VTM), which is trained during pre-training, so it is not randomly initialized.
  3. You should use the separate features before cross-modal fusion; a rough sketch follows below.
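
For concreteness, here is a minimal sketch of pulling out those separate pre-fusion features. This is not the official evaluation script: it assumes a `VIOLET_Base`-style model whose `go_feat()` runs the unimodal EncImg / EncTxt encoders and returns token features plus attention masks, so please check `model.py` for the exact attribute names and signatures.

```python
import torch

# Sketch only, not the official evaluation script. Assumes a
# VIOLET_Base-style model whose go_feat() runs the unimodal encoders
# (EncImg / EncTxt) and returns token features plus attention masks;
# check model.py for the exact attribute names and signatures.
@torch.no_grad()
def extract_separate_features(model, video, txt, mask):
    model.eval()
    # Unimodal features, i.e. before the cross-modal fusion transformer.
    feat_img, mask_img, feat_txt, mask_txt = model.go_feat(video, txt, mask)
    # Masked mean pooling over tokens -> one vector per clip / sentence.
    # (One simple readout; a trained retrieval head may pool differently.)
    m_img = mask_img.unsqueeze(-1).float()
    m_txt = mask_txt.unsqueeze(-1).float()
    video_feat = (feat_img * m_img).sum(dim=1) / m_img.sum(dim=1).clamp(min=1.0)
    text_feat = (feat_txt * m_txt).sum(dim=1) / m_txt.sum(dim=1).clamp(min=1.0)
    return video_feat, text_feat
```
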
avinashsai commented 1 year ago

Hi Tsu-Jui,

Thanks for your reply.

  1. Just curious: if VIOLET is to be evaluated on a completely different multi-modal task (one that requires both video and text features) in a zero-shot setting, which features are recommended, the separate video / text features or the cross-modal ones?

tsujuifu commented 1 year ago

In that case, I would suggest using the fused features, which capture both vision and language perception.
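
Something like the sketch below could work. Again, it assumes the `go_feat` / `go_cross` helpers on `VIOLET_Base`, where `go_cross` is assumed to return the fused token sequence and its mask, so verify the exact signatures against `model.py`; masked mean pooling is just one way to collapse the fused sequence into a single joint embedding.

```python
import torch

# Sketch only: go_cross() is assumed to concatenate the video and text
# token features and run them through the cross-modal transformer,
# returning the fused token sequence and its mask (verify in model.py).
@torch.no_grad()
def extract_fused_features(model, video, txt, mask):
    model.eval()
    feat_img, mask_img, feat_txt, mask_txt = model.go_feat(video, txt, mask)
    out, out_mask = model.go_cross(feat_img, mask_img, feat_txt, mask_txt)
    # Masked mean pooling over all fused tokens -> one joint embedding
    # that attends to both the video and the text.
    m = out_mask.unsqueeze(-1).float()
    return (out * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
```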

avinashsai commented 1 year ago

You mean video outputs and fused features?

tsujuifu commented 1 year ago

Yes