tsujuifu / pytorch_violet

A PyTorch implementation of VIOLET

zero-shot evaluation (video retrieval) #15

Closed avinashsai closed 1 year ago

avinashsai commented 1 year ago

Hello,

Congratulations on the amazing work. I have a few questions about the zero-shot evaluation in Table 1.

  1. Which checkpoint is used for zero-shot evaluation?
  2. Does the retrieval model have fully connected layers on top of the VIOLET base model? If so, are these layers randomly initialized in the zero-shot evaluation?
  3. If I need separate video and text features, which layer outputs are the most suitable (EncImg / EncTxt / the cross-modal transformer)?

Thank you.

tsujuifu commented 1 year ago

Hi Avinash,

  1. The pre-trained checkpoint is used for zero-shot evaluation.
  2. We use the FC layer for video-text matching (VTM), which is trained during pre-training, so it is not randomly initialized.
  3. You should use the separate features before cross-modal fusion; a rough sketch follows below.
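
For concreteness, here is a minimal sketch of pulling out those separate pre-fusion features. This is not the official evaluation script: it assumes a `VIOLET_Base`-style model whose `go_feat()` runs the unimodal EncImg / EncTxt encoders and returns token features plus attention masks, so please check `model.py` for the exact attribute names and signatures.

```python
import torch

# Sketch only, not the official evaluation script. Assumes a
# VIOLET_Base-style model whose go_feat() runs the unimodal encoders
# (EncImg / EncTxt) and returns token features plus attention masks;
# check model.py for the exact attribute names and signatures.
@torch.no_grad()
def extract_separate_features(model, video, txt, mask):
    model.eval()
    # Unimodal features, i.e. before the cross-modal fusion transformer.
    feat_img, mask_img, feat_txt, mask_txt = model.go_feat(video, txt, mask)
    # Masked mean pooling over tokens -> one vector per clip / sentence.
    # (One simple readout; a trained retrieval head may pool differently.)
    m_img = mask_img.unsqueeze(-1).float()
    m_txt = mask_txt.unsqueeze(-1).float()
    video_feat = (feat_img * m_img).sum(dim=1) / m_img.sum(dim=1).clamp(min=1.0)
    text_feat = (feat_txt * m_txt).sum(dim=1) / m_txt.sum(dim=1).clamp(min=1.0)
    return video_feat, text_feat
```
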
avinashsai commented 1 year ago

Hi Tsu-Jui,

Thanks for your reply.

  1. Just curious: if VIOLET is to be evaluated on a completely different multi-modal task (one that requires both video and text features) in a zero-shot setting, which features are recommended, the separate video / text features or the cross-modal ones?

tsujuifu commented 1 year ago

In that case, I would suggest using the fused features, which capture both vision and language perception.
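
Something like the sketch below could work. Again, it assumes the `go_feat` / `go_cross` helpers on `VIOLET_Base`, where `go_cross` is assumed to return the fused token sequence and its mask, so verify the exact signatures against `model.py`; masked mean pooling is just one way to collapse the fused sequence into a single joint embedding.

```python
import torch

# Sketch only: go_cross() is assumed to concatenate the video and text
# token features and run them through the cross-modal transformer,
# returning the fused token sequence and its mask (verify in model.py).
@torch.no_grad()
def extract_fused_features(model, video, txt, mask):
    model.eval()
    feat_img, mask_img, feat_txt, mask_txt = model.go_feat(video, txt, mask)
    out, out_mask = model.go_cross(feat_img, mask_img, feat_txt, mask_txt)
    # Masked mean pooling over all fused tokens -> one joint embedding
    # that attends to both the video and the text.
    m = out_mask.unsqueeze(-1).float()
    return (out * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
```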

avinashsai commented 1 year ago

You mean video outputs and fused features?

tsujuifu commented 1 year ago

Yes