simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

Have you tried using 'mixed_5c' as frame embeddings for howto100m model? #31

Closed andylin12 closed 3 years ago

andylin12 commented 3 years ago

I looked into the HowTo100M pretrained model and found the following on its TensorFlow Hub model page:

Note that vision_output is a dictionary which contains two keys:

mixed_5c: This is the globally average-pooled feature from S3D, of dimension 1024. This should be used for classification on downstream tasks.

video_embedding: This is the video embedding (size 512) from the joint text-video space. It should be used to compute similarity scores with text inputs using the text embedding.

The above suggests using the 'mixed_5c' output, not the 512-d video embedding, for downstream tasks.

Since the text branch of the HowTo100M model was trained with word2vec embeddings, while COOT uses BERT, further processing of the features is more like a downstream task.

I am wondering if you have tried the 'mixed_5c' (average-pooled S3D) features.

Thanks for the awesome and inspiring work. Looking forward to your reply.

ref: the TensorFlow Hub model page, also linked from the HowTo100M GitHub page: https://tfhub.dev/deepmind/mil-nce/s3d/1

simon-ging commented 3 years ago

Hi, thanks for your input. Yes, we tested 3 versions: the 1024-d features, the 512-d features, and the concatenation of both. The 512-d features alone worked best.
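A minimal NumPy sketch of the three variants mentioned above, assuming per-clip features have already been extracted from the S3D model (the array contents here are random placeholders; only the shapes, 1024-d for 'mixed_5c' and 512-d for 'video_embedding', follow the TF Hub page):

```python
import numpy as np

# Placeholder per-clip features; in practice these come from running the
# HowTo100M S3D model and reading the 'mixed_5c' and 'video_embedding'
# entries of its vision_output dictionary.
num_clips = 8
mixed_5c = np.random.rand(num_clips, 1024).astype(np.float32)
video_embedding = np.random.rand(num_clips, 512).astype(np.float32)

# The three feature variants compared in the reply:
feat_1024 = mixed_5c                  # 'mixed_5c' only
feat_512 = video_embedding            # joint-space embedding only (worked best)
feat_1536 = np.concatenate([mixed_5c, video_embedding], axis=-1)  # both

print(feat_1024.shape, feat_512.shape, feat_1536.shape)
```

Any of the three arrays can then be fed to COOT as the frame-level input, with the model's input dimension adjusted accordingly.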

andylin12 commented 3 years ago

Thanks for the information. I am going to try COOT on my own dataset; I can extract both features and try different combinations.