Closed by andylin12, 3 years ago
Hi, thanks for your input. Yes, we tested 3 versions: the 1024-d features, the 512-d features, and the concatenation of both. The 512-d features alone worked best.
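For anyone trying the same comparison on their own dataset, a minimal numpy sketch of the three variants (the shapes and variable names are illustrative assumptions, not the authors' actual pipeline):

```python
import numpy as np

# Hypothetical per-clip features for T video clips:
T = 16
feat_1024 = np.random.rand(T, 1024).astype(np.float32)  # e.g. pooled 'mixed_5c' features
feat_512 = np.random.rand(T, 512).astype(np.float32)    # e.g. the 512-d video embedding

# Third variant: concatenate both along the feature axis -> (T, 1536)
feat_concat = np.concatenate([feat_1024, feat_512], axis=1)
print(feat_concat.shape)  # (16, 1536)
```

Each variant then feeds into the same downstream model, so the only change between experiments is the input feature dimension.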
Thanks for the information. I am going to try COOT on my own dataset. I think I can extract both feature types and try different combinations.
I looked into the HowTo100M pretrained model and found the following on its TensorFlow model page:
The page suggests using the 'mixed_5c' output, not the 512-d video embedding, for downstream tasks.
Since the text branch of the HowTo100M model was trained with word2vec, while COOT uses BERT, further processing the features in COOT is effectively a downstream task.
I am wondering whether you have tried the 'mixed_5c' features (average-pooled S3D).
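For reference, a hedged sketch of what "average-pooled S3D" would look like for a 'mixed_5c' feature map; the spatio-temporal dimensions below are assumptions about the S3D output layout, not values from the model page:

```python
import numpy as np

# Assumed 'mixed_5c' output for one video snippet:
# (frames, height, width, channels)
mixed_5c = np.random.rand(8, 7, 7, 1024).astype(np.float32)

# Global average pooling over the spatio-temporal axes
# collapses the map into a single 1024-d feature vector.
pooled = mixed_5c.mean(axis=(0, 1, 2))
print(pooled.shape)  # (1024,)
```

This 1024-d vector is what would replace the 512-d video embedding as the per-clip input feature.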
Thanks for the awesome and inspiring work. Looking forward to your reply.
Ref: TensorFlow model page (also linked from the HowTo100M GitHub page): https://tfhub.dev/deepmind/mil-nce/s3d/1