simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0
288 stars 55 forks

Feature representations for external videos #51

Open arelhossan opened 2 years ago

arelhossan commented 2 years ago

Hi, thank you for sharing your work and congratulations on the paper!

I am trying to use COOT to create video descriptions for videos that aren't in ActivityNet. I saw your comment on creating 100M features for videos. However, when checking the .npy files, the shapes are always (n, 1024) or (n, 2048). Since S3D produces 512-dim vectors, why are the .npy files in these shapes? Sorry if I am missing something; I just need advice on using your model to create video descriptions.
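For reference, a minimal way to inspect the shapes of the provided feature files looks like this (the filename below is only a placeholder, not an actual path from the repository):

```python
import numpy as np

# Create a dummy (n, 2048) feature array standing in for a downloaded
# .npy file, then load and inspect it the same way one would inspect
# the real features. "example_features.npy" is a placeholder name.
dummy = np.zeros((7, 2048), dtype=np.float32)
np.save("example_features.npy", dummy)

feats = np.load("example_features.npy")
print(feats.shape)  # (num_clips, feature_dim), here (7, 2048)
```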

Thanks in advance!

simon-ging commented 2 years ago

Hi, the 2048-dim features are "Inception" features; these can be downloaded from the authors of the CMHSE paper. Models trained on these of course cannot work with 100M features, and 100M features don't work so well for ActivityNet. So you could check the CMHSE repository https://github.com/Sha-Lab/CMHSE to find out how they created their features, and then create similar features for your own dataset.

Otherwise, you could try the YouCook2 model, which works with 512-dim 100M features; however, this might not work well outside the cooking domain.

I am not sure which 1024-dim features you are referring to; please post the full path of the file and some context.

Best,

Simon