simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0
288 stars 55 forks

Feature representations for external videos #51

Open arelhossan opened 2 years ago

arelhossan commented 2 years ago

Hi, thank you for sharing your work and congratulations on the paper!

I am trying to use COOT to create video descriptions for videos that aren't in ActivityNet. I saw your comment on creating 100M features for videos. However, when checking the .npy files, the shapes are always (n, 1024) or (n, 2048). Since S3D produces 512-dim vectors, why are the .npy files in these shapes? Sorry if I am missing something; I just need advice on using your model to create video descriptions.
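For reference, a minimal way to inspect the shapes of the provided feature files looks like this (the filename below is only a placeholder, not an actual path from the repository):

```python
import numpy as np

# Create a dummy (n, 2048) feature array standing in for a downloaded
# .npy file, then load and inspect it the same way one would inspect
# the real features. "example_features.npy" is a placeholder name.
dummy = np.zeros((7, 2048), dtype=np.float32)
np.save("example_features.npy", dummy)

feats = np.load("example_features.npy")
print(feats.shape)  # (num_clips, feature_dim), here (7, 2048)
```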

Thanks in advance!

simon-ging commented 2 years ago

Hi, the 2048-dim features are "Inception" features; these can be downloaded from the authors of the CMHSE paper. Models trained on these of course cannot work with 100M features, and 100M features don't work so well for ActivityNet. So you could check the CMHSE repository https://github.com/Sha-Lab/CMHSE to find out how they created their features, and then create similar features for your own dataset.

Otherwise, you could try the YouCook2 model, which works with 512-dim 100M features; however, this might not work well outside the cooking domain.

I am not sure which 1024-dim features you are referring to; please post the full path of the file and some context.

Best,

Simon