linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License

Does the dimension of the input video features have to be 4352? #42

Closed onlyonewater closed 2 years ago

onlyonewater commented 2 years ago

Does the dimension of the input video features have to be 4352? I want to use a pre-trained I3D model to extract features for my own dataset, and its feature dimension is 1024.

linjieli222 commented 2 years ago

Sorry for the late response. The feature dimension can be any number that matches your pre-extracted features. However, if you want to get the most out of our pre-trained weights, SlowFast+ResNet101 features are preferred. We explored other features during finetuning in the VALUE paper; the results in Table 9 and Section B.1 show that the pre-trained weights are transferable across different vision features.
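As a rough illustration of the point above (not code from this repo): when the feature dimension changes from 4352 to 1024, one common approach is to reuse every pre-trained parameter whose shape still matches and re-initialize the rest. The sketch below assumes that only an input projection layer depends on the feature dimension. The parameter names (`video_proj.*`, `encoder.layer0.weight`), the hidden size of 768, and the helper `split_by_shape` are all hypothetical, and the `fake` objects are stand-ins for tensors, since only `.shape` is inspected.

```python
from types import SimpleNamespace


def split_by_shape(pretrained_state, model_state):
    """Split pre-trained params into those that fit the new model and those that do not.

    Both arguments are mappings from parameter name to any object with a
    `.shape` attribute (for example, the tensors in a PyTorch state dict).
    """
    compatible, mismatched = {}, []
    for name, param in pretrained_state.items():
        target = model_state.get(name)
        if target is not None and tuple(param.shape) == tuple(target.shape):
            compatible[name] = param
        else:
            mismatched.append(name)
    return compatible, mismatched


def fake(*shape):
    """Stand-in for a tensor: only the shape matters in this sketch."""
    return SimpleNamespace(shape=shape)


HIDDEN = 768  # assumed hidden size, for illustration only

# Hypothetical checkpoint trained on 4352-D video features.
pretrained = {
    "video_proj.weight": fake(HIDDEN, 4352),
    "video_proj.bias": fake(HIDDEN),
    "encoder.layer0.weight": fake(HIDDEN, HIDDEN),
}

# Hypothetical model built for 1024-D I3D features.
own_model = {
    "video_proj.weight": fake(HIDDEN, 1024),
    "video_proj.bias": fake(HIDDEN),
    "encoder.layer0.weight": fake(HIDDEN, HIDDEN),
}

compatible, mismatched = split_by_shape(pretrained, own_model)
print(mismatched)  # ['video_proj.weight']
```

With real PyTorch state dicts, the `compatible` mapping could then be passed to `model.load_state_dict(compatible, strict=False)`, so that the mismatched input projection keeps its fresh initialization and is learned during finetuning.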

onlyonewater commented 2 years ago

OK, I get it, thanks. I will give it a try!