linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License

when will you release the code to process the video data? #1

Closed youngfly11 closed 3 years ago

linjieli222 commented 3 years ago

Thanks for your interest. We plan to release feature extraction code but cannot guarantee a timeline.

If you urgently need to extract video features in the same format as HERO, you can refer to the following repos to build your own feature extraction pipeline:

  1. Clip-level 3D features from SlowFast; we use the pretrained SLOWFAST_8x8_R50 model.
  2. Image-level 2D features from ResNet-152, following HowTo100M (see the sketch after this list).
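
If it helps, here is a minimal sketch of the image-level ResNet-152 path (item 2) using torchvision. The frame sampling, input resolution, and normalization here are assumptions and may not match HERO's exact preprocessing.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Minimal sketch: 2D image-level features from ResNet-152 (HowTo100M-style).
# The preprocessing below is an assumption, not the exact HERO pipeline.
model = models.resnet152(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier to expose 2048-d pooled features
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frames):
    """frames: list of PIL.Image objects sampled at the target frame rate."""
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch)  # shape: (num_frames, 2048)
```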

Thanks, Linjie

youngfly11 commented 3 years ago

Hi, linjieli,

Thanks for your reply! I have some questions:

linjieli222 commented 3 years ago

Please find the answers to your questions below:

  1. As mentioned in Appendix A.5 of our paper, we extract video features at a fixed frame rate (TV: 2/3 frames per second, HowTo100M: 1/2 frames per second). For downstream tasks, you can check vfeat_interval in each config to get the corresponding frame rate (frame_rate = 1/vfeat_interval); see the sketch after this list. For example: https://github.com/linjieli222/HERO/blob/bc4aec5af1d8eafeb468e78a033f56cd37210097/config/train-didemo_video_only-4gpu.json#L26

  2. As mentioned in Section 4.1 of our paper, we only cut the HowTo100M videos into 60-second clips. All other videos are kept at their original length. For example, if a TV video is 90 seconds long, then you will get a 3D/2D video feature sequence of length 60.

  3. We feed frames at the video's original fps into SlowFast to get the 3D video features. Note that the frame rate mentioned above (e.g., 2/3 frames per second) means that we get one feature every 1.5 seconds. At a high level, each 1.5-second video clip is fed into SlowFast to get one feature vector, and we repeat this process over the whole video.
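
Putting the three points together, here is a minimal sketch of the clip-level loop, assuming the config is a JSON file with a vfeat_interval field; slowfast_forward is a hypothetical stand-in for a SlowFast (SLOWFAST_8x8_R50) feature extractor, not code from this repo.

```python
import json

def extract_3d_features(video_frames, video_fps, config_path, slowfast_forward):
    # frame_rate = 1 / vfeat_interval, e.g. 2/3 features per second for TV (vfeat_interval = 1.5)
    with open(config_path) as f:
        vfeat_interval = json.load(f)["vfeat_interval"]
    window = int(round(video_fps * vfeat_interval))  # frames per clip at the original fps

    features = []
    for start in range(0, len(video_frames) - window + 1, window):
        clip = video_frames[start:start + window]   # one vfeat_interval-second clip
        features.append(slowfast_forward(clip))     # one feature vector per clip
    return features  # e.g. 60 vectors for a 90-second TV video
```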

Thanks, Linjie