linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License
230 stars 34 forks

Frame-level feature #6

Closed shinying closed 3 years ago

shinying commented 3 years ago

Hi,

Thanks for open-sourcing your great work. I am trying to do the feature extraction myself and am wondering how frame-level features are encoded with SlowFast. Since the pre-trained SlowFast model takes a fixed number of frames as input for action recognition, did you sample multiple clips from a video at different locations, or perform other operations such as pooling or concatenation?
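For context, a common way to handle a fixed-clip-length model like SlowFast is the multi-clip strategy the question alludes to: sample several fixed-length clips at evenly spaced positions and pool the per-clip features. The sketch below is purely illustrative (the function names and pooling choice are assumptions, not HERO's code):

```python
import numpy as np

def sample_clip_starts(num_frames, clip_len, num_clips):
    """Evenly spaced start indices for `num_clips` clips of `clip_len` frames."""
    max_start = max(num_frames - clip_len, 0)
    return np.linspace(0, max_start, num_clips).astype(int)

def video_feature(frames, clip_len, num_clips, encode_clip):
    # frames: (T, H, W, C); encode_clip maps a (clip_len, H, W, C) clip to a (D,) vector
    starts = sample_clip_starts(len(frames), clip_len, num_clips)
    feats = [encode_clip(frames[s:s + clip_len]) for s in starts]
    # Average-pool over clips; max-pooling or keeping all clips are alternatives.
    return np.stack(feats).mean(axis=0)
```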

I look forward to your reply.

linjieli222 commented 3 years ago

Hi @shinying,

Sorry for the late reply, and thank you for your interest in our project. Our feature extraction code is now released: HERO_Video_Feature_Extractor. The feature pre-processing code (conversion to lmdb) is also updated here: 1b5d4a4a13ea222fa81ecc623ed58ffeb98b1fa9.

Thanks, Linjie

shinying commented 3 years ago

Hi, @linjieli222

Thanks for your reply and the early release. I have read the code and found that the extractor is well designed for extracting features from a video, but I only have video frames as input, so I am trying to modify video_loader.py. I am wondering if you have any suggestions on setting target_framerate and clip_len. The question mainly relates to preprocessing.py where, if I understand correctly, the extracted frames are padded and sampled to form ceil(number of extracted frames / (clip_len * target_framerate)) sequences of frames, each of size num_frames. This operation reduces the number of features from the number of frames to a much smaller number, and I find it difficult to interpret such a result as frame-level features.
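The clip count described above can be written out as a small formula. This is a minimal sketch following the issue's own description of preprocessing.py (the function name is mine, not the repo's):

```python
import math

def num_clips(num_extracted_frames, clip_len, target_framerate):
    """Clip-level feature count: ceil(frames / (clip_len * target_framerate))."""
    frames_per_clip = clip_len * target_framerate
    return math.ceil(num_extracted_frames / frames_per_clip)

# e.g. 450 frames extracted at 30 fps with 1.5 s clips yield 10 clip-level features,
# which is why the output is much shorter than one feature per frame.
```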

Thanks for your help. Shinying

linjieli222 commented 3 years ago

Hi Shinying,

I believe I have answered both questions in another thread.

Another possible solution: In script/convert_videodb.py, we concatenate the SlowFast features with the 2D-ResNet features and save them into an lmdb file. If you already have features extracted, but with a larger number of features per 1.5 seconds, you can downsample them to 1 feature per 1.5 seconds (if clip_len is set to 1.5 s). However, you may see some performance degradation, as our pre-training features were extracted at a high framerate.
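A hedged sketch of that suggestion, assuming per-frame features stored as (T, D) arrays (this is not the repo's convert_videodb.py code; mean-pooling within each window is one reasonable downsampling choice):

```python
import numpy as np

def downsample(feats, fps, clip_len=1.5):
    """feats: (T, D) per-frame features at `fps`; mean-pool into clip_len-second windows."""
    window = int(round(fps * clip_len))
    chunks = [feats[i:i + window].mean(axis=0)
              for i in range(0, len(feats), window)]
    return np.stack(chunks)

def fuse(slowfast_feats, resnet_feats):
    """Concatenate SlowFast and 2D-ResNet features along the feature axis."""
    n = min(len(slowfast_feats), len(resnet_feats))  # align clip counts
    return np.concatenate([slowfast_feats[:n], resnet_feats[:n]], axis=1)
```

The fused (num_clips, D_slowfast + D_resnet) array would then be written to lmdb as in the released pre-processing code.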

shinying commented 3 years ago

Hi Linjie,

Thanks for your reply and help again.

Shinying