linjieli222 / HERO_Video_Feature_Extractor

Video Feature Extraction Code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License

Visual feature extraction #7

Closed by nqx12348 1 year ago

nqx12348 commented 1 year ago

Hi @linjieli222,

Thanks for your awesome work! I extracted ResNet and SlowFast features for TVQA using this codebase, but the features I get are quite different from the ones I downloaded from HERO (https://convaisharables.blob.core.windows.net/hero/video_db/tv.tar). When I train HERO and CONQUER on my extracted features, the results are considerably lower than with the downloaded features, and I can't figure out why.

Specifically, I extracted the ResNet and SlowFast features with the commands you provided, changing only clip_len to 3/2. For SlowFast:

    python extract_feature/extract.py --dataflow --csv /output/csv/slowfast_info.csv \
        --batch_size 45 --num_decoding_thread 4 --clip_len 3/2 \
        TEST.CHECKPOINT_FILE_PATH /models/SLOWFAST_8x8_R50.pkl

For ResNet:

    python extract.py --csv /output/csv/resnet_info.csv --num_decoding_thread 4 --clip_len 3/2

After extraction, I compared my features with the downloaded ones by computing the cosine similarity for each frame. The similarity is around 0.91 for the ResNet features and around 0.85 for the SlowFast features.
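For reference, the comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function names (`cosine_similarity`, `per_clip_similarity`, `summarize`) are hypothetical, the features are assumed to be plain sequences of floats with one vector per frame/clip, and truncating to the shorter sequence when the counts differ is an assumption made here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("feature dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def per_clip_similarity(extracted, downloaded):
    """Compare two feature sequences position by position.

    Assumption: if the two sequences have different lengths, only the
    overlapping prefix is compared.
    """
    n = min(len(extracted), len(downloaded))
    return [cosine_similarity(extracted[i], downloaded[i]) for i in range(n)]


def summarize(sims):
    """Mean, minimum, and maximum of a non-empty list of similarities."""
    if not sims:
        raise ValueError("no similarities to summarize")
    return {"mean": sum(sims) / len(sims), "min": min(sims), "max": max(sims)}


if __name__ == "__main__":
    # Toy vectors standing in for real features (hypothetical values).
    mine = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
    theirs = [[1.0, 0.1, 1.9], [0.4, 0.6, 0.1]]
    print(summarize(per_clip_similarity(mine, theirs)))
```

Looking at the minimum as well as the mean may help tell a uniform offset (for example, a preprocessing difference) apart from a few badly misaligned clips.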

I have some questions:

  1. When extracting the TVQA features, did you use the raw videos or low-frame-rate videos as input? I can't find the raw TVQA videos online, so I used the extracted 3 fps frames as input, and I noticed that this codebase uses ffmpeg to interpolate the video to 30 fps by default.
  2. Apart from the arguments specified in the commands above, did you use the default settings for the ResNet/SlowFast feature extractors? For example, --pix_fmt='rgb24', --half_precision=1, and l2_normalize=1 are the defaults. (In particular, my TVQA 3 fps videos are in 'yuvj420p' format; I tried --pix_fmt='yuv420p' but got similar results. I don't know whether ffmpeg automatically converts the format when reading the video.)
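As a rough sanity check on two of the defaults named in question 2, the sketch below estimates how much the cosine similarity could change from half-precision rounding of a stored feature vector, and confirms that L2 normalization by itself does not change cosine similarity at all. It rests on stated assumptions: the vectors are random stand-ins rather than real features, the helper names (`to_half`, `l2_normalize`, `cosine`, `rounding_effect`) are hypothetical, and only rounding of the final vector is modeled. Running the whole network in half precision, which is what a flag such as --half_precision presumably does, may differ by more than this.

```python
import math
import random
import struct


def to_half(x):
    """Round a float to IEEE 754 half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def l2_normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def rounding_effect(dim=2048, seed=0):
    """Cosine similarity between a random vector and its fp16-rounded copy.

    Assumption: a Gaussian random vector of length `dim` stands in for a
    real feature; values stay well inside the half-precision range.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return cosine(v, [to_half(x) for x in v])


if __name__ == "__main__":
    print("fp16 rounding only:", rounding_effect())

    rng = random.Random(1)
    a = [rng.gauss(0.0, 1.0) for _ in range(16)]
    b = [rng.gauss(0.0, 1.0) for _ in range(16)]
    # Cosine similarity ignores vector scale, so normalizing both inputs
    # should leave it unchanged up to floating-point error.
    print("raw:", cosine(a, b))
    print("normalized:", cosine(l2_normalize(a), l2_normalize(b)))
```

If output rounding alone leaves the similarity very close to 1, a gap as large as 0.85 to 0.91 more likely comes from the inputs (frame rate, decoding, pixel format, clip boundaries) or the model weights than from the storage precision or normalization settings, though this sketch cannot confirm that for the real pipeline.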

Or can you think of any other possible reasons for this? I used an RTX 3090 GPU to extract the features. I have tried changing the settings many times and still can't get features with high similarity to the downloaded ones. If you have any ideas, please reply soon; this is very important to me and I would be very grateful!

Best, nqx12348