tgc1997 / RMN

IJCAI2020: Learning to Discretely Compose Reasoning Module Networks for Video Captioning
79 stars 12 forks

How to get my own extracted-features? #5

Closed: takuyara closed this issue 3 years ago

takuyara commented 3 years ago

Hi tgc, I'd like to test this model on my own video. How could I get the extracted features as inputs?

tgc1997 commented 3 years ago

Hi takuyara, we extracted our features using I3D, InceptionResNetV2, and BUTD.

takuyara commented 3 years ago

Thanks for your quick reply!

PipiZong commented 3 years ago

> Hi takuyara, we extracted our features using I3D, InceptionResNetV2, and BUTD.

Hi tgc,

I went through the I3D code you provided above. How should `max_interval` and `overlap` be set to get 26 equally spaced features for each video? Or is there no need to set these two parameters, and should we just extract around 209 frames as input to the I3D model?

tgc1997 commented 3 years ago

> Hi takuyara, we extracted our features using I3D, InceptionResNetV2, and BUTD.
>
> Hi tgc,
>
> I went through the I3D code you provided above. How should `max_interval` and `overlap` be set to get 26 equally spaced features for each video? Or is there no need to set these two parameters, and should we just extract around 209 frames as input to the I3D model?

We first set max_interval=64 and overlap=8 to extract features, then uniformly sample 26 of them.
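The scheme described in this answer can be sketched as follows (the helper names are hypothetical; the authors' exact extraction script is not shown in this thread). Frames are cut into clips of at most `max_interval` frames, with `overlap` frames shared between consecutive clips, and 26 of the resulting per-clip features are then sampled uniformly:

```python
import math

def split_into_clips(n_frames, max_interval=64, overlap=8):
    """Return (start, end) frame ranges of the overlapping clips."""
    step = max_interval - overlap  # 56 new frames per clip
    starts = range(0, max(n_frames - overlap, 1), step)
    return [(s, min(s + max_interval, n_frames)) for s in starts]

def sample_uniform(n_items, n_samples=26):
    """Pick n_samples equally spaced indices from range(n_items)."""
    if n_items <= n_samples:
        return list(range(n_items))
    return [int(math.ceil(i * n_items / n_samples)) for i in range(n_samples)]
```

For example, a 200-frame video yields clips (0, 64), (56, 120), (112, 176), (168, 200); each clip would go through I3D, and `sample_uniform` then selects the final 26 per-clip features (or all of them, when fewer than 26 clips exist).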

PipiZong commented 3 years ago

> We first set max_interval=64, overlap=8 to extract features and then sample 26 of them.

Hi tgc,

Thanks for your reply! Sorry to bother you again; I still have two questions:

  1. Why do you need to set `max_interval` and `overlap`? If you just input 209 frames as the "clip", you get exactly a 1x26x1024 feature from `features = get_features(clip, i3d_rgb)`. Are these two parameters only used to save computation cost? If not, how do you determine them specifically?
  2. For each video, the 2D features (irv2, 1x26x1536) are extracted from 26 frames (images), while the I3D features are extracted from 26 segments, which are not the same as the 26 frames. Is it reasonable to concatenate these two features along this dimension (26)? For example, 1x26x2560 cannot be interpreted as "one video has 26 frames, each frame has 2560 features".

tgc1997 commented 3 years ago

> Hi tgc,
>
> Thanks for your reply! Sorry to bother you again; I still have two questions:
>
>   1. Why do you need to set `max_interval` and `overlap`? If you just input 209 frames as the "clip", you get exactly a 1x26x1024 feature from `features = get_features(clip, i3d_rgb)`. Are these two parameters only used to save computation cost? If not, how do you determine them specifically?
>   2. For each video, the 2D features (irv2, 1x26x1536) are extracted from 26 frames (images), while the I3D features are extracted from 26 segments, which are not the same as the 26 frames. Is it reasonable to concatenate these two features along this dimension (26)? For example, 1x26x2560 cannot be interpreted as "one video has 26 frames, each frame has 2560 features".

  1. Some videos may have fewer than 209 frames, and for long videos with far more than 209 frames, taking only 209 may miss important information. These two parameters are also used to save computation cost.
  2. It is difficult to perfectly align 2D and 3D features; if you know how to do it, you are welcome to comment.
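The concatenation itself is straightforward despite the imperfect alignment just noted; a minimal numpy sketch using the shapes from this discussion:

```python
import numpy as np

# 2D (irv2) and 3D (i3d) features share the length-26 axis and are
# concatenated along the feature dimension, even though the 26 frames
# and 26 segments are not perfectly aligned in time.
irv2 = np.random.rand(1, 26, 1536)  # 26 sampled frames
i3d = np.random.rand(1, 26, 1024)   # 26 sampled clip features
fused = np.concatenate([irv2, i3d], axis=-1)
print(fused.shape)  # (1, 26, 2560)
```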
PipiZong commented 3 years ago

Hi tgc,

> 1. Some videos may have fewer than 209 frames, and for long videos with far more than 209 frames, taking only 209 may miss important information. These two parameters are also used to save computation cost.
> 2. It is difficult to perfectly align 2D and 3D features; if you know how to do it, you are welcome to comment.

Thanks for your explanations!

PipiZong commented 3 years ago

Hi tgc,

> 1. Some videos may have fewer than 209 frames, and for long videos with far more than 209 frames, taking only 209 may miss important information. These two parameters are also used to save computation cost.
> 2. It is difficult to perfectly align 2D and 3D features; if you know how to do it, you are welcome to comment.

Sorry to bother you again! I am still confused about how to determine `max_interval` and `overlap`. Could you give an example, say with two videos, one 10 minutes long at 25 FPS and the other 8 minutes long at 30 FPS? Many thanks!
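For reference, with the max_interval=64, overlap=8 setting given earlier in the thread, the clip layout for two such videos can be computed directly; FPS only affects the total frame count. This is a sketch, not the authors' script:

```python
def n_clips(n_frames, max_interval=64, overlap=8):
    """Number of overlapping clips covering n_frames."""
    step = max_interval - overlap                    # 56 new frames per clip
    return max(1, -(-(n_frames - overlap) // step))  # ceiling division

video_a = 10 * 60 * 25  # 10 min at 25 FPS -> 15000 frames
video_b = 8 * 60 * 30   # 8 min at 30 FPS  -> 14400 frames
print(n_clips(video_a), n_clips(video_b))  # 268 257
```

Both videos yield far more than 26 clip features, so in either case 26 of them would be sampled uniformly afterwards.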

monic7 commented 3 years ago

Hi, I have tried to extract features using I3D, IRV2, and BUTD as you mentioned, but I am not able to get the same features as you. The features I obtain seem very different from those in the provided h5 file...

How to get the same features as produced in the h5 file?

How were the equally spaced frames selected? Is it by the following method: `index = [int(ceil(i*len(l)/26)) for i in range(26)]`? Are the equally spaced frames only needed for irv2 and BUTD, while I3D takes the whole video as input to generate 43x1024 for MSVD videos, which is then equally spaced by the same method as above?
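The quoted index formula can be checked directly; a small sketch, assuming `l` is the list of per-frame (or per-clip) items to sample from:

```python
from math import ceil

def equally_spaced(l, k=26):
    """Select k equally spaced indices into l, as in the quoted formula."""
    return [int(ceil(i * len(l) / k)) for i in range(k)]

idx = equally_spaced(list(range(43)))  # e.g. 43 I3D clip features (MSVD)
print(len(idx), idx[0], idx[-1])  # 26 0 42
```

Note that when `len(l) == 26`, the formula returns indices 0..25 unchanged, so it degenerates to the identity selection.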

May I know what other steps are required during extraction? Thank you very much!