v-iashin / MDVC

PyTorch implementation of Multi-modal Dense Video Captioning (CVPR 2020 Workshops)
https://v-iashin.github.io/mdvc

Alignment key for the A/V features in the .npy/.hdf5 files #20

Closed: amanchadha closed this issue 2 years ago

amanchadha commented 3 years ago

Hi Vladimir,

Long time no talk :) I was wondering if you could share the code that converts the .npy features (from your VGGish and I3D feature extractors), which you made available to me mid last year, into the .hdf5 format referenced in the MDVC README (Usage). In particular, I am interested in understanding how you "align" the audio and video features (based on the code below).

Questions:

  1. Are the audio and video features aligned by time in the hdf5 file? Is that what T_audio/T_video stand for?
  2. Are D_audio/D_video simply the feature dimensions?
import torch  # needed for the empty-audio fallback below

def load_multimodal_features_from_h5(feat_h5_video, feat_h5_audio, feature_names_list,
                                     video_id, start, end, duration, get_full_feat=False, cs=True):
    supported_feature_names = {'i3d_features', 'c3d_features', 'vggish_features'}
    assert isinstance(feature_names_list, list)
    assert len(feature_names_list) > 0
    assert set(feature_names_list).issubset(supported_feature_names)

    if 'vggish_features' in feature_names_list:
        audio_stack = feat_h5_audio.get(f'{video_id}/vggish_features')

        # some videos don't have audio
        if audio_stack is None:
            print(f'audio_stack is None @ {video_id}')
            audio_stack = torch.empty((0, 128)).float()

        T_audio, D_audio = audio_stack.shape

    if 'i3d_features' in feature_names_list:
        video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
        video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')

        assert video_stack_rgb.shape == video_stack_flow.shape
        T_video, D_video = video_stack_rgb.shape

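        # NB: T_audio is only defined above when 'vggish_features' was also requested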
        if T_video > T_audio:
            video_stack_rgb = video_stack_rgb[:T_audio, :]
            video_stack_flow = video_stack_flow[:T_audio, :]
            T = T_audio
        elif T_video < T_audio:
            audio_stack = audio_stack[:T_video, :]
            T = T_video
        else:
            # or T = T_audio
            T = T_video

        # at this point both stacks should have the same temporal length
        assert audio_stack.shape[0] == video_stack_rgb.shape[0]

Thanks again for your help!

v-iashin commented 3 years ago

Hi 👋 ! Indeed!

I am afraid I don't have the exact snippet that does .mp4 -> .npy -> .hdf5. However, the procedure is quite straightforward. Someone asked about it before, and I wrote one from memory: https://github.com/v-iashin/MDVC/issues/11#issuecomment-645371791. What you need there is the answer to the first question. I didn't run it back then; I just wrote it down in my comment, so please make sure it doesn't fail with errors (check the follow-up comments in that thread for possible bugs). Overall, just extract features with the video_features repo and run that script on top of the output.
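Roughly, the packing step could look like this (a sketch I am writing here, not the original script; it assumes per-video files named '{video_id}_rgb.npy', '{video_id}_flow.npy', and '{video_id}_vggish.npy', so adjust the names to whatever video_features produced for you):

import glob
import os

import h5py
import numpy as np

def pack_video_features(npy_dir, out_path):
    # one .hdf5 file with datasets keyed as '{video_id}/i3d_features/{stream}'
    with h5py.File(out_path, 'w') as hdf5:
        for path in glob.glob(os.path.join(npy_dir, '*_rgb.npy')):
            video_id = os.path.basename(path).replace('_rgb.npy', '')
            hdf5.create_dataset(f'{video_id}/i3d_features/rgb', data=np.load(path))
            hdf5.create_dataset(f'{video_id}/i3d_features/flow',
                                data=np.load(path.replace('_rgb.npy', '_flow.npy')))

def pack_audio_features(npy_dir, out_path):
    # one .hdf5 file with datasets keyed as '{video_id}/vggish_features'
    with h5py.File(out_path, 'w') as hdf5:
        for path in glob.glob(os.path.join(npy_dir, '*_vggish.npy')):
            video_id = os.path.basename(path).replace('_vggish.npy', '')
            hdf5.create_dataset(f'{video_id}/vggish_features', data=np.load(path))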

Are the audio and video features aligned by time in the hdf5 file? Is that what T_audio/T_video stands for?

I am not sure what you mean here. T_audio and T_video are the temporal dimensions of the features. The I3D features are extracted from 24-frame windows of a video sampled at 25 fps, so each feature temporally spans 0.96 (24/25) seconds with no overlap. At the same time, the VGGish features are extracted from 0.96 s audio segments. As you can see, both sequences should have the same temporal length if they are extracted from the same video; therefore, they are aligned. However, sometimes (I don't remember how rare it is) their lengths are not equal. In this case, we just trim the longer one to match the shorter one (see the i3d part in the snippet you provided).
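In other words (a toy numerical sketch, not code from the repo; the 0.96 s step is the only given, and the feature dims below are just illustrative):

import numpy as np

dur = 120.0                       # example video duration in seconds
expected_T = round(dur / 0.96)    # ~125 steps for both modalities

# simulate an off-by-one mismatch and trim to the shorter sequence
video_stack = np.random.rand(125, 1024)   # (T_video, D_video), dims illustrative
audio_stack = np.random.rand(124, 128)    # (T_audio, D_audio)
T = min(video_stack.shape[0], audio_stack.shape[0])
video_stack, audio_stack = video_stack[:T], audio_stack[:T]
assert video_stack.shape[0] == audio_stack.shape[0]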

Is the D_audio/D_video simply the feature dimension?

Yes, it is. For example, the VGGish features are 128-dimensional, which is why the snippet falls back to torch.empty((0, 128)) when a video has no audio.

tkbadamdorj commented 3 years ago

Hi Vladimir,

Do the audio features for each video cover the entire video? Did you filter out the audio segments that are not inside event proposals?

Thank you!

v-iashin commented 3 years ago

Yes, similar to the visual and speech features, the audio is available for the entire video. And yes, we trim each modality to the proposal segment, as shown here:

https://github.com/v-iashin/MDVC/blob/df3b88a8bc10271e9501be41cd77e74d13abf79b/dataset/dataset.py#L67
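In spirit, the trimming maps the segment timestamps to feature indices proportionally, something like this simplified sketch (the exact logic is in dataset.py at the link above):

import numpy as np

def trim_to_segment(feature_stack, start, end, duration):
    # map (start, end) in seconds to row indices of the full-video feature stack
    T = feature_stack.shape[0]
    start_idx = int(T * start / duration)
    end_idx = int(T * end / duration)
    return feature_stack[start_idx:end_idx, :]

feats = np.random.rand(125, 128)                     # e.g. a full-video VGGish stack
segment = trim_to_segment(feats, 10.0, 25.0, 120.0)  # features for a 10-25 s proposal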