v-iashin / video_features

Extract video features from raw videos using multiple GPUs. We support RAFT flow frames as well as S3D, I3D, R(2+1)D, VGGish, CLIP, and TIMM models.
https://v-iashin.github.io/video_features
MIT License

Some extracted audio and video features of the same video have different lengths! #66

Open ttgeng233 opened 2 years ago

ttgeng233 commented 2 years ago

Thanks for the great project! I used the same sampling strategy for the audio data and the video frames: I resample every video to 25 fps and feed 24 frames at a time to I3D to extract one feature, while one audio feature represents a 0.96 s audio clip. However, the features come out with different lengths, e.g., audio with shape (162, 128) and video with shape (165, 1024). The video feature length is correct, but the audio feature length is wrong. How should I deal with this?

v-iashin commented 2 years ago

Hi.

With the information you have provided, it is hard to give recommendations.

About 2% of the features are missing in one modality, so I would just trim both to the shortest sequence (162 in your case).
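The trimming suggested above can be sketched roughly as follows. This is only an illustration, not code from `video_features`: the helper name `trim_to_shortest` is hypothetical, and it assumes both feature sequences start at the same timestamp and use the same 0.96 s step, so that dropping the tail of the longer one keeps the remaining rows aligned.

```python
import numpy as np


def trim_to_shortest(audio_feats, video_feats):
    """Cut both feature arrays to the length of the shorter one.

    Assumption: row i of each array covers the same time window,
    so only trailing rows of the longer array are dropped.
    """
    n = min(len(audio_feats), len(video_feats))
    return audio_feats[:n], video_feats[:n]


# Dummy arrays with the shapes reported in this issue.
audio = np.zeros((162, 128), dtype=np.float32)
video = np.zeros((165, 1024), dtype=np.float32)

audio_t, video_t = trim_to_shortest(audio, video)
print(audio_t.shape, video_t.shape)
```

If the extra rows come from the start of a track rather than the end, trimming the tail would shift the alignment, so that assumption is worth checking first.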

By the way, does this happen for every video you tried or only for some? Can you calculate the ratio of videos for which the shape mismatch occurs? Is this ratio large enough to worry about?
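One possible way to compute that ratio is sketched below. The function name `mismatch_stats` and the example lengths are made up for illustration (only the 162 vs. 165 pair comes from this thread); in practice the pairs would come from the shapes of the saved feature files.

```python
from collections import Counter


def mismatch_stats(lengths):
    """lengths: iterable of (n_audio, n_video) pairs, one per video.

    Returns the fraction of videos whose lengths differ and a histogram
    of the difference (n_video - n_audio).
    """
    diffs = Counter(n_video - n_audio for n_audio, n_video in lengths)
    total = sum(diffs.values())
    mismatched = total - diffs.get(0, 0)
    ratio = mismatched / total if total else 0.0
    return ratio, diffs


# Hypothetical per-video lengths; (162, 165) is the case from this issue.
example = [(162, 165), (100, 100), (50, 49), (80, 80)]
ratio, diffs = mismatch_stats(example)
print(f"mismatch ratio: {ratio:.2f}")
print(dict(diffs))
```

The histogram separates the off-by-one cases from larger gaps, which may be more informative than the ratio alone.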

ttgeng233 commented 2 years ago

I extracted features for 3000+ videos: 6 videos have visual features that are longer than the audio features, and 400+ videos have visual features that are shorter than the audio features. I think the videos whose visual features are 1 shorter than the audio features are reasonable, since one extra frame is needed each time to compute optical flow. But the videos whose visual features are longer than the audio features are abnormal. If I simply trim to the shortest sequence, I'm afraid the two modalities will not align well with each other.
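To make the off-by-one reasoning concrete, here is a minimal sketch of the sliding-window arithmetic. It is not the library's actual implementation. The function names are hypothetical, and the defaults are assumptions: non-overlapping stacks of 24 frames, one extra frame per stack for optical flow, and 16 kHz audio cut into non-overlapping 0.96 s windows.

```python
def expected_video_feats(n_frames, stack_size=24, step_size=24, extra_frames=1):
    """Number of full stacks that fit into n_frames.

    extra_frames=1 models the additional frame assumed to be needed
    for computing optical flow over a stack.
    """
    needed = stack_size + extra_frames
    if n_frames < needed:
        return 0
    return (n_frames - needed) // step_size + 1


def expected_audio_feats(n_samples, sample_rate=16000, window_ms=960):
    """Number of full non-overlapping windows that fit into n_samples."""
    window = sample_rate * window_ms // 1000  # 15360 samples for 0.96 s
    return n_samples // window


# A 57.6 s clip: 1440 frames at 25 fps, 921600 samples at 16 kHz.
n_video = expected_video_feats(1440)
n_audio = expected_audio_feats(921600)
print(n_video, n_audio)
```

Under these assumptions the video track comes out one feature shorter than the audio track whenever the clip length is an exact multiple of 0.96 s, which matches the "1 shorter" case. It does not explain video features being longer than audio features.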

v-iashin commented 2 years ago

I think one track (audio or visual) is slightly longer than the other. Maybe something is accumulating somewhere; it is hard to tell based on the information you are providing.

Does the difference grow as the video gets longer?
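A quick way to check this is to fit a line to the length difference as a function of sequence length. The sketch below uses made-up numbers, and `drift_slope` is a hypothetical helper; only `np.polyfit` is a real NumPy call.

```python
import numpy as np


def drift_slope(n_audio, n_video):
    """Least-squares fit of (n_video - n_audio) against n_audio.

    A slope near zero suggests a constant boundary offset; a clearly
    non-zero slope suggests the gap accumulates with duration.
    """
    n_audio = np.asarray(n_audio, dtype=float)
    n_video = np.asarray(n_video, dtype=float)
    diff = n_video - n_audio
    slope, intercept = np.polyfit(n_audio, diff, deg=1)
    return slope, intercept


# Hypothetical lengths where the gap grows by 1 per 50 audio features.
n_audio = [50, 100, 150, 200]
n_video = [51, 102, 153, 204]
slope, intercept = drift_slope(n_audio, n_video)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A growing gap would point toward a rate mismatch (for example, an effective frame rate or sample rate that differs from the assumed one), while a constant gap would point toward how the first or last window is handled.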