ttgeng233 / UnAV

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)
https://unav100.github.io
MIT License

Vggish Feature size #7

Closed 1980x closed 5 months ago

1980x commented 5 months ago

Hi. I am trying to extract visual and audio features from raw video clips. For visual features, running
python main.py stack_size=24 step_size=8 extraction_fps=25 feature_type=i3d
gives feature dimensions that match the ones you shared. E.g., it gives 112x1024 RGB and flow features, which match yours.

For audio features, however, python main.py feature_type=vggish produces features that don't match the ones you shared, both with and without first converting the video to 25 fps. E.g., it gives only a 32x128 feature. Could you please tell me what needs to be changed so that I get the same 112x128 audio feature?

Thank you.

1980x commented 5 months ago

The visual and audio features you shared have the same first dimension. Is that necessary?

ttgeng233 commented 5 months ago

Yes, I kept the VGGish features the same length as the corresponding visual features. You need to change the stride when extracting VGGish features by changing "EXAMPLE_HOP_SECONDS = 0.96" to 0.32 in https://github.com/v-iashin/video_features/blob/master/models/vggish/vggish_src/vggish_params.py. This is because the visual features use fps=25, window_size=24 and stride=8 (in frames), which corresponds to window_size=0.96 s and stride=0.32 s.
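
For anyone checking the arithmetic, here is a minimal sketch of the frame-to-seconds conversion described above. The helper names (frames_to_seconds, num_windows) are hypothetical and are not part of video_features or VGGish. The window count uses a plain sliding-window formula in integer frames, so it only illustrates why a smaller hop yields more feature rows; the exact number of VGGish rows depends on how that library frames the audio.

```python
def frames_to_seconds(n_frames: int, fps: float) -> float:
    """Convert a length given in video frames to seconds."""
    return n_frames / fps


def num_windows(total: int, window: int, hop: int) -> int:
    """Count full sliding windows of size `window` taken every `hop` units.

    All arguments share one integer unit (here: frames), which avoids
    floating-point rounding in the floor division.
    """
    if total < window:
        return 0
    return (total - window) // hop + 1


# Values from this thread: I3D uses 24-frame stacks with an 8-frame step at 25 fps.
fps = 25
window_s = frames_to_seconds(24, fps)  # window length in seconds
hop_s = frames_to_seconds(8, fps)      # value to use for EXAMPLE_HOP_SECONDS

print(f"window = {window_s:.2f} s, hop = {hop_s:.2f} s")

# Hypothetical clip length of 912 frames, chosen only for illustration.
print(num_windows(912, window=24, hop=8))   # hop of 8 frames (0.32 s)
print(num_windows(912, window=24, hop=24))  # hop of 24 frames (0.96 s)
```

With the 8-frame hop, the illustrative 912-frame clip gives 112 windows, while the 24-frame hop gives only 38, which is the kind of mismatch reported above.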

1980x commented 5 months ago

Thanks. This worked.