google-research / scenic

Scenic: A Jax Library for Computer Vision Research and Beyond
Apache License 2.0
3.24k stars 426 forks

[MBT] Input Data Format for AudioSet #510

Open nku-zhichengzhang opened 1 year ago

nku-zhichengzhang commented 1 year ago

Hello,

I'm working on reproducing the results in your paper "Attention Bottlenecks for Multimodal Fusion" and trying to implement MBT for other audiovisual video classification tasks.

However, preprocessing the datasets (e.g. AudioSet, Kinetics-Sounds) is non-trivial, even with the examples provided for ViViT. The most confusing part is extracting the audio (i.e. the spectrogram). The recommended DMVR script ("DMVR/examples/generate_from_file.py") extracts all-zero audio signals. Besides, the extracted audio is not in spectrogram form. Are there details I missed?

image

Could you kindly share a preprocessing example for audio-visual datasets? Thanks.

yangjiangeyjg commented 1 year ago

Good question!

huangfei00 commented 7 months ago

I encountered the same problem. Did you manage to solve it? Also, could you provide some reference code for preprocessing the AudioSet dataset?
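
For anyone landing here with the same question, below is a minimal, unofficial sketch of turning a mono waveform into a log-mel spectrogram with NumPy only. It is not the authors' pipeline and does not use DMVR or Scenic. The parameters (16 kHz sample rate, 25 ms Hamming window, 10 ms hop, 128 mel bins) are assumptions based on how the MBT paper describes its audio input and should be checked against the paper and the model config. `n_fft=512`, the `1e-6` log offset, and all function names are illustrative choices. Decoding the audio track into a float array (e.g. with ffmpeg) is not shown. The function raises on an all-zero waveform, so a failed audio extraction like the one described above is caught early.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (one common convention; an assumption here).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(waveform, sample_rate=16000, win_ms=25.0,
                        hop_ms=10.0, n_mels=128, n_fft=512):
    """Returns an array of shape (n_frames, n_mels) from a 1-D mono waveform."""
    waveform = np.asarray(waveform, dtype=np.float64)
    if waveform.ndim != 1:
        raise ValueError("expected a 1-D mono waveform")
    if not np.any(waveform):
        raise ValueError("waveform is all zeros; check the audio decoding step")
    win = int(round(sample_rate * win_ms / 1000.0))  # 400 samples at 16 kHz
    hop = int(round(sample_rate * hop_ms / 1000.0))  # 160 samples at 16 kHz
    if len(waveform) < win:
        raise ValueError("waveform is shorter than one analysis window")
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(waveform) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hamming(win)
    # Power spectrum per frame (frames are zero-padded to n_fft by rfft).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-6)
```

Usage: for one second of 16 kHz audio stored in a float array `wav`, `log_mel_spectrogram(wav)` returns an array with one row per 10 ms frame and 128 columns. How the result is then serialized for the data loader depends on the dataset config and is not covered by this sketch.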