[MBT] Input Data Format for AudioSet

nku-zhichengzhang commented 1 year ago

Hello,

I'm working on reproducing the results in your paper "Attention Bottlenecks for Multimodal Fusion" and trying to implement MBT for other audiovisual video classification tasks.

However, preprocessing the datasets (e.g. AudioSet, Kinetics-Sounds) is non-trivial, even with the examples provided for ViViT. The most confusing part is extracting the audio (i.e. the spectrogram). The recommended DMVR script ("DMVR/examples/generate_from_file.py") extracts all-zero signals for the audio. Besides, the extracted audio is not in spectrogram form. Are there some details I missed?
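For reference, here is a minimal NumPy-only sketch of the two steps in question: checking that a decoded waveform is not all zeros, and turning it into a log-mel spectrogram. It is not the MBT or DMVR pipeline. The function names are hypothetical, and the parameter values (16 kHz mono, 25 ms Hamming window, 10 ms hop, 128 mel bands) are assumptions that should be checked against the paper and the Scenic configs.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(wave, sr=16000, win_ms=25.0, hop_ms=10.0,
                        n_mels=128, eps=1e-6):
    """Hypothetical helper: 1-D mono waveform -> (n_frames, n_mels) log-mel array.

    All default values are assumptions, not confirmed MBT settings.
    """
    wave = np.asarray(wave, dtype=np.float64)
    if wave.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    if not np.any(wave):
        # Catches the all-zero symptom before any features are written out.
        raise ValueError("waveform is all zeros; audio decoding probably failed")

    win = int(round(sr * win_ms / 1000.0))   # samples per window
    hop = int(round(sr * hop_ms / 1000.0))   # samples between frame starts
    if len(wave) < win:
        raise ValueError("waveform is shorter than one analysis window")
    n_fft = 1 << (win - 1).bit_length()      # next power of two >= win

    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(wave) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hamming(win)

    # Power spectrum per frame, projected onto the mel filterbank.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + eps)
```

Calling `log_mel_spectrogram(wave)` on one second of 16 kHz audio returns an array with one row per 10 ms frame and 128 columns. With 128 bands and a 512-point FFT, a few of the lowest filters can be narrower than one FFT bin and come out empty, so those columns hold only `log(eps)`.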

Could you kindly share a preprocessing example for audio-visual datasets? Thanks.

yangjiangeyjg commented 1 year ago

Good question!

huangfei00 commented 7 months ago

I encountered the same problem. Did you manage to solve it? Also, could you provide some reference code for preprocessing the AudioSet dataset?

google-research/scenic, issue #510