Hi @blue-blue272, we obtain captions for AudioSet in two ways:
1. We crawl the titles of the corresponding YouTube videos. These titles may or may not describe the audio, so we use MS CLAP to filter out pairs where the title is not aligned with the audio. After filtering, roughly 400k audio-text pairs remain.
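The CLAP filtering step above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `embed_audio` and `embed_text` are hypothetical placeholders standing in for a real CLAP model's audio and text encoders, and the similarity threshold value is an assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(pairs, embed_audio, embed_text, threshold=0.3):
    """Keep only (audio, title) pairs whose CLAP audio/text embedding
    similarity meets `threshold`; drop misaligned pairs."""
    kept = []
    for audio, title in pairs:
        sim = cosine_similarity(embed_audio(audio), embed_text(title))
        if sim >= threshold:
            kept.append((audio, title))
    return kept
```

In practice the embeddings would come from a pre-trained CLAP checkpoint, and the threshold would be tuned so that clearly unrelated titles fall below it.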
2. For the rest of AudioSet, we use the K2C (keyword-to-caption) augmentation proposed in the LAION-Audio-630K paper (https://arxiv.org/abs/2211.06687) to generate captions from the event labels.
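To illustrate the idea behind keyword-to-caption: the LAION paper uses a pre-trained language model to turn keywords into fluent captions, whereas the template below is only a simplified stand-in showing how AudioSet event labels could map to a caption sentence.

```python
def keywords_to_caption(labels):
    """Turn a list of event labels (keywords) into a single caption string.
    A template-based toy version of K2C; the real method uses a language model."""
    labels = [label.lower() for label in labels]
    if len(labels) == 1:
        body = labels[0]
    else:
        body = ", ".join(labels[:-1]) + " and " + labels[-1]
    return f"The sound of {body}."
```

For example, the labels `["Dog", "Bark"]` would become "The sound of dog and bark.", which a language-model-based K2C would further rewrite into a natural sentence.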
> AudioSet only contains audio and event labels. How do you obtain the caption descriptions for the audio clips in the AudioSet dataset?