Hi @blue-blue272, we obtain captions for AudioSet in two ways:
1. We crawl the titles of the corresponding YouTube videos. These titles may or may not describe the audio, so we use MS CLAP to filter out pairs where the title is not aligned with the audio. After filtering, roughly 400k audio-text pairs remain.
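The CLAP filtering step above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `embed_audio` and `embed_text` are hypothetical placeholders standing in for a real CLAP model's audio and text encoders, and the similarity threshold value is an assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(pairs, embed_audio, embed_text, threshold=0.3):
    """Keep only (audio, title) pairs whose CLAP audio/text embedding
    similarity meets `threshold`; drop misaligned pairs."""
    kept = []
    for audio, title in pairs:
        sim = cosine_similarity(embed_audio(audio), embed_text(title))
        if sim >= threshold:
            kept.append((audio, title))
    return kept
```

In practice the embeddings would come from a pre-trained CLAP checkpoint, and the threshold would be tuned so that clearly unrelated titles fall below it.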
2. For the rest of AudioSet, we use the K2C (keyword-to-caption) augmentation proposed in the LAION-Audio-630K paper (https://arxiv.org/abs/2211.06687) to generate captions from the event labels.
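To illustrate the idea behind keyword-to-caption: the LAION paper uses a pre-trained language model to turn keywords into fluent captions, whereas the template below is only a simplified stand-in showing how AudioSet event labels could map to a caption sentence.

```python
def keywords_to_caption(labels):
    """Turn a list of event labels (keywords) into a single caption string.
    A template-based toy version of K2C; the real method uses a language model."""
    labels = [label.lower() for label in labels]
    if len(labels) == 1:
        body = labels[0]
    else:
        body = ", ".join(labels[:-1]) + " and " + labels[-1]
    return f"The sound of {body}."
```

For example, the labels `["Dog", "Bark"]` would become "The sound of dog and bark.", which a language-model-based K2C would further rewrite into a natural sentence.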
> AudioSet only contains audio and event labels. How do you obtain the caption descriptions for the audio clips in the AudioSet dataset?