Zero shot Results of ESC50

Hi @xinkez, CLAP is trained on 128k audio-text pairs in the first stage. In this stage, both the audio and text encoders are unfrozen and learned. The pretrained CLAP is then used for zero-shot evaluation on downstream datasets, for example ESC50.
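In case it helps to see the zero-shot step concretely, here is a minimal sketch of how the evaluation can be scored once you have embeddings. It is not the repository's code: `zero_shot_predict`, `accuracy`, and the toy vectors are hypothetical names and data for illustration, and it assumes each class is represented by one text embedding (for example, from a prompt built around the class name) and each clip by one audio embedding from the pretrained encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(audio_embeddings, text_embeddings):
    """Assign each audio clip to the class whose text embedding is most similar.

    Both arguments are lists of equal-length float vectors. In a real run they
    would come from the pretrained audio and text encoders (assumption: one
    text embedding per class label).
    """
    text_unit = [l2_normalize(t) for t in text_embeddings]
    predictions = []
    for audio in audio_embeddings:
        a = l2_normalize(audio)
        # Cosine similarity between this clip and every class embedding.
        sims = [sum(x * y for x, y in zip(a, t)) for t in text_unit]
        predictions.append(max(range(len(sims)), key=sims.__getitem__))
    return predictions


def accuracy(predictions, labels):
    """Fraction of clips whose predicted class matches the ground truth."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Toy stand-ins for encoder outputs: 3 classes, 4 clips, 3 dimensions.
    class_text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    clip_audio_emb = [[2.0, 0.1, 0.0], [0.2, 3.0, 0.0],
                      [0.0, 0.1, 1.0], [1.0, 0.9, 0.0]]
    clip_labels = [0, 1, 2, 1]

    preds = zero_shot_predict(clip_audio_emb, class_text_emb)
    print("predictions:", preds)
    print("accuracy:", accuracy(preds, clip_labels))
```

No downstream training happens here: the class labels enter only through the text embeddings, which is what makes the evaluation zero-shot.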

Table 2 shows how different CLAP pretraining configurations affect downstream performance. Unfreezing both the audio and text encoders gives the highest downstream performance, specifically 82.6% on ESC50. The released CLAP weights therefore come from the model pretrained with both encoders unfrozen, and the README reports the same number (0.826). Hope this helps!

