microsoft / CLAP

Learning audio concepts from natural language supervision
MIT License

Zero shot Results of ESC50 #8

Closed xinkez closed 1 year ago

xinkez commented 1 year ago

Hi,

I notice that in the paper, when both the audio encoder and text encoder are unfrozen, the ESC50 accuracy reported in Table 2 is 0.826. However, in this README, the accuracy of zero-shot evaluation on the ESC50 dataset is also 0.826. What is the difference between the two numbers? Thank you in advance.

soham97 commented 1 year ago

Hi @xinkez, CLAP is trained on 128k audio-text pairs in the first stage. In this stage, both the audio and text encoders are unfrozen and learned. This pretrained CLAP is then used to perform zero-shot evaluation on downstream datasets, for example, ESC50.
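For reference, zero-shot evaluation with a CLAP-style model boils down to embedding the audio and the candidate class prompts into a shared space and picking the class with the highest cosine similarity. The sketch below illustrates that idea with random placeholder embeddings (the labels, dimensions, and helper function are hypothetical, not the released model's API):

```python
import numpy as np

def zero_shot_classify(audio_emb, text_embs, labels):
    """Return the label whose text embedding is most cosine-similar to the audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a  # cosine similarity between the audio clip and each class prompt
    return labels[int(np.argmax(sims))]

# Placeholder embeddings standing in for CLAP's audio/text encoder outputs
labels = ["dog", "rain", "siren"]
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 1024))
# Simulate an audio clip whose embedding lies close to the "rain" prompt
audio_emb = text_embs[1] + 0.01 * rng.normal(size=1024)
print(zero_shot_classify(audio_emb, text_embs, labels))  # → rain
```

In practice the embeddings would come from the pretrained audio and text encoders, and the class prompts are typically templated, e.g. "this is a sound of {label}".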

Table 2 shows how different CLAP pretraining configurations affect downstream performance. The table shows that unfreezing both the audio and text encoders leads to higher downstream performance, specifically an ESC50 accuracy of 82.6%. Therefore, the released CLAP weights belong to the CLAP pretrained with unfrozen audio and text encoders, and the README reflects the same 0.826 number. Hope this helps!