microsoft / CLAP

Learning audio concepts from natural language supervision
MIT License

How to reproduce label predictions on the VocalSound and ESC-50 datasets? #29

Closed: splinter21 closed this issue 9 months ago

splinter21 commented 9 months ago

I ran clap_model.generate_caption and the output is a sentence. How can I generate a predicted label instead, such as "laughter"?

mlxu995 commented 2 months ago

> I ran clap_model.generate_caption and the output is a sentence. How can I generate a predicted label instead, such as "laughter"?

@splinter21 I ran into the same problem. Did you solve it?

bmartin1 commented 2 months ago

Hi, if you use CLAP's "Audio Captioning" capability, you will always get a sentence rather than a single word. You could use an NLP toolkit to extract individual words, such as nouns or verbs, from the sentence. If you use CLAP's "Zero Shot Classification" capability instead, you have to provide the candidate classes up front, and the prediction is the candidate class that best matches the audio.
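
For the zero-shot route, here is a minimal sketch of turning similarity scores into a label. The helpers `build_prompts` and `predict_labels` are hypothetical names made up for this example, and the class names and scores are toy values. The msclap calls in the comment follow my reading of the repository README and are not executed here, so please check them against the version you have installed.

```python
from typing import List, Sequence


def build_prompts(class_names: Sequence[str],
                  template: str = "this is a sound of {}") -> List[str]:
    """Wrap each candidate class in a text prompt (the template is an assumption)."""
    return [template.format(name) for name in class_names]


def predict_labels(similarity: Sequence[Sequence[float]],
                   class_names: Sequence[str]) -> List[str]:
    """Pick the best-scoring class for each audio clip.

    `similarity` has one row per audio clip and one column per class,
    in the same order as `class_names`.
    """
    labels = []
    for row in similarity:
        if len(row) != len(class_names):
            raise ValueError("each row needs one score per class name")
        # Index of the highest score in this row.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(class_names[best])
    return labels


# Assumed usage with msclap (based on the README, not run here):
#   from msclap import CLAP
#   clap_model = CLAP(version="2023", use_cuda=False)
#   text_emb = clap_model.get_text_embeddings(build_prompts(class_names))
#   audio_emb = clap_model.get_audio_embeddings(["some_clip.wav"])
#   similarity = clap_model.compute_similarity(audio_emb, text_emb)
#   print(predict_labels(similarity.detach().cpu().tolist(), class_names))

if __name__ == "__main__":
    class_names = ["laughter", "cough", "sneeze"]  # illustrative classes
    toy_similarity = [[0.1, 2.3, -0.4],            # made-up scores, clip 1
                      [1.7, 0.2, 0.9]]             # made-up scores, clip 2
    print(predict_labels(toy_similarity, class_names))  # ['cough', 'laughter']
```

To reproduce a benchmark number, you would pass that dataset's full class list as `class_names` and compare the predicted labels with the ground truth. The exact prompt template used in the paper is not stated in this thread, so the one above is only a placeholder.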