microsoft / Pengi

An Audio Language Model for Audio Tasks
https://arxiv.org/abs/2305.11834
MIT License

Results interpretation #7

Closed · alexnevsky13112000 closed this issue 10 months ago

alexnevsky13112000 commented 10 months ago

Hi! I have a question regarding the Pengi framework. I tried to use your model for emotion detection. I apologize in advance if this is an obvious question, but could you please tell me how to interpret the results returned by the .generate method of the PengiWrapper class? As output, I get a list of possible emotions and a tensor of negative values sorted in descending order.

[screenshot: generate output listing candidate emotions with negative scores]

I used an audio clip from the "Happy" class as an example and got the results in the screenshot above. How should I interpret them? I would really appreciate your help.

soham97 commented 10 months ago

Hi @alexnevsky13112000, the model works as a captioning model, i.e., given an audio file and a prompt, it generates a textual description. We use beam search decoding to generate the text output.

The output that you pasted above is the beam search decoding output. The number of outputs is equal to the beam_size you set: if beam_size is 5, you get 5 text outputs in decreasing order of beam score. The scores are log-probability based, which is why they are negative. The beam score does not correspond to a probability over your classes.
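For reference, here is a minimal sketch of a generate call along the lines of the README usage; the audio path and text prompt below are placeholders:

```python
from wrapper import PengiWrapper as Pengi

pengi = Pengi(config="base")  # "base" or "base_no_text_enc"

generated_response = pengi.generate(
    audio_paths=["happy_example.wav"],    # placeholder file
    text_prompts=["this is a sound of"],  # placeholder prompt
    add_texts=[""],
    max_len=30,
    beam_size=5,         # number of candidate texts returned
    temperature=1.0,
    stop_token=" <|endoftext|>",
)
# Candidates come back in decreasing order of beam score; the first
# one ("happy" in your run) is the model's top prediction.
```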

TL;DR: the way to read the output is that the model predicts "happy". If you want a probability distribution across your pre-defined classes, you need to look at the next-token probabilities during decoding; see the sketch below.
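To make that concrete, one common way to turn a fixed label set into a distribution is to teacher-force each class name through the decoder and sum its token log-probabilities. The sketch below is illustrative, not the wrapper's API: lm, tokenizer, and prefix_embeds are assumptions standing in for Pengi's underlying GPT2-style causal LM, its tokenizer, and the audio+prompt prefix embeddings; the real attribute names inside PengiWrapper may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_log_probs(lm, tokenizer, prefix_embeds, class_names):
    # lm: HuggingFace-style causal LM (hypothetical handle to the decoder)
    # tokenizer: its tokenizer; prefix_embeds: (1, P, D) audio+prompt prefix
    scores = {}
    for name in class_names:
        ids = torch.tensor(
            tokenizer.encode(name), device=prefix_embeds.device
        ).unsqueeze(0)                                            # (1, T)
        tok_embeds = lm.get_input_embeddings()(ids)               # (1, T, D)
        inputs = torch.cat([prefix_embeds, tok_embeds], dim=1)    # (1, P+T, D)
        logits = lm(inputs_embeds=inputs).logits                  # (1, P+T, V)
        P = prefix_embeds.shape[1]
        # The logit at position P-1+t predicts class-name token t.
        log_probs = F.log_softmax(
            logits[:, P - 1 : P - 1 + ids.shape[1]], dim=-1
        )
        scores[name] = log_probs.gather(-1, ids.unsqueeze(-1)).sum().item()
    return scores
```

A softmax over the resulting scores then gives a distribution over the classes, e.g. torch.tensor(list(scores.values())).softmax(dim=0). Note that longer class names accumulate more negative log-probability, so dividing each score by its token count is a common length-normalization tweak.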

Hope this helps.

alexnevsky13112000 commented 10 months ago

Okay, I understand now. Thanks a lot!