microsoft / Pengi

An Audio Language model for Audio Tasks
https://arxiv.org/abs/2305.11834
MIT License

How to get the likelihood from the model? #15

Closed jasonppy closed 3 days ago

jasonppy commented 2 months ago

Hi authors,

I'm trying to get the likelihood from Pengi on (audio, question, answer) tuples, but haven't been able to do so. Is it possible to get some help on this?

I think this forward function probably computes the loss: https://github.com/microsoft/Pengi/blob/main/models/pengi.py#L174, where audio should be the output of preprocess_audio, texts_enc should be the output of running preprocess_text on the question, and texts_dec should be the output of running preprocess_text on the answer. However, I wasn't able to get the loss from the output. Even when I pass label = texts_dec['input_ids'] (https://github.com/microsoft/Pengi/blob/main/models/decoder.py#L219), I still get dimension errors when the cross_entropy loss is computed.
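For reference, here's roughly what I'm running. This is just a sketch; wrapper and audio_paths are placeholders for however the preprocessing helpers and inputs are set up on my end:

```python
import torch

# Sketch of my attempt. `wrapper` is a placeholder for whatever object
# exposes the preprocess_audio / preprocess_text helpers; `audio_paths`,
# `questions`, `answers` are my own inputs.
audio = wrapper.preprocess_audio(audio_paths)    # preprocessed audio tensor
texts_enc = wrapper.preprocess_text(questions)   # tokenized questions
texts_dec = wrapper.preprocess_text(answers)     # tokenized answers

outputs = model(audio, texts_enc, texts_dec)
# Passing label = texts_dec['input_ids'] into the decoder still raises a
# dimension error inside the cross_entropy computation.
```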

Your help is greatly appreciated.

Best, Puyuan

soham97 commented 2 months ago

Hi @jasonppy, the loss computation during training looks like this:

  1. outputs = model(audios, texts_enc, texts_dec) where model is the PENGI model, audios is a float32 tensor, and texts_enc and texts_dec are the tokenized text input and text output.
  2. logits = outputs.logits[:, total_prefix_length - 1: -1] Remove the outputs corresponding to the total prefix length. This equals the length of the audio projection plus the length of the input text.
  3. loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), texts_dec['input_ids'].flatten(), ignore_index=0) Compute the cross entropy per token and average. Replace 0 with whichever token index is used for padding.

For texts_dec in step 1, make sure to prepend ones (as many as the total prefix length) to the attention mask of the tokenized text. See the sketch below for how the pieces fit together.
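Putting the steps together, a minimal sketch. total_prefix_length and the padding index 0 are assumptions here; use the values from your config, and note that audios, texts_enc, and texts_dec are prepared as in your snippet:

```python
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Prepend ones for the prefix (audio projection + input text) to the
    # decoder attention mask, as noted above.
    mask = texts_dec['attention_mask']
    prefix_mask = torch.ones(mask.shape[0], total_prefix_length,
                             dtype=mask.dtype, device=mask.device)
    texts_dec['attention_mask'] = torch.cat([prefix_mask, mask], dim=1)

    # Step 1: forward pass.
    outputs = model(audios, texts_enc, texts_dec)

    # Step 2: drop the logits that correspond to the prefix; what remains
    # predicts the answer tokens.
    logits = outputs.logits[:, total_prefix_length - 1: -1]

    # Step 3: per-token cross entropy, averaged; ignore_index=0 skips
    # padding (swap in your padding token index).
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           texts_dec['input_ids'].flatten(),
                           ignore_index=0)

    # For the total log-likelihood of the answer given (audio, question),
    # sum instead of averaging and negate.
    nll = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                          texts_dec['input_ids'].flatten(),
                          ignore_index=0, reduction='sum')
    log_likelihood = -nll
```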