Closed iamgroot42 closed 1 year ago
cc @Vaibhavs10
I didn't realize that the NLL loss ignores `-100` labels in its loss computation (I thought it was an arbitrary value used for padding), which explains the difference in loss values when using the pad token instead. My bad!
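A minimal sketch of the behavior described above, using PyTorch's `F.cross_entropy` (whose `ignore_index` defaults to `-100`); the toy logits, vocabulary size, and the pad token id of `5` here are illustrative assumptions, not values from the tutorial:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
# Toy logits for a batch of 1 sequence of length 4: (batch, seq_len, vocab)
logits = torch.randn(1, 4, vocab_size)

# Same target sequence, padded two different ways at the last two positions:
labels_ignored = torch.tensor([[3, 7, -100, -100]])  # -100 -> excluded from the loss
labels_padded = torch.tensor([[3, 7, 5, 5]])         # assumed pad_token_id=5 -> counted in the loss

# cross_entropy expects (N, C, ...) so move the vocab dim to position 1.
# ignore_index defaults to -100, so the -100 positions contribute nothing.
loss_ignored = F.cross_entropy(logits.transpose(1, 2), labels_ignored)
loss_padded = F.cross_entropy(logits.transpose(1, 2), labels_padded)

print(loss_ignored.item(), loss_padded.item())  # the two losses differ
```

The `-100` version averages the loss over only the two real tokens, while the pad-token version also scores the model against the pad positions, which is why the two padding choices produce different loss values.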
Hi,
Great tutorial! I had a question regarding the data-processing step for this tutorial, where the label tokens are padded with `-100` (blank space) before being passed on to the model. Upon running the debugger, I see that the model makes correct predictions but predicts `tokenizer.pad_token_id` (which corresponds to `50256` for Whisper), which leads to different losses depending on what value this padding is done with. Should the padding not correspond to `50256`, rather than `-100`?
One of the comments said `# replace padding with -100 to ignore loss correctly`, but doing so actually yields a higher loss for a prediction that is correct (before fine-tuning has even begun) but has pad token IDs at the end instead of `-100`, as expected in the output tensor.