Closed iamgroot42 closed 1 year ago
cc @Vaibhavs10
I didn't realize that the NLL loss ignores `-100` labels in its loss computation (I thought it was an arbitrary value used for padding), which explains the difference in loss values when using the pad token instead. My bad!
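A minimal sketch of the behavior described above, using PyTorch's `F.cross_entropy` (whose `ignore_index` defaults to `-100`); the toy logits, vocabulary size, and the pad token id of `5` here are illustrative assumptions, not values from the tutorial:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
# Toy logits for a batch of 1 sequence of length 4: (batch, seq_len, vocab)
logits = torch.randn(1, 4, vocab_size)

# Same target sequence, padded two different ways at the last two positions:
labels_ignored = torch.tensor([[3, 7, -100, -100]])  # -100 -> excluded from the loss
labels_padded = torch.tensor([[3, 7, 5, 5]])         # assumed pad_token_id=5 -> counted in the loss

# cross_entropy expects (N, C, ...) so move the vocab dim to position 1.
# ignore_index defaults to -100, so the -100 positions contribute nothing.
loss_ignored = F.cross_entropy(logits.transpose(1, 2), labels_ignored)
loss_padded = F.cross_entropy(logits.transpose(1, 2), labels_padded)

print(loss_ignored.item(), loss_padded.item())  # the two losses differ
```

The `-100` version averages the loss over only the two real tokens, while the pad-token version also scores the model against the pad positions, which is why the two padding choices produce different loss values.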
Hi,
Great tutorial! I had a question regarding the data-processing step for this tutorial, where the label tokens are padded with `-100` (blank space) before being passed on to the model. Upon running the debugger, I see that the model makes correct predictions but predicts `tokenizer.pad_token_id` (which corresponds to `50256` for Whisper), which leads to different losses depending on what value this padding is done with. Should the padding not correspond to `50256`, rather than `-100`?
One of the comments said `# replace padding with -100 to ignore loss correctly`, but doing so actually yields a higher loss for a prediction that is correct (before fine-tuning has even begun) but has pad token IDs at the end instead of `-100`, as expected in the output tensor.