This is a relatively hidden bug and took me quite some time to debug :)
Issue
_compute_metrics() in the evaluation loop calculates wrong CER & WER metrics when using a TokenSet whose pad_token_id is not equal to 0. This doesn't affect the loss calculation / training as such, but the metrics logged during training will be wrong and won't match the metrics calculated with model.evaluate() after training.
Reason
Similar to the label_ids, the prediction logits are passed to _compute_metrics() as a matrix padded with -100 values.
Currently the argmax call which maps logits to token IDs converts these -100 values to 0 (at a padded position every logit is -100, so argmax returns index 0). So after the argmax, pred_ids will be 0-padded. For most wav2vec2 models this is not an issue, because their vocab.json assigns the ID 0 to the <pad> token. However, if you use a custom TokenSet for fine-tuning, <pad> will most probably not be mapped to 0, so the obtained "0-padding" values will wrongly correspond to another token.
See relevant code here: https://github.com/jonatasgrosman/huggingsound/blob/main/huggingsound/trainer.py#L599
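To make the failure mode concrete, here is a small self-contained numpy sketch (shapes and values are made up for illustration, not taken from the trainer):

```python
import numpy as np

# Hypothetical prediction logits for one utterance: 3 time steps, vocab size 4.
# The last time step is padding that the evaluation loop filled with -100.
pred_logits = np.array([[[0.1, 2.0, 0.3, 0.2],
                         [0.4, 0.1, 3.0, 0.2],
                         [-100.0, -100.0, -100.0, -100.0]]])

pred_ids = np.argmax(pred_logits, axis=-1)
print(pred_ids)  # [[1 2 0]] -- the padded step silently becomes token ID 0

# If the custom TokenSet maps <pad> to, say, ID 3 instead of 0, that trailing 0
# is decoded as a real token and skews the CER/WER computation.
```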
Proposed Solution
Save a padding_mask which stores the locations of the -100 padding in the prediction logits. Then, after applying the argmax, use the mask to set the padded entries to the ID corresponding to the padding token.
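A minimal sketch of that idea, assuming the logits arrive as a numpy array and that a time step counts as padding when all of its logits equal -100 (the helper name and the way the mask is derived are illustrative, not the trainer's actual code):

```python
import numpy as np

def logits_to_pred_ids(pred_logits: np.ndarray, pad_token_id: int) -> np.ndarray:
    # Remember which positions were -100 padding before the argmax discards
    # that information. Assumption: a time step is padding iff every logit
    # at that step equals -100.
    padding_mask = np.all(pred_logits == -100, axis=-1)

    pred_ids = np.argmax(pred_logits, axis=-1)

    # Replace the spurious 0s at padded positions with the real pad token ID,
    # so decoding treats them as padding regardless of the TokenSet's mapping.
    pred_ids[padding_mask] = pad_token_id
    return pred_ids
```

With this, _compute_metrics() would call the helper with the pad token ID from the TokenSet instead of taking the raw argmax.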