jonatasgrosman / huggingsound

HuggingSound: A toolkit for speech-related tasks based on Hugging Face's tools
MIT License

Solved issue related to prediction padding and pad_token_id #68

Closed nkaenzig closed 2 years ago

nkaenzig commented 2 years ago

This is a fairly well-hidden bug that took me quite some time to track down :)

Issue

_compute_metrics() in the evaluation loop computes incorrect CER & WER metrics when using a TokenSet whose pad_token_id is not 0. This doesn't affect the loss calculation or the training itself, but the metrics logged during training will be wrong and won't match the metrics calculated with model.evaluate() after training.
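For a quick check of whether your setup is affected, you can print the pad ID of your tokenizer (the model name below is just an example; substitute your own checkpoint or custom vocab):

from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-english")
print(tokenizer.pad_token_id)  # 0 for most published wav2vec2 vocabs; anything else triggers the bug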

Reason

Like the label_ids, the prediction logits are passed to _compute_metrics() as a matrix padded with -100 values. The argmax call that maps logits to token IDs turns every fully padded frame into token ID 0, because argmax over a row of identical -100 values returns the first index. So after the argmax, pred_ids is effectively 0-padded. For most wav2vec2 models this is not an issue, because their vocab.json assigns ID 0 to the <pad> token. However, if you use a custom TokenSet for finetuning, <pad> will most probably not be mapped to 0, so the obtained "0-padding" values will wrongly correspond to another token.

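To make the failure mode concrete, here is a minimal standalone sketch (the logit values are made up): a frame whose logits are all -100 padding argmaxes to index 0, which only happens to be <pad> for the standard vocab.

import numpy as np

# Two sequences in a batch; the second is shorter, so its last
# frame consists entirely of -100 padding values.
pred_logits = np.array([
    [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1]],
    [[0.4, 1.9, 0.2], [-100.0, -100.0, -100.0]],
])

pred_ids = np.argmax(pred_logits, axis=-1)
print(pred_ids)
# [[2 0]
#  [1 0]]  <- the padded frame became token ID 0, whatever token 0 happens to be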

See relevant code here: https://github.com/jonatasgrosman/huggingsound/blob/main/huggingsound/trainer.py#L599

pred_logits = pred.predictions
pred_ids = np.argmax(pred_logits, axis=-1)  # -100-padded frames silently become token ID 0 here

pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id  # labels are fixed up, but predictions are not

Proposed Solution

Save a padding_mask that stores the locations of the -100 padding in the prediction logits. Then, after applying the argmax, use the mask to set the padded entries of pred_ids to the ID of the padding token.
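A minimal sketch of that idea (variable names are illustrative; this is not necessarily the exact patch that was merged):

pred_logits = pred.predictions

# remember which frames are pure -100 padding before the argmax discards that information
padding_mask = np.all(pred_logits == -100, axis=-1)

pred_ids = np.argmax(pred_logits, axis=-1)
pred_ids[padding_mask] = processor.tokenizer.pad_token_id  # re-pad with the real pad token ID

pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id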

nkaenzig commented 2 years ago

@jonatasgrosman Would be great if we could merge this :)

jonatasgrosman commented 2 years ago

Thanks, @nkaenzig, for noticing that bug and making this contribution :) And sorry for the delayed response