cc @sanchit-gandhi as well
@patrickvonplaten plz help!
Hey @YooSungHyun!
I too have experienced differences in eval WER results by changing my padding strategy. In this case, I changed how I bucketed my inputs from bins of 2s to 1.5s, and got a 0.5% WER improvement when training on LibriSpeech 100h and evaluating on validation.clean. It looks like your example is much more severe!
Theoretically speaking, padding should not impact the training or evaluation results: the attention mask ensures that padded inputs/labels are not attended to and sets them to a large negative number in the attention scores, so group norm and self-attention operations should be unaffected. However, practically there might be small differences due to numerical precision, especially if the amount of padding is excessive.
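As a quick sanity check, you can compare the logits you get for the same utterance with and without padding. Here's a rough sketch (the checkpoint name is just an example, substitute your own fine-tuned model):

```python
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

# example checkpoint with do_stable_layer_norm=True and attention-mask support
model_name = "facebook/wav2vec2-large-960h-lv60-self"
processor = AutoProcessor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).eval()


def logits_for(audio, pad_to_seconds=None):
    # optionally pad the raw waveform to a fixed length (in samples)
    kwargs = {}
    if pad_to_seconds is not None:
        kwargs = {"padding": "max_length", "max_length": int(pad_to_seconds * 16_000)}
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", **kwargs)
    with torch.no_grad():
        out = model(inputs.input_values, attention_mask=inputs.get("attention_mask"))
    return out.logits


audio = torch.randn(16_000).numpy()  # 1 s of dummy audio, stands in for a real utterance
unpadded = logits_for(audio)
padded = logits_for(audio, pad_to_seconds=10)[:, : unpadded.shape[1]]

# if padding were fully masked this would be ~0 (up to numerical precision);
# larger values indicate the output is sensitive to the amount of padding
print((unpadded - padded).abs().max())
```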
If padding is having such a large effect on your evaluation results, it might be worthwhile injecting some custom behaviour into the `Trainer`. What you can do is override the `_get_eval_sampler` method to return the `LengthGroupedSampler` instead of the sequential sampler:
```python
from typing import Optional

import datasets
import torch
from datasets import Dataset
from packaging import version
from torch.utils.data import SequentialSampler
from transformers import Trainer, is_datasets_available
from transformers.trainer_pt_utils import LengthGroupedSampler


class CustomTrainer(Trainer):
    def _get_eval_sampler(self, eval_dataset: Dataset) -> Optional[torch.utils.data.Sampler]:
        if self.args.group_by_length:
            # Build the sampler. Adapted from _get_train_sampler
            generator = None
            if version.parse(torch.__version__) >= version.parse("1.6"):
                generator = torch.Generator()
                # fall back to the training seed if no dedicated data_seed is set
                seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed
                generator.manual_seed(seed)
            if is_datasets_available() and isinstance(eval_dataset, datasets.Dataset):
                lengths = (
                    eval_dataset[self.args.length_column_name]
                    if self.args.length_column_name in eval_dataset.column_names
                    else None
                )
            else:
                lengths = None
            model_input_name = self.tokenizer.model_input_names[0] if self.tokenizer is not None else None
            return LengthGroupedSampler(
                self.args.eval_batch_size,
                dataset=eval_dataset,
                lengths=lengths,
                model_input_name=model_input_name,
                generator=generator,
            )
        else:
            return SequentialSampler(eval_dataset)
```

```python
trainer = CustomTrainer(model=model, ...)
```
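For the custom sampler to actually kick in, `group_by_length` has to be enabled and the length column has to exist in the eval set as well. A sketch of the corresponding training arguments (the column name `input_length` is just the one used in the official CTC example, adapt as needed):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-ctc",        # example output directory
    group_by_length=True,               # enables the length-grouped sampler above
    length_column_name="input_length",  # pre-computed length column in the dataset
    per_device_eval_batch_size=8,
    data_seed=42,                       # seeds the sampler's generator
)
```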
Let me know how you get on
hi, bro! @sanchit-gandhi !
Lol! you made a custom trainer???? 😂 Awesome!!
But, I have another very easy solution....kkk!!!
The eval and predict loops use the SequentialSampler, right? So I simply sorted my datasets by length.
It looks like this:
If training, group_by_length works, so I don't sort. If eval & predict, group_by_length doesn't work, so I sort the dataset and hand it to the SequentialSampler -> it then behaves like the LengthGroupedSampler, so I don't have to override anything anymore!! 😎
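A minimal sketch of that idea, assuming the `input_length` column from the official CTC example script holds the audio lengths:

```python
# sort the eval/test splits by length so that the SequentialSampler yields
# length-grouped batches and the padding inside each batch stays small
eval_dataset = eval_dataset.sort("input_length")
test_dataset = test_dataset.sort("input_length")

metrics = trainer.evaluate(eval_dataset)
predictions = trainer.predict(test_dataset)
```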
And anyway, I think the problem is caused by layer normalization, not attention. Attention is innocent! Wav2Vec 2.0 pre-training has to pick either group norm or layer norm, and I already debugged this: the normalized output with padding and without padding (batch size 1) is different. And when a very long sequence and a very short one end up in the same batch (2 samples), the short sample's attention output (context vector) looks like it is all pad, so the model predicts the empty text '', and the WER metric gets high. That is the problem 🦄
Hey @YooSungHyun!
Nice, the `.sort()` trick you used is neat! As you said, this is fine for the dev/test datasets where we don't require shuffling, and so a deterministic sorting strategy is entirely valid.
There is indeed strange behaviour in the original Wav2Vec2 base checkpoint caused by a bug in the computation of the layer-norm layers: https://github.com/huggingface/transformers/blob/84beb8a49bf137a88d1b29ab3a85ba0a3cd097d5/src/transformers/models/wav2vec2/configuration_wav2vec2.py#L98
This was copied one-to-one from the original fairseq implementation!
You could try using a checkpoint that uses the 'stable' layer-norm implementation, i.e. one of the large checkpoints: https://huggingface.co/facebook/wav2vec2-large-lv60/blob/main/config.json#L42
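For reference, the flag can be inspected on a checkpoint's config and set explicitly when pre-training your own model. A rough sketch:

```python
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

# the large lv60 checkpoints ship with the 'stable' layer-norm variant
config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-large-lv60")
print(config.do_stable_layer_norm)  # True

# when pre-training a model from scratch you can opt in explicitly
my_config = Wav2Vec2Config(do_stable_layer_norm=True, feat_extract_norm="layer")
model = Wav2Vec2ForPreTraining(my_config)
```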
THX @sanchit-gandhi
I'm already using do_stable_layer_norm and the problem still occurs, so I have to sort the eval and test sets... 😂
Also, wav2vec2-conformer doesn't support that param!
Do you agree that the padding issue comes from layer_norm?
Sure, if you're using Wav2Vec2Conformer then the only available configuration is the correct layer-norm implementation. It's hard to know where the issue lies without a reproducible example. Could you maybe provide a short code snippet that I could run to see how you're padding the data? Thanks!
@sanchit-gandhi THX for the reply. I checked wav2vec2-conformer, and it already behaves like do_stable_layer_norm...!
In my case, I just pre-trained a base model and fine-tuned it with Wav2Vec2ForCTC (do_stable_layer_norm True, group_by_length True), and finally when I run the predict loop (for model evaluation/testing):

- first case: eval set shuffled, eval batch size 2 -> WER is high
- second case: eval set sorted, eval batch size 2 -> WER is lower than the first case
- third case: eval set sorted, eval batch size 1 -> WER is the lowest
- fourth case: eval set shuffled, eval batch size 1 -> WER is the same as the third case

So I think batching and shuffling affect the WER; that is why I say 'padded data affects layer normalization'. do_stable_layer_norm does not help in this situation, I think.
The script I used is https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py but the dataset is Korean audio and Korean text.
Okay interesting - could you check the losses for the four cases - are they the same or do they differ? If they are the same it's a tokenizer issue with padding. Otherwise likely a modelling issue!
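One rough way to collect those numbers (a sketch only; it assumes the `input_length` column from the CTC example script and a `compute_metrics` that reports WER, and it re-uses the trainer by tweaking its eval batch size in place):

```python
# re-run the predict loop under the different settings and compare the reported loss
for batch_size, sort_eval in [(2, False), (2, True), (1, True), (1, False)]:
    dataset = eval_dataset.sort("input_length") if sort_eval else eval_dataset.shuffle(seed=42)
    trainer.args.per_device_eval_batch_size = batch_size  # picked up by the next predict call
    metrics = trainer.predict(dataset, metric_key_prefix="test").metrics
    print(f"batch_size={batch_size} sorted={sort_eval} "
          f"loss={metrics['test_loss']:.4f} wer={metrics.get('test_wer')}")
```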
@sanchit-gandhi I'm very busy now, so I will reply to this comment as soon as possible, bro!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
- ubuntu 18.04
- python 3.6, 3.9
- transformers 1.18.0
Who can help?
@patrickvonplaten, @anton-l
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
This is my real test case.

sorted case (batch size: metric result)
- 8: {'test_wer': 0.2266084739113378, 'test_cer': 0.08425300357677845}
- 4: {'test_wer': 0.22646135739505688, 'test_cer': 0.08419186206474887}
- 2: {'test_wer': 0.2264123185562966, 'test_cer': 0.08417657668674146}
- 1: {'test_wer': 0.22646135739505688, 'test_cer': 0.08419186206474887}

unsorted case
- 8: 35%
- 4: not tested
- 2: 25%
- 1: {'eval_wer': 0.22646135739505688, 'eval_cer': 0.08419186206474887}

Maybe the CNN layer or group normalization is affected by the padded data...? (Both configs raise this issue.) When I was training, I passed group_by_length=True, so training is fine I think, but the eval/test sampler is just the SequentialSampler, so the eval/predict WER results are somewhat weird.
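If it helps to narrow this down, here is a rough sketch that isolates the convolutional front-end (the checkpoint name is just an example; the base checkpoints use `feat_extract_norm="group"`):

```python
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()
feature_encoder = model.wav2vec2.feature_extractor  # the CNN front-end

audio = torch.randn(1, 16_000)                                   # 1 s of dummy audio
padded = torch.cat([audio, torch.zeros(1, 9 * 16_000)], dim=-1)  # same audio + 9 s of padding

with torch.no_grad():
    feats = feature_encoder(audio)
    feats_padded = feature_encoder(padded)[..., : feats.shape[-1]]

# with feat_extract_norm="group" the first conv layer's GroupNorm normalises each channel
# over the whole (padded) time axis, so these can differ; with "layer" norm every frame is
# normalised independently and the difference should stay near zero
print((feats - feats_padded).abs().max())
```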