Are Wav2Vec 2.0 model output logits affected by audio padding?

YooSungHyun commented 2 years ago

System Info

Ubuntu 18.04, Python 3.6 and 3.9, transformers 1.18.0

Who can help?

@patrickvonplaten, @anton-l

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

The dev and test datasets are shuffled.
In the eval and predict loops, the model input is not smart-batched (group_by_length is not supported there).
As a result, the WER computed with a large batch size is higher than with a small batch size.
If I sort the dev and test datasets by audio length, computing the WER metric is faster and the result is not affected by batch size.

Expected behavior

This is my real test case.

Sorted case (batch size: metric result):
8: {'test_wer': 0.2266084739113378, 'test_cer': 0.08425300357677845}
4: {'test_wer': 0.22646135739505688, 'test_cer': 0.08419186206474887}
2: {'test_wer': 0.2264123185562966, 'test_cer': 0.08417657668674146}
1: {'test_wer': 0.22646135739505688, 'test_cer': 0.08419186206474887}

Unsorted case (batch size: metric result):
8: 35%
4: not tested
2: 25%
1: {'eval_wer': 0.22646135739505688, 'eval_cer': 0.08419186206474887}

Maybe the CNN layers or group normalization are affected by the padded data? (Both configs raised this issue.) During training I set group_by_length=True, so I think training is fine. But the eval and test sampler is just a sequential sampler, so the eval and predict WER results look a bit weird.

LysandreJik commented 2 years ago

cc @sanchit-gandhi as well

YooSungHyun commented 2 years ago

@patrickvonplaten plz help!

sanchit-gandhi commented 2 years ago

Hey @YooSungHyun!

I too have experienced differences in eval WER results by changing my padding strategy. In this case, I changed how I bucketed my inputs from bins of 2s to 1.5s, and got a 0.5% WER improvement when training on LibriSpeech 100h and evaluating on validation.clean. It looks like your example is much more severe!

Theoretically speaking, padding should not impact the training or evaluation results: the attention mask ensures that padded inputs/labels are not attended to and sets them to a large negative number in the attention scores, so group norm and self-attention operations should be unaffected. However, practically there might be small differences due to numerical precision, especially if the amount of padding is excessive.
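As a rough illustration of the masking argument above, here is a toy pure-Python softmax. It is not the library's attention code: the function name `masked_softmax`, the fill value `-1e9`, and the scores are all assumptions chosen for the example. It shows that once padded positions are pushed to a large negative score, the weights on the real frames match the unpadded case.

```python
import math


def masked_softmax(scores, mask, neg=-1e9):
    """Softmax over `scores`; positions where mask == 0 are replaced by `neg` (assumed fill value)."""
    masked = [s if m else neg for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Three real frames, then the same frames followed by two padded positions
# whose raw scores are arbitrary (hypothetical values).
scores = [2.0, 1.0, 0.5]
padded_scores = scores + [3.0, 3.0]
mask = [1, 1, 1, 0, 0]

unpadded = masked_softmax(scores, [1, 1, 1])
padded = masked_softmax(padded_scores, mask)

# Largest difference between the real-frame weights with and without padding
max_diff = max(abs(p - u) for p, u in zip(padded, unpadded))
```

This only covers the attention step. Any operation that computes statistics across the time axis without using the mask would need a separate check.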

If padding is having such a large effect on your evaluation results, it might be worthwhile injecting some custom behaviour into the Trainer. What you can do is override the _get_eval_sampler method to return the LengthGroupedSampler instead of the sequential sampler:

from typing import Optional
import datasets
import torch
from datasets import Dataset
from torch.utils.data import SequentialSampler
from transformers import Trainer, is_datasets_available
from transformers.trainer_pt_utils import LengthGroupedSampler
from packaging import version

class CustomTrainer(Trainer):
    def _get_eval_sampler(self, eval_dataset: Dataset) -> Optional[torch.utils.data.Sampler]:
        if self.args.group_by_length:
            # Build the sampler. Adapted from _get_train_sampler
            generator = None
            if version.parse(torch.__version__) >= version.parse("1.6"):
                generator = torch.Generator()
                # data_seed defaults to None, so fall back to the global seed
                seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed
                generator.manual_seed(seed)
            # Use the eval_dataset argument (not self.eval_dataset) so that a dataset
            # passed directly to evaluate()/predict() is handled correctly
            if is_datasets_available() and isinstance(eval_dataset, datasets.Dataset):
                lengths = (
                    eval_dataset[self.args.length_column_name]
                    if self.args.length_column_name in eval_dataset.column_names
                    else None
                )
            else:
                lengths = None
            model_input_name = self.tokenizer.model_input_names[0] if self.tokenizer is not None else None
            return LengthGroupedSampler(
                self.args.eval_batch_size,
                dataset=eval_dataset,
                lengths=lengths,
                model_input_name=model_input_name,
                generator=generator,
            )
        else:
            return SequentialSampler(eval_dataset)

trainer = CustomTrainer(model=model, ...)

Let me know how you get on

YooSungHyun commented 2 years ago

hi, bro! @sanchit-gandhi !

Lol! You made a custom trainer? 😂 Awesome!!

But I have another, very easy solution... kkk!!!

The eval and predict loops use SequentialSampler, right? So I only sorted my datasets.

It looks like this:

When training, group_by_length works, so don't sort! For eval and predict, group_by_length does not work, so I sort the dataset and feed it to the SequentialSampler -> it behaves like LengthGroupedSampler. So I don't have to override anything! 😎
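The sort-then-batch-sequentially idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `sequential_batches` and `padding_fraction` and the random clip lengths are hypothetical, not taken from the thread or from the library. It compares how much padding each batch needs when a sequential sampler walks over an unsorted versus a length-sorted list.

```python
import random


def sequential_batches(lengths, batch_size):
    """Cut the list into consecutive batches, as a sequential sampler would."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_fraction(batches):
    """Fraction of all frames that are padding when each batch is padded to its longest item."""
    padded = sum(max(batch) * len(batch) for batch in batches)
    real = sum(sum(batch) for batch in batches)
    return (padded - real) / padded


# Hypothetical clip lengths in samples (roughly 1 to 20 seconds at 16 kHz)
rng = random.Random(0)
lengths = [rng.randint(16_000, 320_000) for _ in range(64)]

unsorted_waste = padding_fraction(sequential_batches(lengths, 8))
sorted_waste = padding_fraction(sequential_batches(sorted(lengths), 8))
single_waste = padding_fraction(sequential_batches(lengths, 1))

print(f"unsorted, batch 8: {unsorted_waste:.1%} padding")
print(f"sorted,   batch 8: {sorted_waste:.1%} padding")
print(f"any order, batch 1: {single_waste:.1%} padding")
```

With batch size 1 there is never any padding, which is consistent with the batch-size-1 results being identical in the sorted and unsorted runs reported earlier in the thread.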

And anyway, I think the problem is caused by layer normalization, not attention. Attention is innocent! Wav2Vec 2.0 pre-training has to select group or layer norm, and I have already debugged it: the normalization output differs between padded input and unpadded input (batch size 1). And when a very long sequence and a very short one are in the same batch (batch size 2), the short one's attention output (context vector) looks like it is all padding, so the model predicts the empty text ''. So the WER metric is high. That is the problem 🦄
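A toy example of the effect described above: if a normalization computes its mean and variance across the time axis, zero padding changes those statistics and therefore changes the output for the real frames. The sketch below is not the Wav2Vec2 implementation. The function `normalize_over_time`, the epsilon, and the sample values are assumptions for illustration, and whether it applies to a given checkpoint depends on which axis the configured normalization reduces over.

```python
def normalize_over_time(frames, eps=1e-5):
    """Normalize a 1-D sequence with the mean and variance taken over the time axis."""
    mean = sum(frames) / len(frames)
    var = sum((x - mean) ** 2 for x in frames) / len(frames)
    return [(x - mean) / (var + eps) ** 0.5 for x in frames]


# A short clip on its own (batch size 1, no padding) ...
short_clip = [0.2, -0.1, 0.4, 0.3]
# ... and the same clip zero-padded to match a much longer clip in its batch
padded_clip = short_clip + [0.0] * 12

alone = normalize_over_time(short_clip)
in_batch = normalize_over_time(padded_clip)[: len(short_clip)]

# Largest change in the real frames caused purely by the padding
max_diff = max(abs(a - b) for a, b in zip(alone, in_batch))
print(f"max difference on real frames: {max_diff:.3f}")
```

The more padding a short clip receives, the further its statistics drift from the unpadded case, which would fit the observation that short clips batched with long ones are hit hardest.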

sanchit-gandhi commented 2 years ago

Hey @YooSungHyun!

Nice, the .sort() trick you used is neat! As you said, this is fine for the dev/test datasets where we don't require shuffling, and so a deterministic sorting strategy is entirely valid.

There is indeed strange behaviour in the original Wav2Vec2 base checkpoint caused by a bug in the computation of the layer-norm layers: https://github.com/huggingface/transformers/blob/84beb8a49bf137a88d1b29ab3a85ba0a3cd097d5/src/transformers/models/wav2vec2/configuration_wav2vec2.py#L98

This was copied one-to-one from the original fairseq implementation!

You could try using a checkpoint that uses the 'stable' layer-norm implementation, i.e. one of the large checkpoints: https://huggingface.co/facebook/wav2vec2-large-lv60/blob/main/config.json#L42

YooSungHyun commented 2 years ago

Thanks @sanchit-gandhi. I'm already using do_stable_layer_norm, and the problem occurs with it too, so I have to sort the eval and test sets...😂 Also, wav2vec2-conformer does not support that param!

Do you agree that the padding issue is caused by layer_norm?

sanchit-gandhi commented 2 years ago

Sure, if you're using Wav2Vec2Conformer then the only configuration is the correct layer-norm implementation. It's hard to know where the issue lies without a reproducible example, could you maybe provide a short code-snippet that I could run to see how you're padding the data? Thanks!

YooSungHyun commented 2 years ago

@sanchit-gandhi Thanks for the reply. I checked wav2vec2-conformer, and it already behaves like do_stable_layer_norm...!

In my case, I pretrained a base model and then fine-tuned it with Wav2Vec2ForCTC (do_stable_layer_norm is True, group_by_length is True). Finally, when I run the predict loop (for model evaluation/testing):
First case: eval set shuffled, eval batch size 2: WER is high.
Second case: eval set sorted, eval batch size 2: WER is lower than in the first case.
Third case: eval set sorted, eval batch size 1: WER is the lowest.
Fourth case: eval set sorted, eval batch size 1: WER is the same as in the third case.

So I think batch size and shuffling affect WER. That is why I believe padded data affects layer normalization. I don't think do_stable_layer_norm helps in this situation.

The source I used is https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py but the dataset is Korean audio and Korean text.

sanchit-gandhi commented 2 years ago

Okay interesting - could you check the losses for the four cases - are they the same or do they differ? If they are the same it's a tokenizer issue with padding. Otherwise likely a modelling issue!

YooSungHyun commented 2 years ago

@sanchit-gandhi I'm very busy right now, so I will reply to this comment as soon as possible, bro!

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
