Closed: YooSungHyun closed this issue 1 year ago.
Hey @YooSungHyun - could you possibly link your wandb logs? I can then take a closer look at the nan loss!
@sanchit-gandhi Hi! That is my company's private wandb, so I can't share it with you.
But I have one hypothesis, and I'm testing it now. The Conformer has a convolution module, so the Conformer output length is shorter than wav2vec2-base's. As a result, some audio samples end up shorter than their label lengths, which causes the CTC loss inf issue (see the check sketched below). My zero_inf param is True, but I think a 0 loss is noise too, because with the 'mean' strategy the denominator differs (0 vs. a non-zero loss).
I think some of these inf samples confuse the model that has been trained up to that epoch. I can't reply to the #18501 issue because this issue is a higher priority 😥
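Something like this check is what I have in mind (a rough sketch of mine; the kernel/stride values are the default Wav2Vec2/Conformer feature-encoder config, so adjust them to whatever your config actually uses):

```python
# Default Wav2Vec2 / Wav2Vec2-Conformer feature-encoder conv config (check your own config!)
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_output_length(num_samples: int) -> int:
    """Number of encoder frames produced for a raw waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def valid_for_ctc(num_samples: int, num_label_tokens: int) -> bool:
    # CTC needs at least as many encoder frames as label tokens
    # (strictly more when the label has repeated adjacent tokens, because of blanks).
    return num_label_tokens > 0 and conv_output_length(num_samples) >= num_label_tokens
```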
Hey @YooSungHyun! Sorry for the late reply.
You can filter based on the audio input length or transcription output length. You should make sure that the audio input length is large enough to give at least one Wav2Vec2 hidden-state after the convolutional module, and that the transcription output length is larger than zero to give at least one term in the CTC loss.
You can set min_duration_in_seconds to a value greater than zero to filter out audio samples shorter than a certain length in seconds (c.f. run_speech_recognition_seq2seq.py#L407-L410). Each Wav2Vec2 feature encodes roughly 25ms of audio, so I would advise setting this to a value greater than 0.025.
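For example, with 🤗 Datasets the duration filter could look something like this (a rough sketch; the loader, path, and column names are placeholders for your own data):

```python
from datasets import Audio, load_dataset

MIN_DURATION_S = 0.03  # clips shorter than ~25 ms yield no Wav2Vec2 features at all

# "audiofolder" and the directory are placeholders; point this at your own dataset
dataset = load_dataset("audiofolder", data_dir="path/to/your_korean_data", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))


def long_enough(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] >= MIN_DURATION_S


dataset = dataset.filter(long_enough)
```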
@sanchit-gandhi hello!
I used 4 kinds of datasets, and 2 of them have this problem. So I will filter for 'at least 25ms' and 'cnn output length / N > label length'.
I will test it and share the results with you.
Great! For the inputs, filtering by a minimum input length of 25ms should suffice. This is based on the down-sampling ratio of Wav2Vec2. You can work out the precise value based on the down-sampling factor of the conv layers!
For the outputs, you just need non-zero label lengths (such that the number of terms in the CTC loss is non-zero); nothing fancy required with the down-sampling ratio here!
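Continuing the rough sketch above, the label-side filter could be as simple as this (the "text" column name and the tokenizer path are again placeholders):

```python
from transformers import Wav2Vec2CTCTokenizer

# placeholder path: the CTC vocabulary / tokenizer you already built for Korean
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("path/to/your_vocab")


def add_labels(example):
    example["labels"] = tokenizer(example["text"]).input_ids
    return example


dataset = dataset.map(add_labels)
# keep only samples that contribute at least one term to the CTC loss
dataset = dataset.filter(lambda example: len(example["labels"]) > 0)
```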
I filtered for 25ms and only feed samples with len(labels) > 0, but the eval loss still reached NaN around epoch 3~4... 😥 Big stress...
Ooof ok, tricky issue! How does the training loss look? Are the training loss / gradients exploding? The fact that you get a real-valued eval loss and WER for the first 2 epochs means your filtering is most likely correct (otherwise you'd get NaN on the first eval step).
If you're able to provide a small reproducible code snippet, that would help massively.
Side note, if you're interested in good ASR performance and are not too bothered whether it's from Wav2Vec2 or a different model, you could try fine-tuning Whisper (see https://huggingface.co/blog/fine-tune-whisper) -> I've found it to be more stable and generally more performant than Wav2Vec2 CTC
@sanchit-gandhi Thanks for the reply! But I have to use Wav2Vec2 Conformer... 😢
I think my data has issues, so I'm validating my dataset. How about this? Do you think this situation causes problems? (label and pad)
Hugging Face smart batching (group_by_length) uses mega-batches, so does group_by_length sample every 50 steps / 50 batches? If so, some short-label data can end up in a batch like this (the audio is 1 sec). So another hypothesis of mine is to override the batch sampler with the usual smart batching, i.e. sampling strictly in length order (see the sketch below). I will test this and leave a comment.
My train loss looks like this:
The very interesting thing is that it only happened with wav2vec2-conformer (trained from scratch for Korean).
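Something like this is what I mean by the usual smart batching (just an illustrative sketch of mine, not the Transformers sampler):

```python
# Sort all samples by input length once, cut the sorted order into contiguous
# batches so each batch holds similarly-sized samples, then shuffle only the
# order of the batches.
import random

from torch.utils.data import Sampler


class LengthSortedBatchSampler(Sampler):
    def __init__(self, lengths, batch_size, shuffle_batches=True):
        self.lengths = lengths
        self.batch_size = batch_size
        self.shuffle_batches = shuffle_batches

    def __iter__(self):
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i : i + self.batch_size] for i in range(0, len(order), self.batch_size)]
        if self.shuffle_batches:
            random.shuffle(batches)  # shuffle between batches, keep length grouping within them
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

(Passed to the DataLoader via batch_sampler=... instead of batch_size/shuffle.)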
It looks like there's a lot of padding for the second item in the batch, but this shouldn't cause problems with stability, only ones related to numerical precision (all the labels with -100 are set to -inf in the loss computation to be masked; there'll be a numerical tolerance vs. perfect masking).
Can you maybe find the corresponding audio for the sample where the train loss collapses and check this is properly prepared and pre-processed?
I don't think this is related necessarily to batch sorting.
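If it helps track down the offending sample, a rough sweep like this (assuming `model` is your Wav2Vec2 Conformer CTC model and `dataset` already has "input_values" and "labels" columns) will flag anything with a NaN/inf or suspiciously large per-sample loss:

```python
import math

import torch

model.eval()
bad_samples = []
with torch.no_grad():
    for idx, sample in enumerate(dataset):
        input_values = torch.tensor(sample["input_values"]).unsqueeze(0)
        labels = torch.tensor(sample["labels"]).unsqueeze(0)
        loss = model(input_values=input_values, labels=labels).loss.item()
        if math.isnan(loss) or math.isinf(loss) or loss > 1e3:  # threshold is arbitrary
            bad_samples.append((idx, loss))

print(bad_samples)
```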
@sanchit-gandhi Hmm... what a tragedy, some wav data doesn't match its text data... damn! e.g. wav: "some apple is good for you when eat morning" / text: "apple" (how dumb!?). Maybe this data makes the loss overshoot..? I'm filtering it now...
IMO data is more important than models in ML! The proof is in the pudding 😉 Just out of interest, how are you planning on filtering this data? Manually? Or do you have a heuristic? What you could do is run a baseline CTC Korean system on all of your samples and compute the WER against the text on a sample-by-sample basis. You could then throw out all the samples that exceed, say, 50% WER, and keep the 'cleaner' samples that are less than 50% WER (a rough sketch of this follows the examples below). Take your example:
Audio: some apple is good for you when eat morning
Text: apple
Pred: some apple is good for you when eat morning
WER = 800%
=> discard sample!
Another example:
Audio: we like to bake cakes and eat crumble
Text: we like to bake cakes and eat crumble
Pred: we like to bake cakes and meet crumble
WER = 12.5%
=> keep sample
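A rough sketch of that per-sample filter (the baseline checkpoint path, 16 kHz audio, and column names are all assumptions to adapt to your setup):

```python
import evaluate
from transformers import pipeline

wer_metric = evaluate.load("wer")
# placeholder checkpoint: any reasonable Korean CTC baseline will do
asr = pipeline("automatic-speech-recognition", model="path/to/baseline-korean-ctc")

MAX_WER = 0.5  # tune this cut-off on a held-out slice of your data


def keep_sample(example):
    # raw arrays are assumed to already be at the model's sampling rate (16 kHz)
    pred = asr(example["audio"]["array"])["text"]
    sample_wer = wer_metric.compute(predictions=[pred], references=[example["text"]])
    return sample_wer <= MAX_WER


clean_dataset = dataset.filter(keep_sample)
```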
@sanchit-gandhi Holy...! That is an awesome idea!!! 😮
My idea was only a heuristic: in this case, most of the corrupted data follows a pattern where the text has a lot of padding but the audio has little, because the audio is long on average but the text is not.
So I'm running the group_by_sample sampler now, and whenever a label is more than 90% padding I save the wav and label, then check some of that data manually.
About 0.1~1% of the data is corrupted. I'm filtering it based on the wav audio values and the tokenized label values.
But your idea is better than mine...! How embarrassing! 👽
Good luck! You'll have to set your cut-off WER carefully, but otherwise this is a pretty robust method.
Since the issue is not related to the Transformers modelling code but rather to do with the specific dataset used, I'm going to close this issue. Feel free to post on the forum if you encounter any further difficulties with your training and are seeking help (you can tag me there): https://discuss.huggingface.co
What you could also do is replace the shortened text with the transcriptions from the baseline system if you wanted:
Audio: some apple is good for you when eat morning
Text: apple
Pred: some apple is good for you when eat morning
WER = 800%
=> replace text with pred; the new target is: some apple is good for you when eat morning
Again you'll have to experiment to see whether this is viable based on the quality of your baseline transcriptions. This way though you'll throw away less data.
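Reusing the `asr` pipeline, `wer_metric`, and `MAX_WER` from the filtering sketch above, the replacement variant could look like:

```python
def relabel_if_bad(example):
    pred = asr(example["audio"]["array"])["text"]
    sample_wer = wer_metric.compute(predictions=[pred], references=[example["text"]])
    if sample_wer > MAX_WER:
        example["text"] = pred  # trust the baseline transcription instead of the broken label
    return example


relabelled_dataset = dataset.map(relabel_if_bad)
```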
System Info
transformers version: 4.21.1

Who can help?
@patrickvonplaten , @anton-l , @sanchit-gandhi
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Datasets: my own Korean wav files and text datasets
Pre-trained model: Wav2Vec2 Conformer
Fine-tuning strategy: the example run_speech_recognition_ctc.py (https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py)
Audio length: min 16000 ~ max 490000 samples, sampling_rate 16000
When training, after about 400000 steps (3~4 epochs), the loss becomes NaN and the WER is 1.01
do_stable_layer_norm: True, CTC reduction: mean, zero_infinity (zero_inf): True
Expected behavior
My loss & WER decrease stably.