Closed: YooSungHyun closed this issue 1 year ago.
Hey @YooSungHyun - could you possibly link your wandb logs? I can then take a closer look at the nan loss!
@sanchit-gandhi Hi! That is my company's private wandb, so I can't share it with you.
But I have one hypothesis, and I'm testing it now. The Conformer has a convolution module, so the Conformer output length is shorter than wav2vec2-base's. As a result, some audio samples end up shorter than their label lengths, which causes the CTC loss inf issue (see the check sketched below). My zero_inf param is True, but I think a 0 loss is noise too, because with the 'mean' strategy the denominator differs (0 vs. a non-zero loss).
I think some of these inf samples confuse the model that has been trained up to that epoch. I can't reply to the #18501 issue because this issue is a higher priority 😥
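Something like this check is what I have in mind (a rough sketch of mine; the kernel/stride values are the default Wav2Vec2/Conformer feature-encoder config, so adjust them to whatever your config actually uses):

```python
# Default Wav2Vec2 / Wav2Vec2-Conformer feature-encoder conv config (check your own config!)
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_output_length(num_samples: int) -> int:
    """Number of encoder frames produced for a raw waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def valid_for_ctc(num_samples: int, num_label_tokens: int) -> bool:
    # CTC needs at least as many encoder frames as label tokens
    # (strictly more when the label has repeated adjacent tokens, because of blanks).
    return num_label_tokens > 0 and conv_output_length(num_samples) >= num_label_tokens
```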
Hey @YooSungHyun! Sorry for the late reply.
You can filter based on the audio input length or transcription output length. You should make sure that the audio input length is large enough to give at least one Wav2Vec2 hidden-state after the convolutional module, and that the transcription output length is larger than zero to give at least one term in the CTC loss.
You can set min_duration_in_seconds to a value greater than zero to filter out audio samples shorter than a certain length in seconds (c.f. run_speech_recognition_seq2seq.py#L407-L410). Each Wav2Vec2 feature encodes roughly 25ms of audio, so I would advise setting this to a value greater than 0.025.
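For example, with 🤗 Datasets the duration filter could look something like this (a rough sketch; the loader, path, and column names are placeholders for your own data):

```python
from datasets import Audio, load_dataset

MIN_DURATION_S = 0.03  # clips shorter than ~25 ms yield no Wav2Vec2 features at all

# "audiofolder" and the directory are placeholders; point this at your own dataset
dataset = load_dataset("audiofolder", data_dir="path/to/your_korean_data", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))


def long_enough(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] >= MIN_DURATION_S


dataset = dataset.filter(long_enough)
```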
@sanchit-gandhi hello!
I used 4 kinds of datasets, and 2 of them have this problem. So I will filter for 'at least 25ms' and 'cnn output length / N > label length'.
I will test it and share the results with you.
Great! For the inputs, filtering by a minimum input length of 25ms should suffice. This is based on the down-sampling ratio of Wav2Vec2. You can work out the precise value based on the down-sampling factor of the conv layers!
For the outputs, you just need non-zero label lengths (such that the number of terms in the CTC loss is non-zero); nothing fancy required with the down-sampling ratio here!
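Continuing the rough sketch above, the label-side filter could be as simple as this (the "text" column name and the tokenizer path are again placeholders):

```python
from transformers import Wav2Vec2CTCTokenizer

# placeholder path: the CTC vocabulary / tokenizer you already built for Korean
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("path/to/your_vocab")


def add_labels(example):
    example["labels"] = tokenizer(example["text"]).input_ids
    return example


dataset = dataset.map(add_labels)
# keep only samples that contribute at least one term to the CTC loss
dataset = dataset.filter(lambda example: len(example["labels"]) > 0)
```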
I filtered for 25ms and only feed samples with len(labels) > 0, but the eval loss still reached NaN around epoch 3~4... 😥 Big stress...
Ooof ok, tricky issue! How does the training loss look? Are the training loss / gradients exploding? The fact that you get a real-valued eval loss and WER for the first 2 epochs means your filtering is most likely correct (otherwise you'd get NaN on the first eval step).
If you're able to provide a small reproducible code snippet, that would help massively.
Side note, if you're interested in good ASR performance and are not too bothered whether it's from Wav2Vec2 or a different model, you could try fine-tuning Whisper (see https://huggingface.co/blog/fine-tune-whisper) -> I've found it to be more stable and generally more performant than Wav2Vec2 CTC
@sanchit-gandhi Thanks for the reply! But I have to use Wav2Vec2 Conformer... 😢
I think my data has issues, so I'm validating my dataset. How about this? Do you think this situation causes problems? (label and pad)
Hugging Face smart batching (group_by_length) uses mega-batches, so does group_by_length sample every 50 steps / 50 batches? If so, some short-label data can end up in a batch like this (the audio is 1 sec). So another hypothesis of mine is to override the batch sampler with the usual smart batching, i.e. sampling strictly in length order (see the sketch below). I will test this and leave a comment.
My train loss looks like this:
The very interesting thing is that it only happened with wav2vec2-conformer (trained from scratch for Korean).
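Something like this is what I mean by the usual smart batching (just an illustrative sketch of mine, not the Transformers sampler):

```python
# Sort all samples by input length once, cut the sorted order into contiguous
# batches so each batch holds similarly-sized samples, then shuffle only the
# order of the batches.
import random

from torch.utils.data import Sampler


class LengthSortedBatchSampler(Sampler):
    def __init__(self, lengths, batch_size, shuffle_batches=True):
        self.lengths = lengths
        self.batch_size = batch_size
        self.shuffle_batches = shuffle_batches

    def __iter__(self):
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i : i + self.batch_size] for i in range(0, len(order), self.batch_size)]
        if self.shuffle_batches:
            random.shuffle(batches)  # shuffle between batches, keep length grouping within them
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

(Passed to the DataLoader via batch_sampler=... instead of batch_size/shuffle.)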
It looks like there's a lot of padding for the second item in the batch, but this shouldn't cause problems with stability, only ones related to numerical precision (all the labels with -100 are set to -inf in the loss computation to be masked; there'll be a numerical tolerance vs. perfect masking).
Can you maybe find the corresponding audio for the sample where the train loss collapses and check this is properly prepared and pre-processed?
I don't think this is related necessarily to batch sorting.
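If it helps track down the offending sample, a rough sweep like this (assuming `model` is your Wav2Vec2 Conformer CTC model and `dataset` already has "input_values" and "labels" columns) will flag anything with a NaN/inf or suspiciously large per-sample loss:

```python
import math

import torch

model.eval()
bad_samples = []
with torch.no_grad():
    for idx, sample in enumerate(dataset):
        input_values = torch.tensor(sample["input_values"]).unsqueeze(0)
        labels = torch.tensor(sample["labels"]).unsqueeze(0)
        loss = model(input_values=input_values, labels=labels).loss.item()
        if math.isnan(loss) or math.isinf(loss) or loss > 1e3:  # threshold is arbitrary
            bad_samples.append((idx, loss))

print(bad_samples)
```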
@sanchit-gandhi Hmm... what a tragedy, some wav data doesn't match its text data... damn! e.g. wav: "some apple is good for you when eat morning" / text: "apple" (how dumb!?). Maybe this data makes the loss overshoot..? I'm filtering it now...
IMO data is more important than models in ML! The proof is in the pudding 😉 Just out of interest, how are you planning on filtering this data? Manually? Or do you have a heuristic? What you could do is run a baseline CTC Korean system on all of your samples and compute the WER against the text on a sample-by-sample basis. You could then throw out all the samples that exceed, say, 50% WER, and keep the 'cleaner' samples that are less than 50% WER (a rough sketch of this follows the examples below). Take your example:
Audio: some apple is good for you when eat morning
Text: apple
Pred: some apple is good for you when eat morning
WER = 800%
=> discard sample!
Another example:
Audio: we like to bake cakes and eat crumble
Text: we like to bake cakes and eat crumble
Pred: we like to bake cakes and meet crumble
WER = 12.5%
=> keep sample
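A rough sketch of that per-sample filter (the baseline checkpoint path, 16 kHz audio, and column names are all assumptions to adapt to your setup):

```python
import evaluate
from transformers import pipeline

wer_metric = evaluate.load("wer")
# placeholder checkpoint: any reasonable Korean CTC baseline will do
asr = pipeline("automatic-speech-recognition", model="path/to/baseline-korean-ctc")

MAX_WER = 0.5  # tune this cut-off on a held-out slice of your data


def keep_sample(example):
    # raw arrays are assumed to already be at the model's sampling rate (16 kHz)
    pred = asr(example["audio"]["array"])["text"]
    sample_wer = wer_metric.compute(predictions=[pred], references=[example["text"]])
    return sample_wer <= MAX_WER


clean_dataset = dataset.filter(keep_sample)
```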
@sanchit-gandhi Holy...! That is an awesome idea!!! 😮
My idea was only a heuristic: in this case, most of the corrupted data follows a pattern where the text has a lot of padding but the audio has little, because the audio is long on average but the text is not.
So I'm running the group_by_sample sampler now, and whenever a label is more than 90% padding I save the wav and label, then check some of that data manually.
About 0.1~1% of the data is corrupted. I'm filtering it based on the wav audio values and the tokenized label values.
But your idea is better than mine...! How embarrassing! 👽
Good luck! You'll have to set your cut-off WER carefully, but otherwise this is a pretty robust method.
Since the issue is not related to the Transformers modelling code but rather to do with the specific dataset used, I'm going to close this issue. Feel free to post on the forum if you encounter any further difficulties with your training and are seeking help (you can tag me there): https://discuss.huggingface.co
What you could also do is replace the shortened text with the transcriptions from the baseline system if you wanted:
Audio: some apple is good for you when eat morning
Text: apple
Pred: some apple is good for you when eat morning
WER = 800%
=> replace text with pred; the new target is: some apple is good for you when eat morning
Again you'll have to experiment to see whether this is viable based on the quality of your baseline transcriptions. This way though you'll throw away less data.
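Reusing the `asr` pipeline, `wer_metric`, and `MAX_WER` from the filtering sketch above, the replacement variant could look like:

```python
def relabel_if_bad(example):
    pred = asr(example["audio"]["array"])["text"]
    sample_wer = wer_metric.compute(predictions=[pred], references=[example["text"]])
    if sample_wer > MAX_WER:
        example["text"] = pred  # trust the baseline transcription instead of the broken label
    return example


relabelled_dataset = dataset.map(relabel_if_bad)
```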
System Info
transformers version: 4.21.1

Who can help?
@patrickvonplaten , @anton-l , @sanchit-gandhi
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Datasets: my own Korean wav files and text datasets
Pre-trained model: Wav2Vec2 Conformer
Fine-tuning strategy: the example run_speech_recognition_ctc.py (https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py)
Audio length: min 16000 ~ max 490000 samples, sampling_rate 16000
When training, after about 400000 steps (3~4 epochs), the loss becomes NaN and the WER is 1.01
do_stable_layer_norm: True, CTC reduction: mean, zero_infinity (zero_inf): True
Expected behavior
My loss & WER decrease stably.