mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

(bug) including noise recordings leads to inf loss #1941

Closed nicolaspanel closed 5 years ago

nicolaspanel commented 5 years ago

I'm using https://github.com/mozilla/DeepSpeech/commit/382707388ab32e9066c61dcaf621982942139500

Including noise recordings (i.e., recordings with the expected transcription set to the empty string « ») leads to an inf loss during training.

kdavis-mozilla commented 5 years ago

Why are you including noise with the transcript ""?

The usual way to make the model robust to background noise is to have a data set with non-trivial transcripts with background noise in it already or to lay background noise over a clean(er) data set with non-trivial transcripts.
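The noise-overlay approach described above can be sketched as follows. This is an illustrative helper, not DeepSpeech's own augmentation code; the function name and signature are assumptions.

```python
import numpy as np

def mix_noise(clean, noise, snr_db, rng=None):
    """Overlay `noise` onto `clean` at a target signal-to-noise ratio (dB).

    Illustrative sketch: both inputs are 1-D float sample arrays; the noise
    is tiled to cover the clean clip, cut at a random offset, scaled to the
    requested SNR, and added.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Tile the noise so it covers the whole clean clip, then pick an offset.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Training on such mixtures keeps the transcripts non-trivial while still exposing the model to the background noise.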

nicolaspanel commented 5 years ago

I'm processing long recordings (up to 2h) which are cut into sub-recordings of 1 to 10 seconds using VAD. Since it is very common for the VAD to detect background noise as voice even when nothing has been said, I'd like to include these samples in the training set.
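The VAD-based segmentation described here can be sketched framework-agnostically. This is not the reporter's actual pipeline; `is_speech` stands in for any per-frame VAD decision (e.g. webrtcvad's `Vad.is_speech`), and the threshold is an assumed parameter.

```python
def segment(frames, is_speech, max_silence_frames=10):
    """Group consecutive speech frames into segments, splitting the
    recording wherever `max_silence_frames` non-speech frames occur
    in a row. Silence frames are dropped from the output.
    """
    segments, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silence = 0
        else:
            silence += 1
            if current and silence >= max_silence_frames:
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments
```

A noisy `is_speech` is exactly what produces the speech-free segments the reporter wants to feed back as training data.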

kdavis-mozilla commented 5 years ago

I'm still not sure I understand the logic of including such in the training set. Is it to make the system not transcribe "silences", which may include background noise?

The system will learn to ignore "silences", which may include background noise, just from normal training data with non-trivial transcripts, as pauses between words and phrases provide training data for such.

Is your goal in the "live system" to feed such 2h audio recordings, segmented by VAD, to the system? Do you hope that your training with empty transcripts will train the system to deal with VAD failures?

nicolaspanel commented 5 years ago

> I'm still not sure I understand the logic of including such in the training set. Is it to make the system not transcribe "silences", which may include background noise?
>
> The system will learn to ignore "silences", which may include background noise, just from normal training data with non-trivial transcripts, as pauses between words and phrases provide training data for such.

You are right. But since this situation is very common for me, I would like to use these samples as additional training data.

> Is your goal in the "live system" to feed such 2h audio recordings, segmented by VAD, to the system?

Yes.

> Do you hope that your training with empty transcripts will train the system to deal with VAD failures?

My guess is that it will lead to better silence and background-noise detection and thus improve overall accuracy.

reuben commented 5 years ago

The large difference between input length and correct label length could be making the CTC loss explode. You could try making the label several spaces rather than just one, or some other strategy like one space for every 300ms of audio and see if that solves it.
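The "one space per 300ms" strategy suggested above can be sketched as a small labeling helper. The function name and defaults are illustrative assumptions, not DeepSpeech code.

```python
def silence_label(num_samples, sample_rate=16000, ms_per_space=300):
    """Label a speech-free clip with one space per 300 ms of audio
    (at least one space), so the CTC loss does not see a huge
    mismatch between input length and label length.
    """
    duration_ms = 1000.0 * num_samples / sample_rate
    return " " * max(1, int(duration_ms // ms_per_space))
```

For example, a 1-second clip at 16 kHz would get a three-space label instead of a single space.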

kdavis-mozilla commented 5 years ago

@reuben I think "making the label several spaces rather than just one" would have the unwanted effect of causing the model to now output multiple spaces if someone paused "too long" between words, which is also not desired.

reuben commented 5 years ago

Yes, that would have to be fixed in postprocessing. It's not optimal, but would be better than not being able to train the model at all.

kdavis-mozilla commented 5 years ago

@nicolaspanel Wouldn't it be easier to try and use a better VAD? Maybe something like this?

nicolaspanel commented 5 years ago

> @nicolaspanel Wouldn't it be easier to try and use a better VAD? Maybe something like this?

@kdavis-mozilla I am currently using webrtc-vad, which has the advantage of being very fast with very few hyperparameters. Since DS already does a pretty good job, I'm not sure adding a new processing step will be worth the cost. Furthermore, it is not clear how https://github.com/jtkim-kaist/VAD or alternatives would behave when people are laughing, sneezing, etc.

@reuben I will investigate further to confirm it comes from https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss

nicolaspanel commented 5 years ago

@reuben @kdavis-mozilla after further investigation, and after removing the "noise" recordings, I was still facing the inf loss issue.

After a closer look, I noticed a single mislabeled recording which:

1. raised `W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found` (see https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/core/util/ctc/ctc_loss_calculator.cc#L144)
2. produced the inf loss

(probably the same issue as https://github.com/mozilla/DeepSpeech/issues/1910 BTW)
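For context, the "No valid path found" warning fires when no CTC alignment exists, typically because the label needs more time steps than the input provides: each character needs one frame, plus one blank frame between every pair of repeated characters. A small pre-flight check along these lines (the helper name is an assumption, not TensorFlow API) can catch such samples before training:

```python
def ctc_label_fits(transcript, num_frames):
    """Return True if a CTC alignment can exist: the required label
    length (characters plus one blank per repeated-character pair)
    must not exceed the number of input time frames.
    """
    required = len(transcript)
    required += sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return required <= num_frames
```

Filtering the training CSV with such a check avoids losing a run to one bad sample.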

Since it is very frustrating to waste hours of training/investigation on a single mislabeled recording, maybe it could be useful to include a tf.where to replace infs with zeros here:

```python
total_loss = tf.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len)
# replace `inf` with 0
total_loss = tf.where(tf.is_inf(total_loss), tf.zeros_like(total_loss), total_loss)
...
```

What do you think?

reuben commented 5 years ago

If anything, I think we should do the opposite of masking with zero: print an error with the incorrect transcript and stop the training immediately.
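The fail-fast behavior suggested here could look something like the sketch below, run on the per-sample loss vector of each batch. This is an illustrative helper under assumed names, not the patch that landed in DeepSpeech.

```python
import numpy as np

def check_batch_loss(losses, transcripts):
    """Instead of masking non-finite CTC losses with zeros, report the
    offending transcript(s) and abort, so the bad sample can be fixed.
    """
    losses = np.asarray(losses)
    bad = np.flatnonzero(~np.isfinite(losses))
    if bad.size:
        details = ", ".join(repr(transcripts[i]) for i in bad)
        raise ValueError(f"Non-finite CTC loss for transcript(s): {details}")
    return losses
```

Aborting with the transcript in the error message turns hours of silent loss corruption into an immediately actionable report.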

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.