srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

Importance of utterance lengths #156

Open ericbolo opened 6 years ago

ericbolo commented 6 years ago

The utterances in the TEDLIUM dataset roughly range from 8 to 15 seconds.

I have a dataset with shorter utterances, ~5 to 10 seconds long.

What are the optimal and minimum lengths of utterances for RNN-CTC?

ericbolo commented 6 years ago

Related question: I know CMVN (cepstral mean and variance normalization) can suffer on short utterances. In my current dataset I have only one utterance per speaker.

Has anyone trained on a similar dataset (short utterances, one utterance per speaker)?

Thanks, all!
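
For context, per-utterance CMVN standardizes each feature dimension using that utterance's own mean and variance, so a 5-second utterance (~500 frames at the usual 100 frames per second) gives fairly noisy estimates. A minimal numpy sketch of plain per-utterance CMVN (illustrative only, not Eesen's actual implementation):

```python
import numpy as np

def cmvn(feats):
    """Plain per-utterance CMVN. feats: (num_frames, num_dims)."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-8  # guard against zero variance
    return (feats - mean) / std
```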

fmetze commented 6 years ago

Yes, CMVN can be sensitive to short utterances. You may want to smooth the statistics across utterances, or use a sliding window, if your data supports that.

We did some experiments in the lorelei branch (new files in the featbin directory) that use power (signal energy) to determine which frames to compute the CMVN statistics on, but they were ultimately inconclusive. The process is to get an alignment (using Kaldi, in this case), compute the CMVN statistics on the non-silence frames only, and then apply them to all frames. Alternatively, you can fake the alignment using power (signal energy) alone, or some other criterion, and use that to determine the non-silence frames. The purpose is to make the CMVN calculation independent of the actual segmentation, which may be arbitrary.
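
As a rough illustration of the power-based variant, here is a hedged numpy sketch; the function name, the percentile threshold, and the assumption that the first coefficient tracks frame energy are illustrative choices, not the actual lorelei/featbin code:

```python
import numpy as np

def energy_gated_cmvn(feats, energy_percentile=30.0):
    """Estimate CMVN statistics on high-energy (non-silence) frames
    only, then apply them to all frames.
    feats: (num_frames, num_dims); assumes feats[:, 0] correlates
    with frame energy (e.g. MFCCs extracted with energy as C0)."""
    energy = feats[:, 0]
    threshold = np.percentile(energy, energy_percentile)
    voiced = feats[energy >= threshold]   # crude power-based "alignment"
    mean = voiced.mean(axis=0)
    std = voiced.std(axis=0) + 1e-8       # avoid division by zero
    return (feats - mean) / std           # normalize ALL frames
```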

Let me know if this works for you, we’d be interested in an update as well.

ericbolo commented 6 years ago

Thank you, @fmetze!

This Kaldi module applies a sliding window for CMVN computation: http://kaldi-asr.org/doc/apply-cmvn-sliding_8cc.html

However, I don't understand the advantage of sliding windows. Is it simply a kind of data augmentation?

As for running CMVN on voiced frames only, I could try using a few voice activity detection algorithms I have at hand.

I will first run the experiment with plain CMVN, then try these optimizations if needed. In any case, I'll keep you apprised. Thanks again for your prompt and detailed answer!

fmetze commented 6 years ago

The sliding window should typically be a few seconds long, no? Then it just computes some local context and assumes that the speaker characteristics don't change quickly. For talks or telephony speech, this is certainly true; for meetings, it may be less so. Keep me posted; I've always wanted to look into this, too.
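
For concreteness, a simplified sketch of what centered sliding-window mean normalization does (roughly the idea behind Kaldi's apply-cmvn-sliding; the 3-second window and the edge handling are assumptions here, not Kaldi's exact logic):

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Subtract a local mean from each frame.
    window: size in frames; 300 frames ~ 3 s at 100 frames/s."""
    half = window // 2
    out = np.empty_like(feats)
    for t in range(len(feats)):
        lo = max(0, t - half)
        hi = min(len(feats), t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)  # mean only, no variance
    return out
```

Because each frame is normalized against its own local context, the result no longer depends on where the utterance boundaries fall, which is the segmentation-independence point above rather than data augmentation.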

ericbolo commented 6 years ago

A quick update: with regular CMVN and no sliding window, the phonetic model reaches 79% token accuracy. So the model learns fairly well despite the short utterances and having only one utterance per speaker.

(Edit: to be more precise, it reaches 90% token accuracy on the training set and 79% on the cross-validation set.)