kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Online-CMVN speaker info #3554

Closed: danpovey closed this issue 5 years ago

danpovey commented 5 years ago

@freewym found this while testing the online-cmvn stuff: he found that online decoding with @hugovbraun's PR did not give the same results as regular decoding. @vesis84 you might want to take note of this too.

The issue appears to be a mismatch in how the speaker information is used. In training, we ignore the speaker information, so the online-cmvn binary does not carry forward the CMVN stats from one utterance to the next utterance of the same speaker.

At decode time, that information is used.

What we normally do in training for ivector extraction is to limit the data to 2 utts per speaker, so that there are plenty of examples of "first-utterance-of-this-speaker" and the model works well whether or not the speaker info is known. If we were to use the speaker information for the feature-level CMVN, we should also use these _max2 versions of the data directories. (This would involve a change to the training scripts.)
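For reference, a minimal sketch of how such a _max2 directory is usually made (assuming the standard utils/data/modify_speaker_info.sh script; the data/train paths are just examples):

```bash
# Sketch: split each speaker into pseudo-speakers with at most 2 utterances,
# as is done for the online i-vector training data. Paths are illustrative.
utils/data/modify_speaker_info.sh --utts-per-spk-max 2 \
  data/train data/train_max2
```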

My instinct right now is to modify the training scripts to add the --spk2utt option to apply-cmvn-online in all scripts in the nnet3/ subdirectory that use that binary. That would give us maximum flexibility, because during training we could always just force the speakers to be one per utterance if we wanted.
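To make that concrete, a hedged sketch of what the changed invocation in a training script might look like (the global-stats file and data paths here are illustrative, not the actual script contents):

```bash
# Sketch: carry the online-CMVN state across utterances of the same speaker
# during training, matching what online decoding does. Paths are illustrative.
apply-cmvn-online --spk2utt=ark:data/train/spk2utt \
  exp/nnet3/extractor/global_cmvn.stats \
  scp:data/train/feats.scp ark:cmvn_feats.ark
```

With a 1-to-1 spk2utt (one utterance per "speaker"), the same command reduces to the current per-utterance behaviour.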

KarelVesely84 commented 5 years ago

Hi, can I see the WER numbers?

My thinking was that, if we don't port the cmvn stats across utts from the same speaker, the model will be more robust in situations where the speaker identity cannot be known, for example in the online2-tcp server.

But I did not compare the results with/without porting the cmvn stats... Can you share the results?

danpovey commented 5 years ago

@vesis84 the idea is to leave that decision up to the user, by using the speaker info and letting the user set the max2 thing if needed. Anyway, there would be a mismatch with the i-vector computation if we did it the way you drafted it.

danpovey commented 5 years ago

... also, of course the user can just make spk2utt a 1-to-1 map in training, like max1. But there's no reason to hardcode that into the scripts.
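For example, a hedged one-liner (file locations are illustrative) that builds such a 1-to-1 spk2utt from an existing utt2spk, treating every utterance id as its own speaker:

```bash
# Sketch: each utterance becomes its own "speaker", so no CMVN state is
# shared across utterances. Paths are illustrative.
awk '{print $1, $1}' data/train/utt2spk > data/train_max1/spk2utt
```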

KarelVesely84 commented 5 years ago

I looked into it; in the case without i-vectors there is no mismatch. (In the train/decode scripts no utt2spk is passed to apply-cmvn-online, and in the online2 code the cmvn stats are not passed across utterances, even if utt2spk is supplied to the binary.) Yes, with the i-vector computation, where the stats are transferred, that is another story...

But even without i-vectors I am currently getting very bad results: WER 70%, where I should be getting ~10%. The code in #3560 seems to be okay, and I verified that the OnlineCmvn is active and included in the pipeline with the global cmvn stats loaded. I also compared the features from OnlineNnet2FeaturePipeline with those from the 'bash' pipeline; they are almost the same...

I am getting very many deletions and I am running out of options for what to try. I need to think about what to try next... (this is with online2-wav-nnet3-latgen-faster) K.
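In case it helps with debugging, a hedged sketch of one way to compare the two feature streams, assuming both have already been dumped to archives (online.ark from the online pipeline, offline.ark from the script pipeline; the file names are made up):

```bash
# Sketch: dump both archives to text and diff the start of the first utterance.
copy-feats ark:online.ark ark,t:- | head -n 40 > online.txt
copy-feats ark:offline.ark ark,t:- | head -n 40 > offline.txt
diff online.txt offline.txt | head
```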

freewym commented 5 years ago

Please try this fix

danpovey commented 5 years ago

Resolved in #3615