k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Sherpa decoding with zipformer-2 returns Empty hypothesis in few cases #463

Open uni-manjunath-ke opened 1 year ago

uni-manjunath-ke commented 1 year ago

Hi @csukuangfj, we are getting empty hypotheses for some WAV files when decoding with the latest sherpa and zipformer-2. However, when we decode the same audio files with Icefall, we get correct hypotheses.

Further, we noticed that some of the WAV files that returned empty transcriptions had very low volume. We amplified those files and decoded them again with Sherpa, and this time we got a hypothesis instead of an empty one.

Could you please let us know whether there is any difference in the pre-processing or feature extraction done by Icefall and Sherpa? Is it possible to make both of them use the same pre-processing and feature extraction? Or is there a workaround to preprocess these WAV files before passing them to Sherpa? Thanks.
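
A minimal sketch of the kind of amplification workaround described here, assuming 16-bit PCM WAV input and simple peak normalization. The function names and the 0.9 target are hypothetical choices, and nothing in this thread confirms that low volume is the root cause, so treat it as a stopgap for experiments rather than a fix.

```python
import array
import wave


def peak_normalize(samples, target_peak=0.9, full_scale=32767):
    """Scale 16-bit samples so the loudest one reaches target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array.array("h", samples)  # pure silence: nothing to scale
    gain = target_peak * full_scale / peak
    # Clamp to the int16 range in case rounding pushes a value over the edge.
    return array.array(
        "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
    )


def normalize_wav(in_path, out_path, target_peak=0.9):
    """Read a 16-bit PCM WAV, peak-normalize it, and write the result.

    Assumes a little-endian host, since WAV data is little-endian and
    array.array uses the native byte order.
    """
    with wave.open(in_path, "rb") as reader:
        params = reader.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(reader.readframes(params.nframes))
    with wave.open(out_path, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(peak_normalize(samples, target_peak).tobytes())
```

Calling `normalize_wav("quiet.wav", "quiet_norm.wav")` (both paths are placeholders) would produce a louder copy that can be passed to the decoder in place of the original.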

csukuangfj commented 1 year ago

@uni-manjunath-ke

Could you try https://github.com/k2-fsa/sherpa/pull/464 ?

It fixes an issue in sherpa for decoding.

uni-manjunath-ke commented 1 year ago

Sure, will try and update you.

uni-manjunath-ke commented 1 year ago

I verified it. This fix has definitely improved the WERs (by around 0.5%). But on a given dataset, it is still around 3 to 4% worse than what we get using icefall's streaming_decode.py. I also still see empty hypotheses for some utterances, though fewer than before. Please suggest if there is anything else. Thank you.
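
For context on how empty hypotheses feed into these numbers, here is a minimal sketch that assumes the usual WER definition (word-level edit distance divided by the number of reference words, summed over the dataset). It is not icefall's or sherpa's scoring code, and the function names are hypothetical. Under this definition an empty hypothesis counts every reference word as a deletion.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (sub/ins/del all cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion of ref_word
                    cur[j - 1] + 1,  # insertion of hyp_word
                    prev[j - 1] + (ref_word != hyp_word),  # match or substitution
                )
            )
        prev = cur
    return prev[-1]


def word_error_rate(pairs):
    """Dataset-level WER over (reference, hypothesis) string pairs."""
    errors = 0
    ref_len = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("references contain no words")
    return errors / ref_len
```

With this sketch, `word_error_rate([("a b c", "")])` is 1.0, so even a small share of empty outputs can account for a visible part of a WER gap.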

danpovey commented 1 year ago

Could it be a difference in volume normalization? E.g. maybe we do volume normalization only in non-streaming mode?

uni-manjunath-ke commented 1 year ago

Yes, maybe. Is it possible to fix that? Thanks.

uni-manjunath-ke commented 1 year ago

In addition, even in the Icefall streaming variant, we have seen a discrepancy between the WERs of decode.py and streaming_decode.py.

uni-manjunath-ke commented 1 year ago

Initially, our WERs were even worse. But after debugging and comparing the feature extraction of Sherpa and Icefall, we found a discrepancy in the feature extraction settings. After setting the "fbank_opts" parameters in sherpa/cpp_api/feature-config.cc to match the Icefall ones, our WERs improved significantly. We can create a pull request with these changes, if required. Please let us know. Thanks @uni-sagar-raikar
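
A rough sketch of how such a comparison could be done, assuming the fbank options from each side have been dumped by hand into flat dictionaries. The helper name is hypothetical, and the keys and values below are placeholders for illustration only, not the actual settings of either project.

```python
def diff_options(left, right):
    """Return {key: (left_value, right_value)} for every option that differs.

    A key missing on one side is reported with None for that side.
    """
    keys = sorted(set(left) | set(right))
    return {
        key: (left.get(key), right.get(key))
        for key in keys
        if left.get(key) != right.get(key)
    }


# Placeholder values only; replace with the options actually used
# on the training side and on the serving side.
training_side = {"samp_freq": 8000, "num_bins": 80, "dither": 0.0}
serving_side = {"samp_freq": 8000, "num_bins": 80, "dither": 1.0}

for name, (train_value, serve_value) in diff_options(
    training_side, serving_side
).items():
    print(f"{name}: training={train_value} serving={serve_value}")
```

Any option reported by the loop is a candidate for a train/serve mismatch and worth aligning before comparing WERs.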

danpovey commented 1 year ago

Please do!

uni-manjunath-ke commented 1 year ago

Sure, I have created a pull request at https://github.com/k2-fsa/sherpa/pull/465. Thanks.

uni-manjunath-ke commented 1 year ago

Please let us know if there is any fix for volume normalisation in sherpa. If we find anything, we will share it with you. Thanks.

csukuangfj commented 1 year ago

Is your test data sampled at 16000 Hz, i.e., the same sampling rate as the training data?

uni-manjunath-ke commented 1 year ago

Actually, we use 8000 Hz for both training and testing. We modified the sherpa code to handle 8000 Hz and use that version.
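
A small sanity check along these lines, sketched with Python's standard `wave` module, can confirm that each test file really has the sampling rate the model was trained on before it is sent to the decoder. The function name is hypothetical, and the check only covers PCM WAV files that `wave` can open.

```python
import wave


def check_sample_rate(path, expected_hz):
    """Return the WAV file's sampling rate, raising if it is not expected_hz."""
    with wave.open(path, "rb") as reader:
        actual_hz = reader.getframerate()
    if actual_hz != expected_hz:
        raise ValueError(
            f"{path}: sampling rate is {actual_hz} Hz, expected {expected_hz} Hz"
        )
    return actual_hz
```

For the setup described in this comment, the call would be `check_sample_rate(path, 8000)` for every test file.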