kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Question about long silences in online decoding. #2558

Closed. notanaverageman closed this issue 6 years ago.

notanaverageman commented 6 years ago

We saw a performance drop during online recognition of recordings that include silences. Adding the silence-weight option fixes recordings that contain up to roughly 1000-1500 ms of continuous silence (we are not sure about the exact duration; it might be somewhat longer). If the silence lasts longer than that, the performance drop becomes an issue again.

Questions:

By the way, we are using Kaldi as a C++ library, so it is possible that we are using some things incorrectly.

Thanks.
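For context, here is a minimal sketch of how silence weighting is typically wired into an online decoding loop when Kaldi is used as a C++ library, modeled on the pattern in online2/online2-wav-nnet3-latgen-faster.cc. Exact method signatures differ between Kaldi versions, so treat the details as illustrative rather than authoritative:

```cpp
// Sketch of silence weighting in an online decoding loop, modeled on
// online2/online2-wav-nnet3-latgen-faster.cc. Signatures vary across
// Kaldi versions; treat this as illustrative, not authoritative.
#include <algorithm>
#include <utility>
#include <vector>
#include "online2/online-nnet2-feature-pipeline.h"
#include "online2/online-nnet3-decoding.h"

using namespace kaldi;

void DecodeInChunks(SingleUtteranceNnet3Decoder &decoder,
                    OnlineNnet2FeaturePipeline &feature_pipeline,
                    OnlineSilenceWeighting &silence_weighting,
                    const VectorBase<BaseFloat> &wave,
                    BaseFloat samp_freq) {
  int32 chunk_size = static_cast<int32>(samp_freq * 0.18);  // ~180 ms chunks
  std::vector<std::pair<int32, BaseFloat> > delta_weights;
  for (int32 offset = 0; offset < wave.Dim(); offset += chunk_size) {
    int32 num_samp = std::min(chunk_size, wave.Dim() - offset);
    SubVector<BaseFloat> wave_part(wave, offset, num_samp);
    feature_pipeline.AcceptWaveform(samp_freq, wave_part);
    // Down-weight frames that the decoder's partial traceback labels as
    // silence, so they contribute little to the online i-vector estimate.
    if (silence_weighting.Active() &&
        feature_pipeline.IvectorFeature() != NULL) {
      silence_weighting.ComputeCurrentTraceback(decoder.Decoder());
      silence_weighting.GetDeltaWeights(feature_pipeline.NumFramesReady(),
                                        &delta_weights);
      feature_pipeline.IvectorFeature()->UpdateFrameWeights(delta_weights);
    }
    decoder.AdvanceDecoding();
  }
  feature_pipeline.InputFinished();
  decoder.AdvanceDecoding();
  decoder.FinalizeDecoding();
}
```

The key point is that GetDeltaWeights produces per-frame weight adjustments from the decoder's current traceback, and UpdateFrameWeights feeds them back into the online i-vector estimator.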

olix20 commented 6 years ago

@yusuf-gunaydin have you considered doing VAD/SAD (voice/speech activity detection) before decoding?
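For readers unfamiliar with the idea, the sketch below is a self-contained energy-based VAD gate (illustrative only, not Kaldi code; Kaldi ships its own energy VAD in ivector/voice-activity-detection.h). The thresholds are placeholders and would need calibration:

```cpp
// A self-contained energy-based VAD sketch (illustrative, not Kaldi code).
// Frames whose RMS energy falls below the threshold are marked as silence,
// so long silent stretches can be trimmed before audio reaches the decoder.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<bool> EnergyVad(const std::vector<float> &samples,
                            int sample_rate,
                            float frame_ms = 25.0f,
                            float shift_ms = 10.0f,
                            float rms_threshold = 0.01f) {  // placeholder value
  const std::size_t frame_len =
      static_cast<std::size_t>(sample_rate * frame_ms / 1000.0f);
  const std::size_t shift =
      static_cast<std::size_t>(sample_rate * shift_ms / 1000.0f);
  std::vector<bool> is_speech;
  for (std::size_t start = 0; start + frame_len <= samples.size();
       start += shift) {
    double sum_sq = 0.0;
    for (std::size_t i = start; i < start + frame_len; ++i)
      sum_sq += static_cast<double>(samples[i]) * samples[i];
    is_speech.push_back(std::sqrt(sum_sq / frame_len) > rms_threshold);
  }
  return is_speech;
}
```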

notanaverageman commented 6 years ago

Normally we use VAD. However, when I talked with the team, they said there may be circumstances in which using VAD is not possible, and in those cases long silences degrade performance.

If Kaldi is expected to perform poorly on long silences even with silence weighting, I can try to persuade my team to use VAD everywhere.

danpovey commented 6 years ago

Could the issue be that your utterances are very long? I don't recommend decoding anything longer than about 15 seconds; things like OpenFst's BestPath won't work very well for such long utterances.
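As a rough illustration of that cap (a hypothetical helper, not part of Kaldi), one could split long audio into segments of at most 15 seconds before decoding; a real splitter would prefer to cut at silences rather than at fixed offsets:

```cpp
// A hypothetical helper (not part of Kaldi) that caps segment length before
// decoding. A real splitter would cut at silences rather than fixed offsets.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<std::vector<float> > SplitIntoSegments(
    const std::vector<float> &samples, int sample_rate,
    float max_seconds = 15.0f) {
  const std::size_t max_len =
      static_cast<std::size_t>(sample_rate * max_seconds);
  std::vector<std::vector<float> > segments;
  for (std::size_t start = 0; start < samples.size(); start += max_len) {
    const std::size_t end = std::min(start + max_len, samples.size());
    segments.emplace_back(samples.begin() + start, samples.begin() + end);
  }
  return segments;
}
```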

It would be interesting to know what kinds of errors you are getting. Is it that after the silence, you get deletions? Just generally a lot of errors? What is the silence-weight you are using? Is this an nnet3-based setup?


notanaverageman commented 6 years ago

The utterances are shorter than 15 seconds. I have added an example utterance as an attachment. The structure is roughly: 0.5 s silence -- 3 s speech -- 2.5 s silence -- 3 s speech -- 2.5 s silence.

The errors are deletions. For the attached file, the continuous (online) recognition result is "bir iki üç dört beş altı merhaba beş", while decoding the full utterance gives the correct result: "bir iki üç dört beş altı merhaba bir iki üç dört beş". The silence is actually between "beş altı" and "merhaba", yet "merhaba" is recognized correctly even though it comes directly after the silence; the words after it are deleted, except the last one.

This is an nnet3-based setup. I have tried a whole range of values for the silence-weight parameter, and it didn't change the result. For these results the value was 0.001.

(I had to zip the wav file; otherwise GitHub would not accept it.)

danpovey commented 6 years ago

Can you show the two command lines that you used to decode the two files? If you send me, off-list, an archive with enough for me to reproduce it (the command lines, the models, the wav files), I will hopefully have time in the next few days to look into it. It could possibly be a bug.


danpovey commented 6 years ago

Also, I assume you didn't forget to set the --ivector-silence-weighting.silence-phones option when you set the --ivector-silence-weighting.silence-weight option?
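To make the pairing concrete, here is a small sketch of setting both fields on the underlying config struct when using the library directly (the phone ids are placeholders; the real list usually comes from data/lang/phones/silence.csl):

```cpp
// Sketch: both the weight and the silence phone list must be set; if the
// phone list is empty, OnlineSilenceWeighting::Active() returns false and
// no down-weighting happens at all. The phone ids below are placeholders;
// the real list usually comes from data/lang/phones/silence.csl.
#include "online2/online-ivector-feature.h"

kaldi::OnlineSilenceWeightingConfig MakeSilenceWeightingConfig() {
  kaldi::OnlineSilenceWeightingConfig config;
  config.silence_weight = 0.001;        // weight applied to silence frames
  config.silence_phones_str = "1:2:3";  // colon-separated phone ids (placeholder)
  return config;
}
```

On the command line, these two fields correspond to the --ivector-silence-weighting.silence-weight and --ivector-silence-weighting.silence-phones options mentioned above.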


notanaverageman commented 6 years ago

Yes, I have set the silence phones option.

The issue seems to occur only with specific model and utterance combinations. I tried another utterance with the same model, and it was recognized correctly; likewise, other models correctly recognize the utterance with silence that I sent earlier.

We are trying to train a small model that reproduces the issue (the one we tried before is a private company model), but we haven't found a failing model and utterance combination yet. As soon as we have a reproducible case, I will post it here.

danpovey commented 6 years ago

Unless this is an issue that shows up clearly in statistics (for instance, the offline decoder recognizing silence on 20% of frames while, on the same data, the online decoder recognizes silence on 25% of frames), I don't think it's something I should get into debugging, because it's probably not a bug. Slight differences between the two decoders are expected, due to small differences in how the i-vectors are averaged and (for recurrent models) how context is handled. So it's not necessarily a bug if one particular file shows a difference in recognition output.
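As an illustration of the kind of statistic meant here (a hypothetical helper, not an existing Kaldi function), one could compute the fraction of frames whose best-path alignment maps to a silence phone and compare it between the offline and online decoders:

```cpp
// A hypothetical helper (not an existing Kaldi function) computing the
// fraction of frames whose best-path alignment maps to a silence phone.
// Comparing this statistic between offline and online decoding on the same
// data would reveal a systematic bias more convincingly than one utterance.
#include <set>
#include <vector>
#include "hmm/transition-model.h"

double SilenceFrameFraction(const kaldi::TransitionModel &trans_model,
                            const std::vector<kaldi::int32> &alignment,
                            const std::set<kaldi::int32> &silence_phones) {
  if (alignment.empty()) return 0.0;
  kaldi::int32 num_sil = 0;
  for (kaldi::int32 tid : alignment)
    if (silence_phones.count(trans_model.TransitionIdToPhone(tid)) != 0)
      ++num_sil;
  return static_cast<double>(num_sil) / alignment.size();
}
```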


notanaverageman commented 6 years ago

OK, I will close this issue now. If we find a consistently reproducible case, I will reopen it.