Silence and Wrong Words

it-muslim / kaldi-helpers

Helper scripts to work with Kaldi

MIT License

Silence and Wrong Words #2

Closed dpny518 closed 5 years ago

dpny518 commented 5 years ago

How do you deal with wrong words or silent audio? The text gets aligned to the wrong parts of the audio, which produces wrong scores. For example, the text is "hello how are you" and the user says "ummm hello hello how are you". I also get scores for silence.

rguliev commented 5 years ago

As far as I understand your question, there is a special phoneme for such "noise" sounds. For example, the lexicon.txt from Kaldi for Dummies looks like:

!SIL sil
<UNK> spn
eight ey t
five f ay v
...

Here spn is a "dummy" phoneme for such sounds. If it does not help, please provide a more descriptive example.
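To make the role of spn concrete, here is a minimal sketch that parses the lexicon lines above and maps a word sequence to phonemes, sending any word missing from the lexicon to the <UNK> entry. The helper names parse_lexicon and words_to_phones are hypothetical and this is not Kaldi's own code; it only illustrates the mapping idea.

```python
# Lexicon lines copied from the example above (the "..." line is omitted).
LEXICON_TEXT = """\
!SIL sil
<UNK> spn
eight ey t
five f ay v
"""


def parse_lexicon(text):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0], parts[1:]
        lexicon.setdefault(word, []).append(phones)
    return lexicon


def words_to_phones(words, lexicon, unk="<UNK>"):
    """Expand words to phonemes; unknown words fall back to the <UNK> entry."""
    phones = []
    for word in words:
        pronunciations = lexicon.get(word, lexicon[unk])
        phones.extend(pronunciations[0])  # assumption: use the first pronunciation
    return phones


lexicon = parse_lexicon(LEXICON_TEXT)
print(words_to_phones(["five", "ummm", "eight"], lexicon))
# → ['f', 'ay', 'v', 'spn', 'ey', 't']
```

Here "ummm" is not in the lexicon, so it is represented by the single spn phoneme.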

dpny518 commented 5 years ago

Can you try it with any audio and text under these two conditions?

Audio with silence and random noise, with no speech matching the alignment text.
Audio with speech similar to the alignment text but with added words.

rguliev commented 5 years ago

I guess that forced alignment would fail in such cases, since it has to fit the given text to the audio whatever was actually said, but decoding might work. I am not sure about that. Besides, it always depends on your task. I think it would be better to just try out a few examples.

I'm closing this issue since it is not relevant to the repo.

dpny518 commented 5 years ago

@rguliev it is an issue with pronunciation assessment though. For example, you ask an English learner to speak the sentence "hello nice to meet you" and he says "helllo umm umm, nice meet nice to meet you".

rguliev commented 5 years ago

With decoding, it might work. But it is better to try it out to be sure :)

dpny518 commented 5 years ago

Yes, the issue is that the scores for mispronounced words and for completely wrong words are very similar. Even when speaking correctly, the score is similar to the score for silence. It is hard to map the scores to a 0 to 100 scale consistently.

rguliev commented 5 years ago

If by "score" you mean the computed per-phoneme probabilities, then it really depends on your case: how big the training dataset is, how well your model is trained, etc. If you want to validate the probs, you can do the following:

Take an utterance which is decoded 100% correctly at the word level. Then check the probs of the phonemes there.
Take an utterance which has one or two incorrectly decoded words. Then check the probs on the correct frames and on the frames close to the mistakes.
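The check described above could be sketched roughly as follows. Everything here is an assumption for illustration: the input layout (phoneme, probability, index of the word it belongs to), the function name summarize_probs, and the numbers, which are made-up placeholders rather than real model output. For simplicity it groups phonemes by word instead of by frame.

```python
from statistics import mean


def summarize_probs(phone_probs, wrong_word_ids):
    """Average phoneme probability inside correct vs. incorrectly decoded words.

    phone_probs: list of (phoneme, probability, word_index) tuples.
    wrong_word_ids: set of word indices that were decoded incorrectly.
    """
    correct = [p for _, p, w in phone_probs if w not in wrong_word_ids]
    wrong = [p for _, p, w in phone_probs if w in wrong_word_ids]
    return {
        "correct_mean": mean(correct) if correct else None,
        "wrong_mean": mean(wrong) if wrong else None,
    }


# Placeholder values only: word 0 decoded correctly, word 1 decoded incorrectly.
phone_probs = [
    ("f", 0.9, 0),
    ("ay", 0.8, 0),
    ("v", 0.7, 0),
    ("ey", 0.3, 1),
    ("t", 0.1, 1),
]
summary = summarize_probs(phone_probs, wrong_word_ids={1})
print(summary)
```

If the two averages are clearly separated on real data, the probabilities carry usable signal; if they are close, the model or the training data is the more likely problem.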

About completely wrong or silent recordings: you should really do some research. It might be reasonable to use a word-level prediction to filter them out.
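One possible reading of the word-level filtering idea, as a rough sketch: decode the audio freely, compare the decoded words with the expected text using a word-level edit distance, and skip pronunciation scoring when they differ too much. The function names and the max_wer threshold of 0.5 are hypothetical choices for illustration and would need tuning on real data.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def should_score(reference, decoded, max_wer=0.5):
    """Return True if the decoded text is close enough to the reference to score."""
    ref = reference.lower().split()
    hyp = decoded.lower().split()
    if not ref:
        return False
    wer = word_edit_distance(ref, hyp) / len(ref)
    return wer <= max_wer


reference = "hello nice to meet you"
print(should_score(reference, "hello nice to meet you"))  # → True
print(should_score(reference, ""))                        # → False
```

With this threshold, a silent recording that decodes to an empty string is rejected before any per-phoneme score is computed, and an utterance with many extra words can be flagged for a retry instead of receiving a misleading score.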