k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Problems in streaming decoding with pruned_transducer_stateless5 #605

Open yangsuxia opened 2 years ago

yangsuxia commented 2 years ago

I trained a streaming model on my own data, but the recognition quality is poor when decoding. There are two obvious problems: words are deleted at the end of utterances, and multiple words are inserted in the middle. training: [image] decoding: [image] result: [image]

What should I do to reduce these errors? I look forward to your reply. Thank you!

csukuangfj commented 2 years ago

one is to delete words at the end,

Could you add some tail paddings to your utterances and retry?
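
For readers who want to try this, here is a minimal sketch of tail padding using only the Python standard library. It assumes the test utterances are uncompressed 16-bit PCM WAV files; the function name `pad_tail_silence` and the default of one second are illustrative choices, not part of icefall.

```python
import wave


def pad_tail_silence(in_path: str, out_path: str, pad_seconds: float = 1.0) -> int:
    """Copy a PCM WAV file and append `pad_seconds` of silence.

    Hypothetical helper for illustration. Returns the number of frames appended.
    """
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    num_pad_frames = int(round(pad_seconds * params.framerate))
    # One frame holds one sample per channel. All-zero bytes are silence for
    # signed PCM (e.g. 16-bit); 8-bit WAV is unsigned and would need 0x80.
    silence = b"\x00" * (num_pad_frames * params.nchannels * params.sampwidth)

    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        # The wave module fixes the frame count in the header on close.
        dst.writeframes(frames + silence)
    return num_pad_frames
```

Padding the audio on disk keeps the decoding script unchanged; the same effect could be had by appending zeros to the waveform or features in memory before decoding.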

csukuangfj commented 2 years ago

the other is to insert multiple words in the middle

This problem is much harder to fix in your case since the ground truth contains 3 contiguous . Does your training data contain data patterns like this?

danpovey commented 2 years ago

should be HYP not HYF. Yes, this is the kind of case where E2E models have trouble.

yangsuxia commented 2 years ago

the other is to insert multiple words in the middle

This problem is much harder to fix in your case since the ground truth contains 3 contiguous . Does your training data contain data patterns like this?

When I set avg to 1, the insertion errors decrease a lot.
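
For context, the sketch below illustrates what averaging checkpoints means, assuming `avg` here is the decoding option that sets how many saved checkpoints are averaged into one model. It uses plain Python lists in place of tensors, and `average_checkpoints` is a hypothetical name for this illustration, not icefall's implementation.

```python
def average_checkpoints(
    checkpoints: list[dict[str, list[float]]],
) -> dict[str, list[float]]:
    """Element-wise mean of same-shaped parameter dicts (illustrative only)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


epoch_a = {"w": [1.0, 3.0]}
epoch_b = {"w": [3.0, 5.0]}
two = average_checkpoints([epoch_a, epoch_b])  # mean of both checkpoints
one = average_checkpoints([epoch_b])           # a single checkpoint is returned unchanged
```

With a single checkpoint nothing is averaged, so under this assumption the model decoded with avg set to 1 is exactly the last checkpoint selected.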

csukuangfj commented 2 years ago

the other is to insert multiple words in the middle

This problem is much harder to fix in your case since the ground truth contains 3 contiguous . Does your training data contain data patterns like this?

When I set avg to 1, the insertion error decreases a lot

What is your original setting?

yangsuxia commented 2 years ago

one is to delete words at the end,

Could you add some tail paddings to your utterances and retry?

I added 1 s of silence to the end of each test utterance, and now there are almost no deletions at the end of sentences.

Another question is, what is the possible reason for many deletions in the middle of sentences? image
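
To make the terms concrete, the following sketch counts insertions, deletions and substitutions between a reference (REF) and a hypothesis (HYP) with a standard edit-distance alignment. The function name `count_errors` is hypothetical, and this is not the scoring code used by the recipe; when several alignments tie, the split between error types may differ from another tool's, although the total does not.

```python
def count_errors(ref: list[str], hyp: list[str]) -> dict[str, int]:
    """Count errors along one minimum-edit-distance alignment of ref and hyp."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            delete = (t + 1, s, d + 1, n)    # reference word missing from hyp
            t, s, d, n = cost[i][j - 1]
            insert = (t + 1, s, d, n + 1)    # extra word in hyp
            cost[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    total, subs, dels, ins = cost[-1][-1]
    return {"sub": subs, "del": dels, "ins": ins, "total": total}


tail_deletion = count_errors("a b c d".split(), "a b c".split())
mid_insertion = count_errors("a b".split(), "a x x b".split())
```

Here `tail_deletion` records one deletion (the final word is missing from the hypothesis) and `mid_insertion` records two insertions (extra words appear between two correct ones), matching the two error patterns described in this thread.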

yangsuxia commented 2 years ago

What is your original setting?

My post above shows the original decoding configuration. I described it incorrectly: avg used to be 2, and now it is 1.

csukuangfj commented 2 years ago

one is to delete words at the end,

Could you add some tail paddings to your utterances and retry?

I added 1 s of silence to the end of each test utterance, and now there are almost no deletions at the end of sentences.

Another question is, what is the possible reason for many deletions in the middle of sentences?

image

Have you listened to the audio of these two utterances? Do they sound normal?