Triplecq opened this issue 8 months ago
More details here:

In-distribution (valid set from ReazonSpeech):
The performance is outstanding as shown below:
| Decoding Method | CER |
|---|---|
| greedy search | 11.67 |
| modified beam search | 11.11 |
```
$ head -n 20 errs-valid-epoch-30-avg-10-streaming-chunk-size-32-modified_beam_search-beam-size-4-use-averaged-model.txt
%WER = 11.11
Errors: 577 insertions, 1418 deletions, 1297 substitutions, over 29630 reference words (26915 correct)
Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:

PER-UTT DETAILS: corr or (ref->hyp)
640000-0: 1 1 時 1 5 分 に な り ま し た 。
640001-1: ニ ュ ー ス を お 伝 え し ま す 。
640002-2: 月 が 地 球 の (影->陰) に 覆 わ れ る (皆 既 月 食->海 域) が 3 年 ぶ り に 日 本 で 見 ら れ ま し た 。
640003-3: 皆 既 月 食 は 太 陽 と 地 球 と 月 が 一 直 線 に 並 び (*->、) 満 月 が 地 球 の (影->陰) に 完 全 に 覆 わ れ る 現 象 で す 。
640004-4: 午 後 6 時 4 4 分 ご ろ か ら 満 月 が (欠->か) け 始 め 午 後 8 時 9 分 ご ろ か ら 午 後 8 時 2 8 分 ご ろ ま で の お よ そ 1 9 分 間 (*->。)
640005-5: 完 全 に (影->陰) に (覆->追) わ れ て 皆 既 月 食 と な り ま し た 。
640006-6: き ょ う は 一 般 に ス ー パ ー ム ー ン と 呼 ば れ る 満 月 と し て は 1 年 で 最 も 地 球 (で->に) 近 づ く 日 で も あ り 最 も 遠 く に あ る 満 月 と 比 べ て 見 (掛->か) け の 直 径 が 1 4 (パ ー セ ン ト->%) 大 き く 見 え ま す 。
640007-7: 晴 れ 間 が 広 が っ た 東 北 や 北 海 道 (*->。)
640008-8: そ れ に 小 笠 原 諸 島 な ど の 各 地 で 観 測 さ れ ま し た 。
640009-9: 国 立 天 文 台 に よ り ま す と 次 に 日 本 で 皆 既 月 食 が 見 ら れ る の は 来 年 1 1 月 8 日 で 部 分 月 食 は こ と し 1 1 月 1 9 日 に 観 測 で き る と い う こ と で す 。
640010-10: 新 型 コ ロ ナ ウ イ ル ス の 影 響 で 倒 産 し た 企 業 の 数 が 去 年 2 月 か ら の 累 計 で 1 5 0 0 社 に な り ま し た 。
```
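The reported rate follows directly from the error counts in the header. As a sanity check, here is a minimal sketch (not icefall's actual code) of the CER formula, (insertions + deletions + substitutions) over the reference length:

```python
# CER from edit-operation counts, matching the errs-* file header:
# 577 ins + 1418 del + 1297 sub over 29630 reference tokens -> 11.11%.
def cer(ins: int, dels: int, subs: int, ref_len: int) -> float:
    return (ins + dels + subs) / ref_len * 100.0

print(round(cer(577, 1418, 1297, 29630), 2))  # → 11.11
```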
However, the performance on other Japanese datasets drops dramatically, e.g., over 50% CER against the JSUT-BASIC5000 corpus.
In-distribution (valid set from ReazonSpeech medium):
```
errs-valid-epoch-30-avg-12-streaming-chunk-size-32-context-2-max-sym-per-frame-1-use-averaged-model.txt
%WER 17.31% [4381 / 25312, 988 ins, 1576 del, 1817 sub ]
greedy_search 17.31 best for valid
```
TEDx:
```
errs-valid-epoch-30-avg-12-streaming-chunk-size-32-modified_beam_search-beam-size-4-use-averaged-model.txt
%WER 42.37% [81230 / 191731, 2024 ins, 68295 del, 10911 sub ]
beam_size_4 42.37 best for valid
```
More results:

| Chunk Size (ms) | Decoding Method | Params | CER |
|---|---|---|---|
| 320 | greedy search | --epoch 8 --avg 6 | 37.43 |
| 320 | modified beam search | --epoch 8 --avg 5 | 33.75 |
| 640 | greedy search | --epoch 8 --avg 6 | 35.57 |
| 640 | modified beam search | --epoch 8 --avg 5 | 32.02 |
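The chunk sizes in milliseconds map onto the frame counts appearing in the result file names (e.g. 320 ms corresponds to streaming-chunk-size-32), assuming the usual 10 ms frame shift. The helper below is purely illustrative, not part of the recipe:

```python
# Illustrative only: convert a streaming chunk size in milliseconds to the
# frame count seen in the errs-* file names, assuming a 10 ms frame shift.
FRAME_SHIFT_MS = 10

def chunk_ms_to_frames(chunk_ms: int) -> int:
    return chunk_ms // FRAME_SHIFT_MS

print(chunk_ms_to_frames(320), chunk_ms_to_frames(640))  # → 32 64
```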
Details of errs-valid-epoch-8-avg-5-streaming-chunk-size-64-modified_beam_search-beam-size-4-use-averaged-model.txt:
```
2024-01-10 07:53:14,758 INFO [utils.py:641] [valid-beam_size_4] %WER 32.02% [61400 / 191731, 3421 ins, 39205 del, 18774 sub ]
2024-01-10 07:53:15,516 INFO [decode.py:617] Wrote detailed error stats to exp/modified_beam_search/errs-valid-epoch-8-avg-5-streaming-chunk-size-64-modified_beam_search-beam-size-4-use-averaged-model.txt
2024-01-10 07:53:15,516 INFO [decode.py:631]
For valid, WER of different settings are:
beam_size_4 32.02 best for valid

uttid_0KTVqevvEjo-00006950-00007305-51: 順 番 に 並 ん で (い->*) る (そ ば->祖 母) か ら (ソ ワ ソ ワ->捜 査 は) し て い ま し た
uttid_0KTVqevvEjo-00009457-00010077-52: (本 当 は 二 ノ 駅 な の に->*) そ の 先 の (三 ノ->3 の) 駅 ま で (を 買 い->よ か り) ま し た
uttid_0KTVqevvEjo-00010934-00011163-53: ど も り を 隠 す た め に
uttid_0KTVqevvEjo-00013025-00013257-54: い ろ ん な 人 が 近 づ い て き ま し た
uttid_0KTVqevvEjo-00015716-00016059-55: (こ う し て->*) ダ ウ ジ ン グ (棒->ボ ー) を 持 っ て い く と
uttid_0KTVqevvEjo-00019479-00019875-56: (ふ と->*) そ の (喋->し ゃ べ) り 方 を (真 似->悪) し て み ま し た
uttid_0KTVqevvEjo-00021596-00021753-57: お 客 さ ん ど ち ら ま で
uttid_0KTVqevvEjo-00023495-00023640-58: (ニ ャ ー オ->*)
uttid_0KTVqevvEjo-00025255-00025454-59: (ち ぐ は ぐ->1 箱) に な っ て し ま い ま し た
uttid_0KTVqevvEjo-00027858-00028055-60: (ペ ラ ペ ラ ペ ラ ペ ラ ペ ラ->だ か ら)
uttid_0KTVqevvEjo-00030292-00030812-61: 繰 り 返 し 繰 り 返 し 一 (心 不 乱 に->人 フ ラ ン キ レ) 練 習 し ま し た
uttid_0KTVqevvEjo-00036245-00036734-62: (す る と ま た わ 身 体->分 か ら だ) の ど こ か ら か な (ん だ->と) か ぶ ら 下 が っ て
uttid_0KTVqevvEjo-00039568-00039827-63: (ス ラ ス ラ ス ラ ス ラ ペ ラ ペ ラ ペ ラ ペ ラ->*)
uttid_0KTVqevvEjo-00043387-00044019-64: (大 笑 い し ま し->あ っ) た (は っ は っ は っ は っ は っ->*)
uttid_0KTVqevvEjo-00047529-00048071-65: (目 の 下 に ク マ->熊) を 作 っ て こ (わ->う) ば っ た (目->ん で 見) つ (き で->け た い)
uttid_0KTVqevvEjo-00051430-00051667-66: 次 の 瞬 間 (危 な->油 ぐ ら) い
uttid_0KTVqevvEjo-00055222-00055516-67: (お い->*) 生 き て る か
uttid_0KTVqevvEjo-00058721-00059045-68: (そ う->*) 難 し く 考 え る こ と は (ね ぇ->な い) ん だ よ
uttid_0KTVqevvEjo-00061889-00062417-69: (あ ん た->*) ど も り が あ っ た (な ぁ 喋 る と き->の は し ゃ べ る 時) に (つ->突) っ (か え た->取) り 繰 り 返 し た り
uttid_0KTVqevvEjo-00064327-00064762-70: 顔 だ け (巣 穴->必 要 は な) か ら 出 し て る (狸->タ ル 君) み た い に (キ ョ ロ キ ョ ロ->協 力 許) し て る よ う な や つ ま で
uttid_0KTVqevvEjo-00069173-00069528-71: (一->1) 度 切 り 離 さ れ た 影 は 二 度 と 戻 ら ね (ぇ->え) よ
uttid_0KTVqevvEjo-00072424-00072677-72: (は は 半 分 い や だ だ だ め だ め->*)
```
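The (ref->hyp) notation above can be tallied mechanically. The rough sketch below is my own illustration, not icefall's code: "*" marks an empty side, so (影->陰) is a substitution, (お い->*) two deletions, and (*->、) an insertion. Tokens paired up inside a group count as substitutions; the remainder count as deletions (extra reference tokens) or insertions (extra hypothesis tokens):

```python
import re

# Rough sketch (not icefall's implementation): count errors from a
# PER-UTT line using its "(ref->hyp)" groups, where "*" is an empty side.
def count_errors(line: str) -> tuple:
    subs = dels = ins = 0
    for ref, hyp in re.findall(r"\(([^)]*?)->([^)]*?)\)", line):
        ref_toks = [t for t in ref.split() if t != "*"]
        hyp_toks = [t for t in hyp.split() if t != "*"]
        subs += min(len(ref_toks), len(hyp_toks))
        dels += max(0, len(ref_toks) - len(hyp_toks))
        ins += max(0, len(hyp_toks) - len(ref_toks))
    return subs, dels, ins

print(count_errors("(影->陰) に (*->、) 覆 わ れ る (お い->*)"))  # → (1, 2, 1)
```

Note that the pairwise alignment within a group is a heuristic; the real scorer aligns whole sequences with edit distance, so counts can differ on complex groups.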
As shown above, the current recipe is prone to deletion errors, especially at the beginning of an utterance; sometimes it even fails to recognize the entire audio at all.
Compared with our recipe on ESPnet (conformer-transformer):

| Test | CER |
|---|---|
| In-Distribution | 19.1 |
| TEDx | 22.83 |
It shows that the current Zipformer recipe is better on the in-distribution validation set; however, it may incur severe deletion errors on other Japanese datasets.
--avg 5 is a little large for --epoch 8. Have you tried other --avg values?
Thanks for the note! Yes, we trained 30 epochs and tried every combination.
Here are the results (chunk size: 320ms, modified beam search):
```
epoch 8 avg 5 modified_beam_search 33.75
epoch 9 avg 7 modified_beam_search 33.78
epoch 8 avg 6 modified_beam_search 33.89
epoch 9 avg 6 modified_beam_search 33.92
epoch 7 avg 4 modified_beam_search 33.97
epoch 7 avg 3 modified_beam_search 33.98
epoch 8 avg 4 modified_beam_search 33.98
epoch 9 avg 8 modified_beam_search 34.21
epoch 9 avg 5 modified_beam_search 34.25
epoch 8 avg 7 modified_beam_search 34.37
epoch 10 avg 8 modified_beam_search 34.48
epoch 10 avg 9 modified_beam_search 34.52
epoch 5 avg 2 modified_beam_search 34.66
epoch 10 avg 7 modified_beam_search 34.69
epoch 6 avg 3 modified_beam_search 34.94
epoch 7 avg 5 modified_beam_search 34.96
epoch 10 avg 6 modified_beam_search 35.08
epoch 7 avg 6 modified_beam_search 35.1
epoch 11 avg 10 modified_beam_search 35.26
...
...
epoch 30 avg 26 modified_beam_search 46.07
epoch 29 avg 26 modified_beam_search 46.1
epoch 21 avg 1 modified_beam_search 46.16
epoch 27 avg 25 modified_beam_search 46.41
epoch 28 avg 27 modified_beam_search 46.49
epoch 30 avg 2 modified_beam_search 46.56
epoch 30 avg 27 modified_beam_search 46.63
epoch 29 avg 28 modified_beam_search 46.98
epoch 28 avg 26 modified_beam_search 47.04
epoch 25 avg 1 modified_beam_search 47.35
epoch 30 avg 29 modified_beam_search 47.62
epoch 29 avg 27 modified_beam_search 47.64
epoch 30 avg 1 modified_beam_search 48.26
epoch 30 avg 28 modified_beam_search 48.29
epoch 3 avg 1 modified_beam_search 104.57
```
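For context on what the sweep varies: decoding with --epoch E --avg N roughly averages the parameters of the last N checkpoints ending at epoch E. A toy sketch of that averaging, with plain dicts of floats standing in for real state_dicts (icefall's --use-averaged-model variant is more elaborate than this):

```python
# Toy sketch of checkpoint averaging behind --epoch E --avg N:
# element-wise mean of the last N checkpoints' parameters.
def average_checkpoints(state_dicts):
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

# e.g. --avg 5 would combine five consecutive epoch checkpoints
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}, {"w": 4.0}, {"w": 5.0}]
print(average_checkpoints(ckpts))  # → {'w': 3.0}
```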
Hi @csukuangfj
We would like to extend our gratitude for your previous feedback and guidance provided via the WeChat discussion group.
We’ve been testing another recipe, specifically focusing on the latest zipformer
model (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer). We are excited to share the details of our validation experiment:
Recipe Used: zipformer

Training Data: ~300h from ReazonSpeech medium

Validation Results:

- In-distribution (ReazonSpeech valid set): CER from 17.31 to 13.7, marking an improvement of nearly 21%;
- TEDx: CER from 32.02 to 26.18, also marking an improvement of around 18%.

These results indicate that the current zipformer recipe not only enhances performance for in-distribution data but also significantly boosts performance across a wider range of Japanese datasets.
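The quoted percentages are relative CER reductions; a quick arithmetic check (nothing recipe-specific):

```python
# Relative CER improvement: (old - new) / old.
def rel_improvement(old: float, new: float) -> float:
    return (old - new) / old * 100.0

print(round(rel_improvement(17.31, 13.7), 1))   # → 20.9 (≈21%)
print(round(rel_improvement(32.02, 26.18), 1))  # → 18.2 (≈18%)
```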
Comparison (with 4 V100 16 GB):

| Model Name | Model Size | In-Distribution CER | TEDx CER | Training Time (mins) |
|---|---|---|---|---|
| our recipe on ESPnet | 115.94M | 19.1 | 22.83 | 531 (33 epochs) |
| pruned_transducer_stateless7_streaming | 75.82M | 17.31 | 32.02 | 280 (30 epochs) |
| zipformer | 71.36M | 13.7 | 26.18 | 267 (30 epochs) |
Deletion Errors in TEDx:

Significant reduction in severe deletion errors:

- pruned_transducer_stateless7_streaming: 39,205
- zipformer: 33,103

This represents a noticeable decrease in deletion errors, enhancing the reliability of the model in various scenarios.
Training Efficacy:

Analysis of different combinations of epoch and avg revealed a crucial finding: unlike previous experiments, training in later epochs consistently improved performance for both in-distribution and TEDx validation tests. This indicates that the current zipformer model benefits from extended training, a deviation from past trends observed in pruned_transducer_stateless7_streaming.
In light of these findings, we have a couple of queries that we hope you can shed light on:

- In the zipformer parameters, --decode-chunk-len has been replaced with --chunk-size. Could you please provide some context regarding this change and how it might impact the model's performance?

We are immensely grateful for the support and dedication of you and the entire team. Our goal is to refine and optimize this recipe to its highest potential, and upon achieving this, we are enthusiastic about contributing it back to the community soon.
Thank you once again for your time and exceptional support!
@Triplecq For the deletion errors, could you have a look at https://github.com/k2-fsa/icefall/pull/1130#issuecomment-1878299568 ? It suggests that a blank penalty helps.
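To illustrate the idea behind a blank penalty (a hedged sketch; the function and parameter names here are illustrative, not the actual implementation in that PR): a constant is subtracted from the blank symbol's score before the search, making the decoder less eager to emit blank, i.e. "output nothing", which curbs deletions.

```python
# Illustrative sketch of a blank penalty: subtract a constant from the
# blank symbol's log-prob before beam search so blank is chosen less
# often, reducing deletion errors.
def apply_blank_penalty(log_probs, blank_id=0, penalty=2.0):
    out = list(log_probs)
    out[blank_id] -= penalty
    return out

print(apply_blank_penalty([5.0, 1.0, 0.5]))  # → [3.0, 1.0, 0.5]
```

The penalty value trades deletions against insertions, so it is typically tuned on a held-out set.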
> where --decode-chunk-len has been replaced with --chunk-size
@yaozengwei Could you help answer it?
Hi Next-gen Kaldi team,
Thank you for your detailed documentation and support through the WeChat discussion group. We have been developing a new recipe for our open-sourced Japanese corpus, ReazonSpeech, and encountered some issues during validation tests.

Recipe Used: pruned_transducer_stateless7_streaming

Training Data: ReazonSpeech, ReazonSpeech medium

Validation Results:

- In-distribution (ReazonSpeech valid set): Excellent performance for both models.
- JSUT-BASIC5000: Over 50% CER, indicating a significant drop in performance.

Issue Description:

- The current recipe is prone to deletion errors, particularly at the start of utterances. In some cases, it fails to recognize the audio entirely.
- We previously trained a conformer-transformer model on ESPnet with the same training data, which resulted in better handling of other Japanese datasets.

Reference Issues:

We're seeking guidance and suggestions on addressing these deletion errors and improving the recipe's adaptability to other Japanese datasets.

Thank you for your time and assistance!