Triplecq opened this issue 8 months ago
More details here:

In-distribution (valid set from ReazonSpeech):
The performance is outstanding as shown below:
| Decoding Method | CER |
|---|---|
| greedy search | 11.67 |
| modified beam search | 11.11 |
```
$ head -n 20 errs-valid-epoch-30-avg-10-streaming-chunk-size-32-modified_beam_search-beam-size-4-use-averaged-model.txt
%WER = 11.11
Errors: 577 insertions, 1418 deletions, 1297 substitutions, over 29630 reference words (26915 correct)
Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:

PER-UTT DETAILS: corr or (ref->hyp)
640000-0: 1 1 時 1 5 分 に な り ま し た 。
640001-1: ニ ュ ー ス を お 伝 え し ま す 。
640002-2: 月 が 地 球 の (影->陰) に 覆 わ れ る (皆 既 月 食->海 域) が 3 年 ぶ り に 日 本 で 見 ら れ ま し た 。
640003-3: 皆 既 月 食 は 太 陽 と 地 球 と 月 が 一 直 線 に 並 び (*->、) 満 月 が 地 球 の (影->陰) に 完 全 に 覆 わ れ る 現 象 で す 。
640004-4: 午 後 6 時 4 4 分 ご ろ か ら 満 月 が (欠->か) け 始 め 午 後 8 時 9 分 ご ろ か ら 午 後 8 時 2 8 分 ご ろ ま で の お よ そ 1 9 分 間 (*->。)
640005-5: 完 全 に (影->陰) に (覆->追) わ れ て 皆 既 月 食 と な り ま し た 。
640006-6: き ょ う は 一 般 に ス ー パ ー ム ー ン と 呼 ば れ る 満 月 と し て は 1 年 で 最 も 地 球 (で->に) 近 づ く 日 で も あ り 最 も 遠 く に あ る 満 月 と 比 べ て 見 (掛->か) け の 直 径 が 1 4 (パ ー セ ン ト->%) 大 き く 見 え ま す 。
640007-7: 晴 れ 間 が 広 が っ た 東 北 や 北 海 道 (*->。)
640008-8: そ れ に 小 笠 原 諸 島 な ど の 各 地 で 観 測 さ れ ま し た 。
640009-9: 国 立 天 文 台 に よ り ま す と 次 に 日 本 で 皆 既 月 食 が 見 ら れ る の は 来 年 1 1 月 8 日 で 部 分 月 食 は こ と し 1 1 月 1 9 日 に 観 測 で き る と い う こ と で す 。
640010-10: 新 型 コ ロ ナ ウ イ ル ス の 影 響 で 倒 産 し た 企 業 の 数 が 去 年 2 月 か ら の 累 計 で 1 5 0 0 社 に な り ま し た 。
```
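The reported rate follows directly from the error counts in the header. As a sanity check, here is a minimal sketch (not icefall's actual code) of the CER formula, (insertions + deletions + substitutions) over the reference length:

```python
# CER from edit-operation counts, matching the errs-* file header:
# 577 ins + 1418 del + 1297 sub over 29630 reference tokens -> 11.11%.
def cer(ins: int, dels: int, subs: int, ref_len: int) -> float:
    return (ins + dels + subs) / ref_len * 100.0

print(round(cer(577, 1418, 1297, 29630), 2))  # → 11.11
```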
However, the performance on other Japanese datasets drops dramatically, e.g., over 50% CER against the JSUT-BASIC5000 corpus.
In-distribution (valid set from ReazonSpeech medium):
```
errs-valid-epoch-30-avg-12-streaming-chunk-size-32-context-2-max-sym-per-frame-1-use-averaged-model.txt
%WER 17.31% [4381 / 25312, 988 ins, 1576 del, 1817 sub ]
greedy_search 17.31 best for valid
```
TEDx:
```
errs-valid-epoch-30-avg-12-streaming-chunk-size-32-modified_beam_search-beam-size-4-use-averaged-model.txt
%WER 42.37% [81230 / 191731, 2024 ins, 68295 del, 10911 sub ]
beam_size_4 42.37 best for valid
```
More results:

| Chunk Size (ms) | Decoding Method | Params | CER |
|---|---|---|---|
| 320 | greedy search | --epoch 8 --avg 6 | 37.43 |
| 320 | modified beam search | --epoch 8 --avg 5 | 33.75 |
| 640 | greedy search | --epoch 8 --avg 6 | 35.57 |
| 640 | modified beam search | --epoch 8 --avg 5 | 32.02 |
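The chunk sizes in milliseconds map onto the frame counts appearing in the result file names (e.g. 320 ms corresponds to streaming-chunk-size-32), assuming the usual 10 ms frame shift. The helper below is purely illustrative, not part of the recipe:

```python
# Illustrative only: convert a streaming chunk size in milliseconds to the
# frame count seen in the errs-* file names, assuming a 10 ms frame shift.
FRAME_SHIFT_MS = 10

def chunk_ms_to_frames(chunk_ms: int) -> int:
    return chunk_ms // FRAME_SHIFT_MS

print(chunk_ms_to_frames(320), chunk_ms_to_frames(640))  # → 32 64
```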
Details of errs-valid-epoch-8-avg-5-streaming-chunk-size-64-modified_beam_search-beam-size-4-use-averaged-model.txt:
```
2024-01-10 07:53:14,758 INFO [utils.py:641] [valid-beam_size_4] %WER 32.02% [61400 / 191731, 3421 ins, 39205 del, 18774 sub ]
2024-01-10 07:53:15,516 INFO [decode.py:617] Wrote detailed error stats to exp/modified_beam_search/errs-valid-epoch-8-avg-5-streaming-chunk-size-64-modified_beam_search-beam-size-4-use-averaged-model.txt
2024-01-10 07:53:15,516 INFO [decode.py:631]
For valid, WER of different settings are:
beam_size_4 32.02 best for valid

uttid_0KTVqevvEjo-00006950-00007305-51: 順 番 に 並 ん で (い->*) る (そ ば->祖 母) か ら (ソ ワ ソ ワ->捜 査 は) し て い ま し た
uttid_0KTVqevvEjo-00009457-00010077-52: (本 当 は 二 ノ 駅 な の に->*) そ の 先 の (三 ノ->3 の) 駅 ま で (を 買 い->よ か り) ま し た
uttid_0KTVqevvEjo-00010934-00011163-53: ど も り を 隠 す た め に
uttid_0KTVqevvEjo-00013025-00013257-54: い ろ ん な 人 が 近 づ い て き ま し た
uttid_0KTVqevvEjo-00015716-00016059-55: (こ う し て->*) ダ ウ ジ ン グ (棒->ボ ー) を 持 っ て い く と
uttid_0KTVqevvEjo-00019479-00019875-56: (ふ と->*) そ の (喋->し ゃ べ) り 方 を (真 似->悪) し て み ま し た
uttid_0KTVqevvEjo-00021596-00021753-57: お 客 さ ん ど ち ら ま で
uttid_0KTVqevvEjo-00023495-00023640-58: (ニ ャ ー オ->*)
uttid_0KTVqevvEjo-00025255-00025454-59: (ち ぐ は ぐ->1 箱) に な っ て し ま い ま し た
uttid_0KTVqevvEjo-00027858-00028055-60: (ペ ラ ペ ラ ペ ラ ペ ラ ペ ラ->だ か ら)
uttid_0KTVqevvEjo-00030292-00030812-61: 繰 り 返 し 繰 り 返 し 一 (心 不 乱 に->人 フ ラ ン キ レ) 練 習 し ま し た
uttid_0KTVqevvEjo-00036245-00036734-62: (す る と ま た わ 身 体->分 か ら だ) の ど こ か ら か な (ん だ->と) か ぶ ら 下 が っ て
uttid_0KTVqevvEjo-00039568-00039827-63: (ス ラ ス ラ ス ラ ス ラ ペ ラ ペ ラ ペ ラ ペ ラ->*)
uttid_0KTVqevvEjo-00043387-00044019-64: (大 笑 い し ま し->あ っ) た (は っ は っ は っ は っ は っ->*)
uttid_0KTVqevvEjo-00047529-00048071-65: (目 の 下 に ク マ->熊) を 作 っ て こ (わ->う) ば っ た (目->ん で 見) つ (き で->け た い)
uttid_0KTVqevvEjo-00051430-00051667-66: 次 の 瞬 間 (危 な->油 ぐ ら) い
uttid_0KTVqevvEjo-00055222-00055516-67: (お い->*) 生 き て る か
uttid_0KTVqevvEjo-00058721-00059045-68: (そ う->*) 難 し く 考 え る こ と は (ね ぇ->な い) ん だ よ
uttid_0KTVqevvEjo-00061889-00062417-69: (あ ん た->*) ど も り が あ っ た (な ぁ 喋 る と き->の は し ゃ べ る 時) に (つ->突) っ (か え た->取) り 繰 り 返 し た り
uttid_0KTVqevvEjo-00064327-00064762-70: 顔 だ け (巣 穴->必 要 は な) か ら 出 し て る (狸->タ ル 君) み た い に (キ ョ ロ キ ョ ロ->協 力 許) し て る よ う な や つ ま で
uttid_0KTVqevvEjo-00069173-00069528-71: (一->1) 度 切 り 離 さ れ た 影 は 二 度 と 戻 ら ね (ぇ->え) よ
uttid_0KTVqevvEjo-00072424-00072677-72: (は は 半 分 い や だ だ だ め だ め->*)
```
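The (ref->hyp) notation above can be tallied mechanically. The rough sketch below is my own illustration, not icefall's code: "*" marks an empty side, so (影->陰) is a substitution, (お い->*) two deletions, and (*->、) an insertion. Tokens paired up inside a group count as substitutions; the remainder count as deletions (extra reference tokens) or insertions (extra hypothesis tokens):

```python
import re

# Rough sketch (not icefall's implementation): count errors from a
# PER-UTT line using its "(ref->hyp)" groups, where "*" is an empty side.
def count_errors(line: str) -> tuple:
    subs = dels = ins = 0
    for ref, hyp in re.findall(r"\(([^)]*?)->([^)]*?)\)", line):
        ref_toks = [t for t in ref.split() if t != "*"]
        hyp_toks = [t for t in hyp.split() if t != "*"]
        subs += min(len(ref_toks), len(hyp_toks))
        dels += max(0, len(ref_toks) - len(hyp_toks))
        ins += max(0, len(hyp_toks) - len(ref_toks))
    return subs, dels, ins

print(count_errors("(影->陰) に (*->、) 覆 わ れ る (お い->*)"))  # → (1, 2, 1)
```

Note that the pairwise alignment within a group is a heuristic; the real scorer aligns whole sequences with edit distance, so counts can differ on complex groups.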
As shown above, the current recipe is prone to deletion errors, especially at the beginning of an utterance; sometimes it even fails to recognize the entire audio at all.
Compared with our recipe on ESPnet (conformer-transformer):

| Test | CER |
|---|---|
| In-Distribution | 19.1 |
| TEDx | 22.83 |
It shows that the current Zipformer recipe is better on the in-distribution validation set; however, it may incur severe deletion errors on other Japanese datasets.
--avg 5 is a little large for --epoch 8. Have you tried other --avg values?
Thanks for the note! Yes, we trained 30 epochs and tried every combination.
Here are the results (chunk size: 320ms, modified beam search):
```
epoch 8 avg 5 modified_beam_search 33.75
epoch 9 avg 7 modified_beam_search 33.78
epoch 8 avg 6 modified_beam_search 33.89
epoch 9 avg 6 modified_beam_search 33.92
epoch 7 avg 4 modified_beam_search 33.97
epoch 7 avg 3 modified_beam_search 33.98
epoch 8 avg 4 modified_beam_search 33.98
epoch 9 avg 8 modified_beam_search 34.21
epoch 9 avg 5 modified_beam_search 34.25
epoch 8 avg 7 modified_beam_search 34.37
epoch 10 avg 8 modified_beam_search 34.48
epoch 10 avg 9 modified_beam_search 34.52
epoch 5 avg 2 modified_beam_search 34.66
epoch 10 avg 7 modified_beam_search 34.69
epoch 6 avg 3 modified_beam_search 34.94
epoch 7 avg 5 modified_beam_search 34.96
epoch 10 avg 6 modified_beam_search 35.08
epoch 7 avg 6 modified_beam_search 35.1
epoch 11 avg 10 modified_beam_search 35.26
...
...
epoch 30 avg 26 modified_beam_search 46.07
epoch 29 avg 26 modified_beam_search 46.1
epoch 21 avg 1 modified_beam_search 46.16
epoch 27 avg 25 modified_beam_search 46.41
epoch 28 avg 27 modified_beam_search 46.49
epoch 30 avg 2 modified_beam_search 46.56
epoch 30 avg 27 modified_beam_search 46.63
epoch 29 avg 28 modified_beam_search 46.98
epoch 28 avg 26 modified_beam_search 47.04
epoch 25 avg 1 modified_beam_search 47.35
epoch 30 avg 29 modified_beam_search 47.62
epoch 29 avg 27 modified_beam_search 47.64
epoch 30 avg 1 modified_beam_search 48.26
epoch 30 avg 28 modified_beam_search 48.29
epoch 3 avg 1 modified_beam_search 104.57
```
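For context on what the sweep varies: decoding with --epoch E --avg N roughly averages the parameters of the last N checkpoints ending at epoch E. A toy sketch of that averaging, with plain dicts of floats standing in for real state_dicts (icefall's --use-averaged-model variant is more elaborate than this):

```python
# Toy sketch of checkpoint averaging behind --epoch E --avg N:
# element-wise mean of the last N checkpoints' parameters.
def average_checkpoints(state_dicts):
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

# e.g. --avg 5 would combine five consecutive epoch checkpoints
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}, {"w": 4.0}, {"w": 5.0}]
print(average_checkpoints(ckpts))  # → {'w': 3.0}
```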
Hi @csukuangfj
We would like to extend our gratitude for your previous feedback and guidance provided via the WeChat discussion group.
We’ve been testing another recipe, specifically focusing on the latest zipformer
model (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer). We are excited to share the details of our validation experiment:
Recipe Used: zipformer

Training Data: ~300h from ReazonSpeech medium

Validation Results:

- In-distribution (ReazonSpeech valid set): CER from 17.31 to 13.7, marking an improvement of nearly 21%;
- TEDx: CER from 32.02 to 26.18, also marking an improvement of around 18%.

These results indicate that the current zipformer recipe not only enhances performance for in-distribution data but also significantly boosts performance across a wider range of Japanese datasets.
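The quoted percentages are relative CER reductions; a quick arithmetic check (nothing recipe-specific):

```python
# Relative CER improvement: (old - new) / old.
def rel_improvement(old: float, new: float) -> float:
    return (old - new) / old * 100.0

print(round(rel_improvement(17.31, 13.7), 1))   # → 20.9 (≈21%)
print(round(rel_improvement(32.02, 26.18), 1))  # → 18.2 (≈18%)
```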
Comparison (with 4 V100 16 GB):

| Model Name | Model Size | In-Distribution CER | TEDx CER | Training Time (mins) |
|---|---|---|---|---|
| our recipe on ESPnet | 115.94M | 19.1 | 22.83 | 531 (33 epochs) |
| pruned_transducer_stateless7_streaming | 75.82M | 17.31 | 32.02 | 280 (30 epochs) |
| zipformer | 71.36M | 13.7 | 26.18 | 267 (30 epochs) |
Deletion Errors in TEDx:

Significant reduction in severe deletion errors:

- pruned_transducer_stateless7_streaming: 39,205
- zipformer: 33,103

This represents a noticeable decrease in deletion errors, enhancing the reliability of the model in various scenarios.
Training Efficacy:

Analysis of different combinations of epoch and avg revealed a crucial finding: unlike previous experiments, training in later epochs consistently improved performance for both in-distribution and TEDx validation tests. This indicates that the current zipformer model benefits from extended training, a deviation from past trends observed in pruned_transducer_stateless7_streaming.
In light of these findings, we have a couple of queries that we hope you can shed light on:

- In the zipformer parameters, --decode-chunk-len has been replaced with --chunk-size. Could you please provide some context regarding this change and how it might impact the model's performance?

We are immensely grateful for the support and dedication of you and the entire team. Our goal is to refine and optimize this recipe to its highest potential, and upon achieving this, we are enthusiastic about contributing it back to the community soon.
Thank you once again for your time and exceptional support!
@Triplecq For the deletion errors, could you have a look at https://github.com/k2-fsa/icefall/pull/1130#issuecomment-1878299568 ? It suggests that a blank penalty helps.
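To illustrate the idea behind a blank penalty (a hedged sketch; the function and parameter names here are illustrative, not the actual implementation in that PR): a constant is subtracted from the blank symbol's score before the search, making the decoder less eager to emit blank, i.e. "output nothing", which curbs deletions.

```python
# Illustrative sketch of a blank penalty: subtract a constant from the
# blank symbol's log-prob before beam search so blank is chosen less
# often, reducing deletion errors.
def apply_blank_penalty(log_probs, blank_id=0, penalty=2.0):
    out = list(log_probs)
    out[blank_id] -= penalty
    return out

print(apply_blank_penalty([5.0, 1.0, 0.5]))  # → [3.0, 1.0, 0.5]
```

The penalty value trades deletions against insertions, so it is typically tuned on a held-out set.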
> where --decode-chunk-len has been replaced with --chunk-size
@yaozengwei Could you help answer it?
Hi Next-gen Kaldi team,
Thank you for your detailed documentation and support through the WeChat discussion group. We have been developing a new recipe for our open-sourced Japanese corpus, ReazonSpeech, and encountered some issues during validation tests.

Recipe Used: pruned_transducer_stateless7_streaming

Training Data: ReazonSpeech, ReazonSpeech medium

Validation Results:

- In-distribution (ReazonSpeech valid set): Excellent performance for both models.
- JSUT-BASIC5000: Over 50% CER, indicating a significant drop in performance.

Issue Description:

- The current recipe is prone to deletion errors, particularly at the start of utterances. In some cases, it fails to recognize the audio entirely.
- We previously trained a conformer-transformer model on ESPnet with the same training data, which resulted in better handling of other Japanese datasets.

Reference Issues:

We're seeking guidance and suggestions on addressing these deletion errors and improving the recipe's adaptability to other Japanese datasets.

Thank you for your time and assistance!