k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Early Stopping of Token Generation in Streaming Model Training #1717

Open Triplecq opened 1 month ago

Triplecq commented 1 month ago

Hi Next-gen Kaldi team,

Thank you once again for your continuous support and patience with our Japanese ASR recipe and model developments.

We're currently training the streaming model based on our existing recipe, ReazonSpeech. Across experiments with both the regular zipformer and zipformer-L on different dataset sizes (100h, 1000h, and 5000h), we've encountered a consistent issue: the decoded output contains only the first few tokens of each utterance.

Current environment:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

$ python3 -c "import torch; print(torch.__version__)"
2.3.1+cu121

$ python3 -c "import torchaudio; print(torchaudio.__version__)"
2.3.1+cu121

$ python3 -m k2.version
Collecting environment information...

k2 version: 1.24.4
Build type: Release
Git SHA1: 8f976a1e1407e330e2a233d68f81b1eb5269fdaa
Git date: Thu Jun 6 02:13:08 2024
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.10
OS used to build k2: CentOS Linux release 7.9.2009 (Core)
CMake version: 3.29.3
GCC version: 9.3.1
CMAKE_CUDA_FLAGS: -Wno-deprecated-gpu-targets -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_50,code=sm_50 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_60,code=sm_60 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_61,code=sm_61 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_70,code=sm_70 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.3.1+cu121
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /usr/local/lib/python3.10/dist-packages/k2/version/version.py
_k2.__file__: /usr/local/lib/python3.10/dist-packages/_k2.cpython-310-x86_64-linux-gnu.so

$ python3 -c "import lhotse; print(lhotse.__version__)"
1.26.0.dev+git.bd12d5d.clean

Our commands and results:

Training command (regular zipformer):

./zipformer/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --lang data/lang_char \
  --max-duration 1600 

Decoding command:

./zipformer/streaming_decode.py \
   --epoch 30 \
   --avg 15 \
   --causal 1 \
   --chunk-size 32 \
   --left-context-frames 128 \
   --exp-dir zipformer/exp-causal \
   --lang data/lang_char

Some results from errs-test-greedy_search-epoch-30-avg-15-chunk-32-left-context-128-use-averaged-model.txt:

1000-0: (ライブ映像です菅総理のコメントがこれから発表されます->そ)
1001-1: (日経平均株価の午前の終値二万八千八十一円五十五銭と七十四円六十六銭->日)
1002-2: (来年の大統領選挙を控える中で四件目の起訴を受けたわけですが今回も相変わらず選挙妨害だなどと無実を主張しています->ラ)
1003-3: (膿の除去や歯周病の原因となる歯石の除去などのケアを続けたのです->こ)
1004-4: (まずは東京都心のお天気の変化から見てみましょう->ま)
1005-5: (ご準備お願いいたします->でも)
1006-6: (だって上いったら筋見えるよ->だって)
1007-7: (ロシアの潜水艦が日本海でミサイル発射の演習を行いました->ここか)
1008-8: (まあまあまあでもさこれもほらあのトカゲが急に敵におそわれたときしっぽちょん切ってにげるみてえな感じだから->ま)

We also exported this model and tested with sherpa-onnx.

Exporting command:

./zipformer/export-onnx-streaming.py \
  --tokens data/lang_char/tokens.txt \
  --use-averaged-model 0 \
  --epoch 99 \
  --avg 1 \
  --exp-dir zipformer/exp-causal \
  --causal True \
  --chunk-size 16 \
  --left-context-frames 128 \
  --fp16 True

Decoding with Python API examples:

./python-api-examples/online-decode-files.py \
  --tokens=./pretrained-models/k2-streaming/tokens.txt \
  --num-threads=4 \
  --encoder=./pretrained-models/k2-streaming/1000h/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
  --decoder=./pretrained-models/k2-streaming/1000h/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
  --joiner=./pretrained-models/k2-streaming/1000h/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
  ./pretrained-models/k2-streaming/test_wavs/0.wav \
  ./pretrained-models/k2-streaming/test_wavs/1.wav
Started!
Done!
./pretrained-models/k2-streaming/test_wavs/0.wav
ら
----------
./pretrained-models/k2-streaming/test_wavs/1.wav
屯
----------
num_threads: 4
decoding_method: greedy_search
Wave duration: 23.340 s
Elapsed time: 1.159 s
Real time factor (RTF): 1.159/23.340 = 0.050

The outputs we're seeing from both streaming_decode.py and the sherpa-onnx deployment are truncated early in the utterance, leading to significantly shortened or incomplete transcriptions.

We would greatly appreciate any insights or suggestions on how to address these early stopping issues in token generation. We will also open-source this streaming model as soon as we resolve these challenges.

Thank you!

csukuangfj commented 1 month ago

@yaozengwei Could you have a look?

csukuangfj commented 1 month ago

Could you show the tensorboard logs?

Triplecq commented 1 month ago

Thanks for your reply! Please find the attached logs:

(Six TensorBoard screenshots attached, taken 2024-08-16.)
csukuangfj commented 1 month ago

Could you tell us the scale of the final loss, e.g., 0.5 or 0.05?

Also, have you tried to decode some of the training data?

Triplecq commented 3 weeks ago

Thanks for your reply!

I decoded the entire training data over the weekend. Simply put, the behavior is the same: only a few tokens are generated in the output. Here are some details:

%WER = 99.71
Errors: 0 insertions, 0 deletions, 616256 substitutions, over 618036 reference words (1780 correct)
Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:

PER-UTT DETAILS: corr or (ref->hyp)
10000-8900:     (暑さの影響は厨房にも->熱)
100000-98900:   (ナンバーワンツースリー赤イギリスです->ナン)
100001-98901:   (どう抑えるかよりどう点を取るか->抑)
100002-98902:   (競走馬の育成についてプロ顔負けの知識を持っていたという秋本容疑者->し)
100003-98903:   (桜の実が赤くなってむらさきになってやっと生まれた赤ちゃんっていう感じなので本当に待ちに待ってずっと待ってやっとやっと生まれたねっていう喜びを表してらっしゃるのかなと思いました->も)
100004-98904:   (値段変えてもいいですね->正)
100005-98905:   (藤沢女流本因坊もハンマーを持ってたということですか->そ)
100006-98906:   (そこで開発したのが驚きの鮮度回復ワザ->こ)
100007-98907:   (昭和世代・Z世代ともに一位となったのはブラックビスケッツでなるほど->正)
100008-98908:   (もっともっともっともっと前->も)

Note: the utterances counted as correct are cases where the original speech is itself very short, with only one or two tokens.

Triplecq commented 3 weeks ago

For the loss, are you referring to the simple_loss_scale? We used the default simple_loss_scale=0.5 in all experiments.

Here's the training log:

2024-08-05 14:31:50,498 INFO [train.py:1099] (0/8) Training started
2024-08-05 14:31:50,500 INFO [train.py:1109] (0/8) Device: cuda:0
2024-08-05 14:31:50,504 INFO [train.py:1120] (0/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '8f976a1e1407e330e2a233d68f81b1eb5269fdaa', 'k2-git-date': 'Thu Jun 6 02:13:08 2024', 'lhotse-version': '1.26.0.dev+git.bd12d5d.clean', 'torch-version': '2.3.1+cu121', 'torch-cuda-available': True, 'torch-cuda-version': '12.1', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '4af81af5-dirty', 'icefall-git-date': 'Thu Jul 18 22:05:59 2024', 'icefall-path': '/root/k2/tmp/icefall', 'k2-path': '/usr/local/lib/python3.10/dist-packages/k2/__init__.py', 'lhotse-path': '/usr/local/lib/python3.10/dist-packages/lhotse/__init__.py', 'hostname': 'KDA01', 'IP address': '192.168.0.2'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-causal'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.015, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'manifest_dir': PosixPath('data/manifests'), 'max_duration': 1600, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': False, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'lang': PosixPath('data/lang_char'), 'lang_type': None, 'blank_id': 0, 'vocab_size': 3878}

If this is not what you're asking for, please let me know where I can find the specific parameter for you. Thank you so much!

csukuangfj commented 3 weeks ago

what is the final pruned loss? Could you also upload the text log file?
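
(For context, the pruned loss is one of the two transducer loss terms that train.py combines into the total loss; simple_loss_scale=0.5 is only the weighting parameter. A rough sketch of how they are combined, assuming the usual zipformer recipe defaults and an approximate warm-up schedule:)

def combine_losses(simple_loss, pruned_loss, batch_idx_train,
                   warm_step=2000, simple_loss_scale=0.5):
    # During warm-up the simple loss dominates and the pruned loss is
    # down-weighted; after warm_step the configured scales take over.
    # Sketch only; the exact schedule may differ between icefall versions.
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        frac = batch_idx_train / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss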

Triplecq commented 3 weeks ago

Thanks for your quick reply!

Do you mean the following log:

2024-08-05 18:46:38,723 INFO [train.py:1031] (0/8) Epoch 30, batch 250, loss[loss=0.2664, simple_loss=0.2714, pruned_loss=0.1307, over 38129.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.2704, pruned_loss=0.1156, over 5373983.34 frames. ], batch size: 364, lr: 4.18e-03, grad_scale: 4.0
2024-08-05 18:47:09,088 INFO [scaling.py:1024] (0/8) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.31 vs. limit=15.0
2024-08-05 18:47:32,369 INFO [scaling.py:214] (0/8) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=188480.0, ans=0.125
2024-08-05 18:47:51,635 INFO [checkpoint.py:75] (0/8) Saving checkpoint to zipformer/exp-causal/epoch-30.pt
2024-08-05 18:47:56,732 INFO [train.py:1283] (0/8) Done!

I can also upload the entire log if it helps. Thank you!

csukuangfj commented 3 weeks ago

2024-08-05 18:46:38,723 INFO [train.py:1031] (0/8) Epoch 30, batch 250, loss[loss=0.2664, simple_loss=0.2714, pruned_loss=0.1307, over 38129.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.2704, pruned_loss=0.1156, over 5373983.34 frames. ], batch size: 364, lr: 4.18e-03, grad_scale: 4.0

The pruned loss is a bit high. What is the pruned loss of your non-streaming model?

./zipformer/streaming_decode.py \
   --epoch 30 \
   --avg 15 \

Have you tried other combinations instead of --epoch 30 --avg 15? E.g., --epoch 30 --avg 1 etc.
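
(For reference, --epoch 30 --avg 15 decodes with, roughly, an average of the model parameters over the last 15 epochs, so a single poor checkpoint in that window can hurt the averaged model. A simplified, hypothetical sketch of such averaging, not icefall's actual implementation:)

import torch

def average_checkpoints(paths):
    # Average the "model" state dicts of several checkpoints, e.g.
    # epoch-16.pt ... epoch-30.pt for --epoch 30 --avg 15. Purely illustrative.
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}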

Triplecq commented 3 weeks ago

Thanks for these great points! Let me check those numbers and I will get back to you very soon!

Triplecq commented 3 weeks ago

For our previous experiment with the non-streaming model, the pruned loss is only 0.02325. Here are the details:

2024-03-06 21:54:34,277 INFO [train.py:1031] (0/8) Epoch 40, batch 10500, loss[loss=0.111, simple_loss=0.1756, pruned_loss=0.02325, over 38581.00 frames. ], tot_loss[loss=0.1188, simple_loss=0.179, pruned_loss=0.02927, over 7528702.64 frames. ], batch size: 135, lr: 5.98e-04, grad_scale: 2.0

I just tried with --epoch 30 --avg 1 and --epoch 15 --avg 1, but got basically the same result.

csukuangfj commented 3 weeks ago

Do you only change --causal 0 to --causal 1 to train the streaming model and everything else is the same as the non-streaming model?

Triplecq commented 3 weeks ago

Do you only change --causal 0 to --causal 1 to train the streaming model and everything else is the same as the non-streaming model?

Yes! I didn't change anything else other than this parameter.

csukuangfj commented 3 weeks ago

Could you use https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/local/display_manifest_statistics.py to gather the statistics of your data?
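
(If adapting that script, its core is just lhotse's describe(); a minimal equivalent, assuming the training manifest path used in this recipe, would be:)

from lhotse import load_manifest_lazy

# Print cut count, duration percentiles, and speech/silence statistics
# for the training cuts (the manifest path below is an assumption).
cuts = load_manifest_lazy("data/manifests/reazonspeech_cuts_train.jsonl.gz")
cuts.describe()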

Triplecq commented 3 weeks ago

Sure. Here's the detailed statistics of our 1000h experiment:

---------------------------------

reazonspeech_cuts_train.jsonl.gz:
Cut statistics:
╒═══════════════════════════╤═══════════╕
│ Cuts count:               │ 618036    │
├───────────────────────────┼───────────┤
│ Total duration (hh:mm:ss) │ 998:10:47 │
├───────────────────────────┼───────────┤
│ mean                      │ 5.8       │
├───────────────────────────┼───────────┤
│ std                       │ 4.1       │
├───────────────────────────┼───────────┤
│ min                       │ 0.4       │
├───────────────────────────┼───────────┤
│ 25%                       │ 2.8       │
├───────────────────────────┼───────────┤
│ 50%                       │ 4.8       │
├───────────────────────────┼───────────┤
│ 75%                       │ 7.7       │
├───────────────────────────┼───────────┤
│ 99%                       │ 20.1      │
├───────────────────────────┼───────────┤
│ 99.5%                     │ 22.7      │
├───────────────────────────┼───────────┤
│ 99.9%                     │ 27.4      │
├───────────────────────────┼───────────┤
│ max                       │ 30.0      │
├───────────────────────────┼───────────┤
│ Recordings available:     │ 618036    │
├───────────────────────────┼───────────┤
│ Features available:       │ 618036    │
├───────────────────────────┼───────────┤
│ Supervisions available:   │ 618036    │
╘═══════════════════════════╧═══════════╛
Speech duration statistics:
╒══════════════════════════════╤═══════════╤══════════════════════╕
│ Total speech duration        │ 998:10:47 │ 100.00% of recording │
├──────────────────────────────┼───────────┼──────────────────────┤
│ Total speaking time duration │ 998:10:47 │ 100.00% of recording │
├──────────────────────────────┼───────────┼──────────────────────┤
│ Total silence duration       │ 00:00:00  │ 0.00% of recording   │
╘══════════════════════════════╧═══════════╧══════════════════════╛

---------------------------------

reazonspeech_cuts_dev.jsonl.gz:
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 1000     │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 01:33:25 │
├───────────────────────────┼──────────┤
│ mean                      │ 5.6      │
├───────────────────────────┼──────────┤
│ std                       │ 3.7      │
├───────────────────────────┼──────────┤
│ min                       │ 0.6      │
├───────────────────────────┼──────────┤
│ 25%                       │ 2.9      │
├───────────────────────────┼──────────┤
│ 50%                       │ 4.8      │
├───────────────────────────┼──────────┤
│ 75%                       │ 7.4      │
├───────────────────────────┼──────────┤
│ 99%                       │ 18.8     │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 20.0     │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 28.3     │
├───────────────────────────┼──────────┤
│ max                       │ 29.1     │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 1000     │
├───────────────────────────┼──────────┤
│ Features available:       │ 1000     │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 1000     │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤══════════════════════╕
│ Total speech duration        │ 01:33:25 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total speaking time duration │ 01:33:25 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total silence duration       │ 00:00:00 │ 0.00% of recording   │
╘══════════════════════════════╧══════════╧══════════════════════╛
csukuangfj commented 3 weeks ago

Could you also try https://github.com/k2-fsa/icefall/blob/master/egs/reazonspeech/ASR/zipformer/decode.py and see what the WER is?

Triplecq commented 3 weeks ago

Thanks for the suggestion! I decoded with the following command:

./zipformer/decode.py \
    --epoch 30 \
    --avg 15 \
    --exp-dir ./zipformer/exp-causal \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --max-duration 1600 \
    --decoding-method greedy_search \
    --lang data/lang_char
%WER = 58.15
Errors: 563 insertions, 10658 deletions, 1668 substitutions, over 22164 reference words (9838 correct)

0-0: ref=['こ', 'れ', 'ま', 'た', 'ジ', 'ミ', 'ー', 'さ', 'ん']
0-0: hyp=['こ', 'れ', 'ま', 'で', 'ジ', 'ミ', 'さ', 'ん']

1-1: ref=['今', 'も', '相', '手', 'に', 'ロ', 'ン', 'バ', 'ル', 'ド', 'の', 'ほ', 'う', 'に', '肩', '口', 'で', '握', 'ら', 'れ', 'て', 'も', 'す', 'ぐ', 'さ', 'ま', '流', 'れ', 'を', '切', 'る', '引', 'き', '込', 'み', '返', 'し', 'に', '変', 'え', 'た', 'と']
1-1: hyp=['今', 'も', '相', '手', 'に', 'ロ', 'ン', 'バ', 'ル', 'ト', 'の', 'ほ', 'う', 'に', '貴', '子', 'ら', 'れ', 'て', 'も', 'す', 'ぐ', 'す', 'ぐ', 'さ', 'ま', '流', 'れ', 'を', '切', 'る', '返', 'し', 'に', '切', 'り', '替', 'え', 'た', 'と']

10-10: ref=['予', '定', 'を', '大', '幅', 'に', '狂', 'わ', 'せ', 'る', '交', '通', '機', '関', 'の', '乱', 'れ']
10-10: hyp=['こ']

100-100: ref=['矢', '部', 'さ', 'ん', 'で', 'プ', 'ラ', 'ス', '二', '千', '六', '百', '円', 'で', 'す']
100-100: hyp=['そ']

101-101: ref=['現', '場', 'に', 'お', '任', 'せ', '頂', 'け', 'る', 'と', 'い', 'う', '約', '束', 'で', 'す']
101-101: hyp=['現', '場', 'に', 'お', '任', 'せ', 'い', 'た', 'だ', 'け', 'る', 'と', 'い', 'う', '約', '束', 'で', 'す']

It does generate more tokens this time!

According to the documentation, the simulated streaming decoding in decode.py should produce almost the same result as the real chunk-wise streaming decoding in streaming_decode.py, right? Does this imply that there is something wrong in our streaming_decode.py or export-onnx-streaming.py scripts?

Triplecq commented 3 weeks ago

By the way, we happened to decode a longer audio clip (60 seconds) with the aforementioned streaming model. Curiously, it worked this time! Here are the details:

./python-api-examples/online-decode-files.py \
  --tokens=./pretrained-models/k2-streaming/1000h/tokens.txt \
  --encoder=./pretrained-models/k2-streaming/1000h/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
  --decoder=./pretrained-models/k2-streaming/1000h/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
  --joiner=./pretrained-models/k2-streaming/1000h/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
  /Users/qi_chen/Documents/work/asr/validation/tmp/Akazukinchan-60s.wav

Started!
Done!
/Users/qi_chen/Documents/work/asr/validation/tmp/Akazukinchan-60s.wav
それはだれだってちょいとみたがでもだれよりもカレよりもこの子のおばあさんほどこの子をかわいがっているものはなくこの子を見ると何もかもやりたくてやりたくて一体何をやっているのかわからなくなるくらいでしたえてありましたさあちょいといらっしゃい赤ずきんここにお菓子が一つが一人ありますがこれをあげるときっと元気だ
----------
num_threads: 1
decoding_method: greedy_search
Wave duration: 60.000 s
Elapsed time: 4.128 s
Real time factor (RTF): 4.128/60.000 = 0.069

However, this model still doesn't work with ./build/bin/sherpa-onnx-microphone or any other Python API examples using the microphone. It only generates one token and then stops...

Hope this helps!

csukuangfj commented 3 weeks ago

By the way, we happened to decode a longer audio clip (60 seconds) with the aforementioned streaming model. Curiously, it worked this time! [...]

However, this model still doesn't work with ./build/bin/sherpa-onnx-microphone or any other Python API examples using the microphone. It only generates one token and then stops...

Is it possible to share those *.onnx models so that we can debug it locally?

csukuangfj commented 3 weeks ago

Thanks for the suggestion! I decoded with the following command: [...] It does generate more tokens this time!

According to the documentation, the simulated streaming decoding in decode.py should produce almost the same result as the real chunk-wise streaming decoding in streaming_decode.py, right? Does this imply that there is something wrong in our streaming_decode.py or export-onnx-streaming.py scripts?

Could you try different --blank-penalty and see if it helps? https://github.com/k2-fsa/icefall/blob/3fc06cc2b9120a79a3e061bf35cef8d7220a42f3/egs/reazonspeech/ASR/zipformer/decode.py#L375

(You can search for blank penalty in icefall's issues)
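
(For reference, in greedy search the blank penalty simply makes emitting the blank symbol less likely by subtracting a constant from the blank logit before the argmax; a minimal sketch, assuming blank_id = 0:)

import torch

def greedy_step(logits: torch.Tensor, blank_penalty: float = 0.0, blank_id: int = 0) -> int:
    # logits: joiner output for one frame, shape (vocab_size,).
    # Subtracting blank_penalty from the blank logit biases the search
    # towards emitting non-blank tokens.
    if blank_penalty != 0:
        logits = logits.clone()
        logits[blank_id] -= blank_penalty
    return int(logits.argmax())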

Triplecq commented 3 weeks ago

Sure, and really appreciate your help!

Please find the model and its variations in: https://huggingface.co/reazon-research/k2-streaming/tree/main/1000h

Please also let me know when you finish downloading the model, so I can change it back to private mode. Thank you!

Triplecq commented 3 weeks ago

Could you try different --blank-penalty and see if it helps?

Yes, we tried with --blank-penalty; here is how it looks:

blank-penalty=10

1000-0: (ライブ映像です菅総理のコメントがこれから発表されます->そしているのは〈〈〈〈続)
1001-1: (日経平均株価の午前の終値二万八千八十一円五十五銭と七十四円六十六銭->日本でも例えばあったらいただ日本)
1002-2: (来年の大統領選挙を控える中で四件目の起訴を受けたわけですが今回も相変わらず選挙妨害だなどと無実を主張しています->ラ)
1003-3: (膿の除去や歯周病の原因となる歯石の除去などのケアを続けたのです->こ)
1004-4: (まずは東京都心のお天気の変化から見てみましょう->またまたまたまたまた)
1005-5: (ご準備お願いいたします->でもう一度もあっでもしれちゃいま)
1006-6: (だって上いったら筋見えるよ->だったということですねだっててで)
1007-7: (ロシアの潜水艦が日本海でミサイル発射の演習を行いました->ここからここはこここから)
1008-8: (まあまあまあでもさこれもほらあのトカゲが急に敵におそわれたときしっぽちょん切ってにげるみてえな感じだから->まあまりますねもうまいまあまあま)

Generally speaking, it does generate more tokens; however, most of them are nonsense and not even close to the actual speech...

csukuangfj commented 3 weeks ago

Sure, and really appreciate your help!

Please find the model and its variations in: https://huggingface.co/reazon-research/k2-streaming/tree/main/1000h

Please also let me know when you finish downloading the model, so I can change it back to private mode. Thank you!

Thanks! I have downloaded them.

csukuangfj commented 3 weeks ago

Please try a smaller --blank-penalty, e.g., 0.5. You can try several of them, e.g., 0.7, 1.0, 0.1, etc.

Triplecq commented 3 weeks ago

Please try a smaller --blank-penalty, e.g., 0.5. You can try several of them, e.g., 0.7, 1.0, 0.1, etc.

Thanks for the suggestions! Yes, we tried smaller values, but it didn't help: the model either generated no extra tokens or only one or two more.

csukuangfj commented 3 weeks ago

I find that the vocab_size of the model trained on the 1000h of data is 3878.

However, the non-streaming reazonspeech model's vocab size is 5224.

Do you prepare the reazonspeech dataset differently for the non-streaming zipformer and streaming zipformer? Is there a reason that they have a different vocab_size?
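
(If it helps, a quick way to check is to compare the two token inventories directly; a small sketch, where the non-streaming lang dir path is only a placeholder:)

def read_tokens(path):
    # tokens.txt has one "token id" pair per line; keep only the token column.
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

streaming = read_tokens("data/lang_char/tokens.txt")                        # streaming run
non_streaming = read_tokens("/path/to/non-streaming/lang_char/tokens.txt")  # placeholder path
print(len(streaming), len(non_streaming))
print("tokens only in the non-streaming lang dir:", len(non_streaming - streaming))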

csukuangfj commented 2 days ago

By the way, https://github.com/k2-fsa/icefall/issues/1724#issuecomment-2346018714 is very similar to this issue. @Triplecq Could you check whether there are issues with your features?
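
(A minimal sanity check on the stored fbank features might look like the sketch below, assuming the same manifest path as above; NaNs, all-zero frames, or a wildly off mean/std would point to a feature-extraction problem.)

import numpy as np
from lhotse import load_manifest_lazy

cuts = load_manifest_lazy("data/manifests/reazonspeech_cuts_train.jsonl.gz")
for i, cut in enumerate(cuts):
    feats = cut.load_features()  # (num_frames, 80) fbank matrix
    print(cut.id, feats.shape, float(feats.mean()), float(feats.std()), bool(np.isnan(feats).any()))
    if i == 9:  # only inspect the first few cuts
        break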