k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Pruned transducer stateless2 experiments with a big context size (such as 3 or 4) while the default=2 #381

Open luomingshuang opened 2 years ago

luomingshuang commented 2 years ago

I open this issue to explore how a larger context size (the default is 2) affects the performance of pruned transducer stateless2. I will start with context-size=4 on the librispeech 100-hour subset, using context-size=2 as the baseline.
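For reference, the "stateless" decoder conditions only on the last `context_size` non-blank labels rather than the full history (the real implementation, in `pruned_transducer_stateless2/decoder.py`, embeds the labels and applies a 1-D convolution over them). A minimal pure-Python sketch of how that fixed-length context window is formed during decoding; the function name is hypothetical:

```python
BLANK_ID = 0  # blank/padding symbol, as in the icefall recipes

def decoder_context(hyp, context_size):
    """Return the last `context_size` emitted labels, left-padded
    with the blank id, as the input window of a stateless decoder."""
    padded = [BLANK_ID] * context_size + hyp
    return padded[-context_size:]

# At the start of decoding the window is all blanks; afterwards only
# the most recent `context_size` emitted labels are kept.
print(decoder_context([], 2))         # [0, 0]
print(decoder_context([7, 3, 9], 2))  # [3, 9]
print(decoder_context([7, 3, 9], 4))  # [0, 7, 3, 9]
```

Growing `context_size` from 2 to 4 therefore only widens this window; the rest of the model is unchanged.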

danpovey commented 2 years ago

OK. We might need to repeat this with more data, too... it's not clear to me that this is as likely to be helpful when the amount of data is small. We will see.

luomingshuang commented 2 years ago

Tensorboard logs for context-size=2 and context-size=4: https://tensorboard.dev/experiment/twj2EejdQxSXybzFNDAZaw/#scalars&_smoothingWeight=0.74


It seems that context-size=4 converges more quickly than context-size=2. For each training epoch, the context-size=4 model (78649064 parameters) takes about 40 seconds longer than the context-size=2 model (78648040 parameters).
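As a back-of-the-envelope check (my own arithmetic, not from the logs), the parameter gap between the two models is consistent with widening a depthwise 1-D convolution over the label context: one weight per decoder dimension per extra context position.

```python
decoder_dim = 512  # 'decoder_dim': 512 in the decode.py config dump

params_ctx4 = 78649064  # from the context-size=4 log below
params_ctx2 = 78648040  # from the context-size=2 log below
extra = params_ctx4 - params_ctx2

# A depthwise Conv1d over the label context has decoder_dim weights
# per kernel position, so widening the kernel from 2 to 4 would add:
expected = decoder_dim * (4 - 2)
print(extra, expected)  # 1024 1024
```

The match suggests the only added parameters are the two extra convolution taps, so the comparison is essentially iso-parameter.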

Some decoding results trained with librispeech 100 hours data:

| context-size | decoding-method | epoch | avg | test-clean | test-other |
|---|---|---|---|---|---|
| 2 | greedy_search | 29 | 10 | 6.93 | 17.97 |
| 4 | greedy_search | 29 | 10 | 6.93 | 17.96 |
| 2 | greedy_search | 29 | 15 | 6.92 | 17.91 |
| 4 | greedy_search | 29 | 15 | 6.86 | 17.88 |
| 2 | greedy_search | 29 | 20 | 6.99 | 18.05 |
| 4 | greedy_search | 29 | 20 | 6.93 | 18.03 |
| 2 | modified_beam_search | 29 | 15 | 6.73 | 17.44 |
| 4 | modified_beam_search | 29 | 15 | 6.66 | 17.44 |
| 2 | fast_beam_search | 29 | 15 | 6.86 | 17.48 |
| 4 | fast_beam_search | 29 | 15 | 6.68 | 17.46 |

According to the above table, context-size=4 performs slightly better than context-size=2 across the decoding methods.

As for decoding time with greedy_search, context-size=4 seems to take about the same as context-size=2. Here are some decoding logs:

(k2-python) luomingshuang@de-74279-k2-train-9-0425111216-65f66bdf4-bkrql:~/codes/icefall-librispeech-pruned-rnnt2-more-states-for-predict/egs/librispeech/ASR$ CUDA_VISIBLE_DEVICES='4' python pruned_transducer_stateless2/decode.py --epoch 29 --avg 15 --decoding-method greedy_search --max-duration 600 --exp-dir pruned_transducer_stateless2/exp-context-size-4 --context-size 4
2022-05-24 20:56:21,915 INFO [decode.py:477] Decoding started
2022-05-24 20:56:21,916 INFO [decode.py:483] Device: cuda:0
2022-05-24 20:56:21,919 INFO [decode.py:493] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f8d2dba06c000ffee36aab5b66f24e7c9809f116', 'k2-git-date': 'Thu Apr 21 12:20:34 2022', 'lhotse-version': '1.2.0.dev+git.de75634.dirty', 'torch-version': '1.11.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'librispeech-pruned-rnnt2-more-context-for-predict', 'icefall-git-sha1': '2f1e23c-clean', 'icefall-git-date': 'Mon May 23 14:39:11 2022', 'icefall-path': '/ceph-meixu/luomingshuang/icefall', 'k2-path': '/ceph-ms/luomingshuang/k2_latest/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.2.0.dev0+git.de75634.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-9-0425111216-65f66bdf4-bkrql', 'IP address': '10.177.77.9'}, 'epoch': 29, 'iter': 0, 'avg': 15, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp-context-size-4'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'decoding_method': 'greedy_search', 'beam_size': 4, 'beam': 4, 'max_contexts': 4, 'max_states': 8, 'context_size': 4, 'max_sym_per_frame': 1, 'full_libri': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 600, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'res_dir': 
PosixPath('pruned_transducer_stateless2/exp-context-size-4/greedy_search'), 'suffix': 'epoch-29-avg-15-context-4-max-sym-per-frame-1', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2022-05-24 20:56:21,920 INFO [decode.py:495] About to create model
2022-05-24 20:56:22,360 INFO [decode.py:523] averaging ['pruned_transducer_stateless2/exp-context-size-4/epoch-15.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-16.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-17.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-18.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-19.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-20.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-21.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-22.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-23.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-24.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-25.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-26.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-27.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-28.pt', 'pruned_transducer_stateless2/exp-context-size-4/epoch-29.pt']
2022-05-24 20:56:53,710 INFO [decode.py:537] Number of model parameters: 78649064
2022-05-24 20:56:53,711 INFO [asr_datamodule.py:422] About to get test-clean cuts
2022-05-24 20:56:53,911 INFO [asr_datamodule.py:427] About to get test-other cuts
2022-05-24 20:56:55,733 INFO [decode.py:391] batch 0/?, cuts processed until now is 123
2022-05-24 20:57:12,763 INFO [decode.py:408] The transcripts are stored in pruned_transducer_stateless2/exp-context-size-4/greedy_search/recogs-test-clean-greedy_search-epoch-29-avg-15-context-4-max-sym-per-frame-1.txt
2022-05-24 20:57:12,846 INFO [utils.py:406] [test-clean-greedy_search] %WER 6.86% [3607 / 52576, 370 ins, 357 del, 2880 sub ]
2022-05-24 20:57:13,071 INFO [decode.py:421] Wrote detailed error stats to pruned_transducer_stateless2/exp-context-size-4/greedy_search/errs-test-clean-greedy_search-epoch-29-avg-15-context-4-max-sym-per-frame-1.txt
2022-05-24 20:57:13,072 INFO [decode.py:438]
For test-clean, WER of different settings are:
greedy_search   6.86    best for test-clean

2022-05-24 20:57:13,953 INFO [decode.py:391] batch 0/?, cuts processed until now is 138
2022-05-24 20:57:29,956 INFO [decode.py:408] The transcripts are stored in pruned_transducer_stateless2/exp-context-size-4/greedy_search/recogs-test-other-greedy_search-epoch-29-avg-15-context-4-max-sym-per-frame-1.txt
2022-05-24 20:57:30,070 INFO [utils.py:406] [test-other-greedy_search] %WER 17.88% [9357 / 52343, 855 ins, 1179 del, 7323 sub ]
2022-05-24 20:57:30,472 INFO [decode.py:421] Wrote detailed error stats to pruned_transducer_stateless2/exp-context-size-4/greedy_search/errs-test-other-greedy_search-epoch-29-avg-15-context-4-max-sym-per-frame-1.txt
2022-05-24 20:57:30,473 INFO [decode.py:438]
For test-other, WER of different settings are:
greedy_search   17.88   best for test-other

2022-05-24 20:57:30,473 INFO [decode.py:565] Done!
(k2-python) luomingshuang@de-74279-k2-train-9-0425111216-65f66bdf4-bkrql:~/codes/icefall-librispeech-pruned-rnnt2-more-states-for-predict/egs/librispeech/ASR$ CUDA_VISIBLE_DEVICES='4' python pruned_transducer_stateless2/decode.py --epoch 29 --avg 15 --decoding-method greedy_search --max-duration 600 --exp-dir pruned_transducer_stateless2/exp-context-size-2 --context-size 2
2022-05-24 20:58:47,810 INFO [decode.py:477] Decoding started
2022-05-24 20:58:47,811 INFO [decode.py:483] Device: cuda:0
2022-05-24 20:58:47,814 INFO [decode.py:493] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.15.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f8d2dba06c000ffee36aab5b66f24e7c9809f116', 'k2-git-date': 'Thu Apr 21 12:20:34 2022', 'lhotse-version': '1.2.0.dev+git.de75634.dirty', 'torch-version': '1.11.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'librispeech-pruned-rnnt2-more-context-for-predict', 'icefall-git-sha1': '2f1e23c-clean', 'icefall-git-date': 'Mon May 23 14:39:11 2022', 'icefall-path': '/ceph-meixu/luomingshuang/icefall', 'k2-path': '/ceph-ms/luomingshuang/k2_latest/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.2.0.dev0+git.de75634.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-9-0425111216-65f66bdf4-bkrql', 'IP address': '10.177.77.9'}, 'epoch': 29, 'iter': 0, 'avg': 15, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp-context-size-2'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'decoding_method': 'greedy_search', 'beam_size': 4, 'beam': 4, 'max_contexts': 4, 'max_states': 8, 'context_size': 2, 'max_sym_per_frame': 1, 'full_libri': True, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 600, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'res_dir': 
PosixPath('pruned_transducer_stateless2/exp-context-size-2/greedy_search'), 'suffix': 'epoch-29-avg-15-context-2-max-sym-per-frame-1', 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2022-05-24 20:58:47,815 INFO [decode.py:495] About to create model
2022-05-24 20:58:48,313 INFO [decode.py:523] averaging ['pruned_transducer_stateless2/exp-context-size-2/epoch-15.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-16.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-17.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-18.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-19.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-20.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-21.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-22.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-23.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-24.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-25.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-26.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-27.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-28.pt', 'pruned_transducer_stateless2/exp-context-size-2/epoch-29.pt']
2022-05-24 20:59:31,302 INFO [decode.py:537] Number of model parameters: 78648040
2022-05-24 20:59:31,302 INFO [asr_datamodule.py:422] About to get test-clean cuts
2022-05-24 20:59:31,546 INFO [asr_datamodule.py:427] About to get test-other cuts
2022-05-24 20:59:33,352 INFO [decode.py:391] batch 0/?, cuts processed until now is 123
2022-05-24 20:59:50,533 INFO [decode.py:408] The transcripts are stored in pruned_transducer_stateless2/exp-context-size-2/greedy_search/recogs-test-clean-greedy_search-epoch-29-avg-15-context-2-max-sym-per-frame-1.txt
2022-05-24 20:59:50,616 INFO [utils.py:406] [test-clean-greedy_search] %WER 6.92% [3636 / 52576, 387 ins, 359 del, 2890 sub ]
2022-05-24 20:59:50,838 INFO [decode.py:421] Wrote detailed error stats to pruned_transducer_stateless2/exp-context-size-2/greedy_search/errs-test-clean-greedy_search-epoch-29-avg-15-context-2-max-sym-per-frame-1.txt
2022-05-24 20:59:50,838 INFO [decode.py:438]
For test-clean, WER of different settings are:
greedy_search   6.92    best for test-clean

2022-05-24 20:59:51,722 INFO [decode.py:391] batch 0/?, cuts processed until now is 138
2022-05-24 21:00:07,752 INFO [decode.py:408] The transcripts are stored in pruned_transducer_stateless2/exp-context-size-2/greedy_search/recogs-test-other-greedy_search-epoch-29-avg-15-context-2-max-sym-per-frame-1.txt
2022-05-24 21:00:07,861 INFO [utils.py:406] [test-other-greedy_search] %WER 17.91% [9373 / 52343, 884 ins, 1128 del, 7361 sub ]
2022-05-24 21:00:08,114 INFO [decode.py:421] Wrote detailed error stats to pruned_transducer_stateless2/exp-context-size-2/greedy_search/errs-test-other-greedy_search-epoch-29-avg-15-context-2-max-sym-per-frame-1.txt
2022-05-24 21:00:08,115 INFO [decode.py:438]
For test-other, WER of different settings are:
greedy_search   17.91   best for test-other

2022-05-24 21:00:08,115 INFO [decode.py:565] Done!
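The %WER lines in the logs follow the usual definition, (insertions + deletions + substitutions) / reference words. Checking the test-clean figure from the context-size=4 run above:

```python
ins, dels, subs = 370, 357, 2880  # error counts from the log line
ref_words = 52576                 # total reference words in test-clean

errors = ins + dels + subs
wer = 100.0 * errors / ref_words
print(errors, round(wer, 2))  # 3607 6.86
```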
csukuangfj commented 2 years ago

Can you try other decoding methods, i.e., modified beam search and fast beam search?

luomingshuang commented 2 years ago

OK, I have added them to the table above.

luomingshuang commented 2 years ago

There are some decoding results with context-size=4 (pruned rnnt2) trained with the full librispeech data:

| context-size | decoding-method | epoch | avg | test-clean | test-other |
|---|---|---|---|---|---|
| 4 | greedy_search | 28 | 12 | 2.65 | 6.20 |
| 4 | modified_beam_search | 28 | 12 | 2.58 | 6.06 |
| 4 | fast_beam_search | 28 | 12 | 2.64 | 6.15 |

While our published results with context-size=2 (pruned rnnt2) are https://github1s.com/k2-fsa/icefall/blob/HEAD/egs/librispeech/ASR/RESULTS.md#L422-L431:

| decoding-method | test-clean | test-other | comment |
|---|---|---|---|
| greedy search (max sym per frame 1) | 2.62 | 6.37 | --epoch 25 --avg 8 --max-duration 600 |
| fast beam search | 2.61 | 6.17 | --epoch 25 --avg 8 --max-duration 600 --decoding-method fast_beam_search |
| modified beam search | 2.59 | 6.19 | --epoch 25 --avg 8 --max-duration 600 --decoding-method modified_beam_search |
| greedy search (max sym per frame 1) | 2.70 | 6.04 | --epoch 34 --avg 10 --max-duration 600 |
| fast beam search | 2.66 | 6.00 | --epoch 34 --avg 10 --max-duration 600 --decoding-method fast_beam_search |
| greedy search (max sym per frame 1) | 2.62 | 6.03 | --epoch 38 --avg 10 --max-duration 600 |
| fast beam search | 2.57 | 5.95 | --epoch 38 --avg 10 --max-duration 600 --decoding-method fast_beam_search |
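The `--avg` flag used in these settings averages the last N epoch checkpoints before decoding (e.g. `--epoch 29 --avg 15` averages `epoch-15.pt` through `epoch-29.pt`, as the decode logs show). A minimal sketch of that idea, using dicts of float lists as stand-ins for the PyTorch state dicts that icefall's averaging code actually operates on:

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of several 'state dicts' (here: dicts of
    lists of floats standing in for parameter tensors)."""
    n = len(state_dicts)
    avg = {}
    for name in state_dicts[0]:
        vals = [sd[name] for sd in state_dicts]
        avg[name] = [sum(col) / n for col in zip(*vals)]
    return avg

# Two toy checkpoints with a single parameter "w":
ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

This is why the choice of epoch/avg pair matters in the table: each row decodes a different averaged model.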

According to the above results, it seems that context-size=4 performs better than context-size=2 on test-other.

danpovey commented 2 years ago

Cool, thanks! It will be interesting to see what context_size=3 is like. I hope larger context size does not make it harder to recognize out-of-domain data though.

luomingshuang commented 2 years ago

OK. I will do some experiments based on context-size=3.