k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Using BTC/OTC to train Zipformer instead of Conformer #1589

Closed kerolos closed 7 months ago

kerolos commented 7 months ago

Is there any script available to train the latest Zipformer model using Bypass Temporal Classification (BTC) / Omni-temporal Classification (OTC) (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/WSASR) to align speech with text, instead of CTC (https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer_ctc)?
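For context, a minimal sketch (my illustration under assumed inputs, not an existing icefall script; the function name and arguments are hypothetical) of why the requested swap is conceptually small: both CTC and OTC training compute the total log-score of the encoder posteriors intersected with a per-utterance graph, so the encoder (Conformer or Zipformer) can change independently of the loss. OTC differs mainly in the graph, which carries extra star arcs:

```python
import k2
import torch

def graph_based_loss(training_graph: k2.Fsa,
                     log_probs: torch.Tensor,
                     supervision_segments: torch.Tensor) -> torch.Tensor:
    """CTC/OTC-style loss: negative total log-score of the encoder
    posteriors composed with a per-utterance training graph (for OTC,
    a graph augmented with star self-loop/bypass arcs)."""
    dense = k2.DenseFsaVec(log_probs, supervision_segments)
    lattice = k2.intersect_dense(training_graph, dense, output_beam=10.0)
    return -lattice.get_tot_scores(log_semiring=True, use_double_scores=True).sum()
```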

JinZr commented 7 months ago

Sorry, this is not planned so far.

Best regards,
Jin


danpovey commented 7 months ago

@DongjiGao any interest?

kerolos commented 7 months ago

Thanks Dr. Daniel Povey and JinZr. Hello @DongjiGao,

1. Is there any script to create the OTC lang directory based on a phone lexicon instead of BPE (k2-fsa/icefall/tree/master/egs/librispeech/WSASR/local/prepare_otc_lang_bpe.py), since the paper showed better performance with the phone-based lexicon?
2. Could this method be used to clean up the errors present in the training transcripts, similar to the original Kaldi scripts (/egs/wsj/s5/steps/cleanup/clean_and_segment_data.sh or clean_and_segment_data_nnet3.sh)?
3. Is there any plan to extend this to ONNX (decoder with int8)?

Thanks in advance, Kerolos

DongjiGao commented 7 months ago


Thank you for your interest.

  1. I can check the phone-based script if needed. We modified how we model the star token in OTC (as the average probability of all non-blank tokens instead of an individual token; see the sketch after this list), so I would guess the performance would be similar to using BPE.
  2. Yes, please refer to this script for more details. It does "flexible alignment" by replacing suspicious or wrongly inserted tokens with a star, and by placing a star where a word is missing (see the graph sketch below).
  3. We do not have such a plan in the near term. Do you have any suggestions?
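A minimal sketch (my illustration, not the icefall implementation; the helper name is hypothetical) of modeling the star token as the average probability of all non-blank tokens, appended as an extra output symbol:

```python
import math

import torch

def append_star_log_probs(log_probs: torch.Tensor) -> torch.Tensor:
    """log_probs: (batch, time, vocab) log-softmax output; index 0 is
    assumed to be <blk>. Returns log_probs with a star column appended."""
    non_blank = log_probs[:, :, 1:]  # drop the blank column
    n = non_blank.size(-1)
    # log((1/n) * sum_i p_i) = logsumexp(log p_i) - log n, stable in log space
    star = torch.logsumexp(non_blank, dim=-1, keepdim=True) - math.log(n)
    return torch.cat([log_probs, star], dim=-1)  # star becomes the last symbol
```

Since the star is derived from the existing token posteriors rather than trained as its own unit, its behavior should track the token inventory, which is consistent with expecting phone and BPE setups to perform similarly.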

Dongji
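To make item 2's "flexible alignment" concrete, here is a toy OTC-style training graph for a two-word transcript, with assumed token ids (a=1, b=2, star=3) and arbitrary penalties; a star self-loop on each state absorbs insertions, and a star bypass arc stands in for a suspicious or missing word:

```python
import k2

# Acceptor format: "src_state dst_state label score"; the last line is the final state.
s = """
0 1 1 0.0
0 0 3 -0.5
0 1 3 -0.5
1 2 2 0.0
1 1 3 -0.5
1 2 3 -0.5
2 3 -1 0.0
3
"""
graph = k2.Fsa.from_str(s.strip())
print(graph)  # every word arc now has star alternatives carrying a small penalty
```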

kerolos commented 7 months ago

Thanks for the quick response, @DongjiGao.

1. It would be great if you could check the file prepare_otc_lang.py (using the phone-based lexicon) and any related files required to complete the training. I would really appreciate that.
2. It seems quite helpful. I'll definitely give it a try.
3. Perhaps focusing on utilizing Zipformer could be a primary consideration, and then exporting it to ONNX.

BR, Kerolos

kerolos commented 7 months ago

Sorry @DongjiGao for bothering you again: I would like to use the phone-based lexicon script in OTC and then compare the cleaning from this method vs. the original Kaldi method (clean_and_segment_data_nnet3.sh), which I had used with a phone-based setup. Thanks in advance, Dongji Gao.

DongjiGao commented 7 months ago


I will submit a PR by the end of this week.

kerolos commented 6 months ago

Hello @DongjiGao, I got a large WER (1best decoding) when I used a phone-based lexicon, feature_type=ssl, and FP16. Moreover, in the results it seems that some words are skipped or swallowed during decoding, for example:

a) 1688-142285-0006-2603:
ref=['I', "DON'T", 'THINK', 'MISTER', 'HALE', 'YOU', 'HAVE', 'DONE', 'QUITE', 'RIGHT', 'IN', 'INTRODUCING', 'SUCH', 'A', 'PERSON', 'TO', 'US', 'WITHOUT', 'TELLING', 'US', 'WHAT', 'HE', 'HAD', 'BEEN']
hyp=['I', "DON'T", 'THINK', 'YOU', 'HAVE', 'DONE', 'WHAT', 'HE', 'HAD', 'BEEN']

b) 1688-142285-0008-2605:
ref=['HIS', 'FATHER', 'DYING', 'IN', 'MISERABLE', 'CIRCUMSTANCES']
hyp=['HIS', 'MISERABLE', 'CIRCUMSTANCES']

Phone results: [decode_phone.py:473] {'subsampling_factor': 4, 'feature_dim': 768, 'nhead': 8, 'dim_feedforward': 2048, 'encoder_dim': 512, 'num_encoder_layers': 12, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.24.0.dev+git.4f014b1.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'first_run', 'icefall-git-sha1': 'c45e9fec-dirty', 'icefall-git-date': 'Wed Apr 3 05:26:24 2024', 'icefall-path': '/mnt/srv/data/train_am/analysisTD/icefall_kaldi/icefall', 'k2-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/init.py', 'lhotse-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/init.py', 'hostname': 'Hyrican-3', 'IP address': '127.0.1.1'}, 'otc_token': '', 'blank_bias': -4.0, 'epoch': 20, 'iter': 0, 'avg': 5, 'method': '1best', 'use_averaged_model': False, 'num_decoder_layers': 0, 'exp_dir': PosixPath('conformer_ctc2/exp_phone'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm/G_3_gram.fst.txt'), 'full_libri': False, 'mini_libri': False, 'manifest_dir': PosixPath('data/ssl'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'input_strategy': 'PrecomputedFeatures', 'train_manifest': 'librispeech_cuts_train-clean-100.jsonl.gz'}

2024-05-10 13:32:00,131 INFO [decode_phone.py:410] batch 0/?, cuts processed until now is 14
2024-05-10 13:32:16,706 INFO [decode_phone.py:410] batch 100/?, cuts processed until now is 2224
2024-05-10 13:32:18,942 INFO [decode_phone.py:430] The transcripts are stored in conformer_ctc2/exp_phone/recogs-test-clean-no_rescore.txt
2024-05-10 13:32:18,994 INFO [utils.py:656] [test-clean-no_rescore] %WER 74.22% [39021 / 52576, 6 ins, 37768 del, 1247 sub ]
2024-05-10 13:32:19,130 INFO [decode_phone.py:442] Wrote detailed error stats to conformer_ctc2/exp_phone/errs-test-clean-no_rescore.txt
2024-05-10 13:32:19,133 INFO [decode_phone.py:456] For test-clean, WER of different settings are: no_rescore 74.22 best for test-clean

2024-05-10 13:32:19,673 INFO [decode_phone.py:410] batch 0/?, cuts processed until now is 18
2024-05-10 13:32:36,774 INFO [decode_phone.py:410] batch 100/?, cuts processed until now is 2612
2024-05-10 13:32:38,820 INFO [decode_phone.py:430] The transcripts are stored in conformer_ctc2/exp_phone/recogs-test-other-no_rescore.txt
2024-05-10 13:32:38,874 INFO [utils.py:656] [test-other-no_rescore] %WER 79.77% [41756 / 52343, 8 ins, 40086 del, 1662 sub ]
2024-05-10 13:32:39,019 INFO [decode_phone.py:442] Wrote detailed error stats to conformer_ctc2/exp_phone/errs-test-other-no_rescore.txt
2024-05-10 13:32:39,024 INFO [decode_phone.py:456] For test-other, WER of different settings are: no_rescore 79.77 best for test-other

BPE results: {'subsampling_factor': 2, 'feature_dim': 768, 'nhead': 8, 'dim_feedforward': 2048, 'encoder_dim': 512, 'num_encoder_layers': 12, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.23.0.dev+git.1c2a1b5.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'first_run', 'icefall-git-sha1': 'c45e9fec-dirty', 'icefall-git-date': 'Wed Apr 3 05:26:24 2024', 'icefall-path': '/mnt/srv/data/train_am/analysisTD/icefall_kaldi/icefall', 'k2-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/init.py', 'lhotse-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/init.py', 'hostname': 'hyrican-1', 'IP address': '127.0.1.1'}, 'otc_token': '▁', 'blank_bias': -4.0, 'epoch': 20, 'iter': 0, 'avg': 1, 'method': '1best', 'use_averaged_model': False, 'num_decoder_layers': 0, 'exp_dir': PosixPath('conformer_ctc2/exp'), 'lang_dir': PosixPath('data/lang_bpe_200'), 'lm_dir': PosixPath('data/lm/G_3_gram.fst.txt'), 'full_libri': False, 'mini_libri': False, 'manifest_dir': PosixPath('data/ssl'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'input_strategy': 'PrecomputedFeatures', 'train_manifest': 'librispeech_cuts_train-clean-100.jsonl.gz'}

2024-04-14 14:21:47,136 INFO [decode.py:476] batch 0/?, cuts processed until now is 14
2024-04-14 14:22:18,915 INFO [decode.py:476] batch 100/?, cuts processed until now is 2224
2024-04-14 14:22:23,112 INFO [decode.py:496] The transcripts are stored in conformer_ctc2/exp/recogs-test-clean-no_rescore.txt
2024-04-14 14:22:23,178 INFO [utils.py:656] [test-clean-no_rescore] %WER 7.30% [3838 / 52576, 370 ins, 599 del, 2869 sub ]
2024-04-14 14:22:23,321 INFO [decode.py:508] Wrote detailed error stats to conformer_ctc2/exp/errs-test-clean-no_rescore.txt
2024-04-14 14:22:23,325 INFO [decode.py:522] For test-clean, WER of different settings are: no_rescore 7.3 best for test-clean

2024-04-14 14:22:23,962 INFO [decode.py:476] batch 0/?, cuts processed until now is 18
2024-04-14 14:22:57,948 INFO [decode.py:476] batch 100/?, cuts processed until now is 2612
2024-04-14 14:23:02,000 INFO [decode.py:496] The transcripts are stored in conformer_ctc2/exp/recogs-test-other-no_rescore.txt
2024-04-14 14:23:02,063 INFO [utils.py:656] [test-other-no_rescore] %WER 17.93% [9385 / 52343, 741 ins, 1813 del, 6831 sub ]
2024-04-14 14:23:02,200 INFO [decode.py:508] Wrote detailed error stats to conformer_ctc2/exp/errs-test-other-no_rescore.txt
2024-04-14 14:23:02,207 INFO [decode.py:522] For test-other, WER of different settings are: no_rescore 17.93 best for test-other

DongjiGao commented 6 months ago


> Is there anything in the training or decoding parameters I should change to obtain results close to BPE?
> Have you faced a similar situation with the phone-based lexicon, in which the system swallows some words?

Please use subsampling_factor = 2 for SSL features.
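The arithmetic behind this advice, as I understand it (assuming the usual 10 ms fbank hop and a 20 ms SSL hop, e.g., HuBERT/wav2vec 2.0): SSL features arrive at half the frame rate of fbank, so subsampling by 4 leaves too few encoder frames for the token sequences, which the model can only resolve by deleting:

```python
fbank_fps = 100   # 10 ms hop -> 100 feature frames per second
ssl_fps = 50      # 20 ms hop -> 50 feature frames per second

assert fbank_fps / 4 == 25.0  # the usual encoder output rate with fbank
assert ssl_fps / 4 == 12.5    # too few frames relative to the token sequences
assert ssl_fps / 2 == 25.0    # subsampling_factor = 2 restores the usual rate
```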

kerolos commented 6 months ago

Thanks @DongjiGao for your support, and sorry for bothering you again. I have changed subsampling_factor = 2 for SSL features in both training and decoding. With that change, the total loss and OTC loss became very close to BPE (loss[otc_loss], tot_loss[otc_loss]). However, the WER is still very high: it drops from 74.22% to 65.83% [34612 / 52576, 18 ins, 32137 del, 2457 sub] for test-clean, and from 79.77% to 74.65% [39075 / 52343, 27 ins, 35505 del, 3543 sub] for test-other. The number of deletions is still very high.

The parameters used for decoding with the phone-based lexicon: {'subsampling_factor': 2, 'feature_dim': 768, 'nhead': 8, 'dim_feedforward': 2048, 'encoder_dim': 512, 'num_encoder_layers': 12, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.24.0.dev+git.4f014b1.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': True, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'first_run', 'icefall-git-sha1': 'c45e9fec-dirty', 'icefall-git-date': 'Wed Apr 3 05:26:24 2024', 'icefall-path': '/mnt/srv/data/train_am/analysisTD/icefall_kaldi/icefall', 'k2-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/k2/init.py', 'lhotse-path': '/home/ghk/miniconda3/envs/icefall-run/lib/python3.8/site-packages/lhotse/init.py', 'hostname': 'Hyrican-3', 'IP address': '127.0.1.1'}, 'otc_token': '', 'blank_bias': -4.0, 'epoch': 20, 'iter': 0, 'avg': 5, 'method': '1best', 'use_averaged_model': False, 'num_decoder_layers': 0, 'exp_dir': PosixPath('conformer_ctc2/exp_phone'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm/G_3_gram.fst.txt'), 'full_libri': False, 'mini_libri': False, 'manifest_dir': PosixPath('data/ssl'), 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'input_strategy': 'PrecomputedFeatures', 'train_manifest': 'librispeech_cuts_train-clean-100.jsonl.gz'}

The training loss in TensorBoard (BPE-based lexicon in white vs. phone-based lexicon in black):

[TensorBoard screenshot]

tobygodwin commented 5 days ago


Hi @kerolos,

Sorry to dredge up this old comment, but I am seeing similar issues with high deletion rates when training with OTC. Did you get to the bottom of your issue?

DongjiGao commented 2 days ago

@tobygodwin Can you share more details?

DongjiGao commented 2 days ago


Hi @kerolos,

Can you try a different blank_bias during decoding (e.g., -3 or -2)? It looks like the current value (-4) is too small.

Dongji
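A minimal sketch (an assumption about where blank_bias enters decoding, not the recipe's exact code; the helper name is hypothetical) of biasing the blank log-prob before 1best decoding; moving this value shifts the balance between insertions and deletions:

```python
import torch

def apply_blank_bias(nnet_output: torch.Tensor, blank_bias: float = -3.0) -> torch.Tensor:
    """nnet_output: (batch, time, vocab) log-probs; token 0 is assumed to be
    <blk>. Adds blank_bias to the blank scores before building the lattice."""
    out = nnet_output.clone()
    out[:, :, 0] += blank_bias
    return out
```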