k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

CUDA out of memory in decoding #70

Open Lzhang-hub opened 2 years ago

Lzhang-hub commented 2 years ago

Hi, I am new to icefall. I finished training tdnn_lstm_ctc, but when I run the decoding step I hit the following error. I changed --max-duration, but the errors persist:

[screenshot: CUDA out of memory error]

We set --max-duration=100 and use a Tesla V100-SXM; the GPU info follows:

[screenshot: GPU info]

Would you give me some advice? Thanks.

CSerV commented 2 years ago

100 is still large for max-duration. Maybe you can reduce it to 50, 30, or even less.
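For context, a minimal sketch of what --max-duration controls, assuming the lhotse sampler API the recipe used at the time; the manifest path and numbers are illustrative, not the exact icefall code:

```python
from lhotse import load_manifest
from lhotse.dataset import SingleCutSampler

# Hypothetical manifest path; max_duration caps the total seconds of audio
# per batch, so lowering it shrinks the padded feature tensor sent to the
# network and the lattices built from its output.
cuts = load_manifest("data/fbank/cuts_test-clean.json.gz")
sampler = SingleCutSampler(cuts, max_duration=50.0)

for batch in sampler:
    # each batch now covers at most ~50 seconds of audio in total
    pass
```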

Lzhang-hub commented 2 years ago

> 100 is still large for max-duration. Maybe you can reduce it to 50, 30, or even less.

I have reduced max-duration to 1, but the error still exists.

Lzhang-hub commented 2 years ago

@csukuangfj We have followed your advice (1) and (3), but the problem is not solved. If you can give some other advice, thank you very much! (screenshot attachment 企业微信截图_16336894139504: https://user-images.githubusercontent.com/57925599/136542593-f8c0f1e1-2bc4-44a7-9fa8-bdeb4077d9ea.png)

danpovey commented 2 years ago

You could also mess with the decoding parameters, e.g. reduce the max-active and/or the beam.


csukuangfj commented 2 years ago

https://github.com/k2-fsa/icefall/blob/adb068eb8242fe79dafce5a100c3fdfad934c7a5/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py#L132-L135

You can reduce search_beam, output_beam, or max_active_states.


By the way, does CUDA out of memory abort your decoding process? Does it continue to decode after pruning?
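For reference, a rough sketch of where search_beam, output_beam, and max_active_states end up during lattice generation, assuming decode.py builds its lattice with k2.intersect_dense_pruned; the dummy network output and CTC topology below are illustrative stand-ins, not the recipe's real HLG and model:

```python
import torch
import k2

# Dummy network output: 1 utterance, 20 frames, 10 output classes.
num_classes, num_frames = 10, 20
nnet_output = torch.randn(1, num_frames, num_classes).log_softmax(dim=-1)
supervision_segments = torch.tensor([[0, 0, num_frames]], dtype=torch.int32)

decoding_graph = k2.ctc_topo(max_token=num_classes - 1)  # stand-in for HLG
dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)

lattice = k2.intersect_dense_pruned(
    decoding_graph,
    dense_fsa_vec,
    search_beam=15.0,        # smaller beam -> fewer states survive per frame
    output_beam=5.0,
    min_active_states=30,
    max_active_states=7000,  # hard cap on states kept alive per frame
)
```

Lowering search_beam and max_active_states directly caps how many states (and therefore arcs) survive per frame, which is why it reduces peak GPU memory at some cost in search accuracy.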

Lzhang-hub commented 2 years ago

> https://github.com/k2-fsa/icefall/blob/adb068eb8242fe79dafce5a100c3fdfad934c7a5/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py#L132-L135
>
> You can reduce search_beam, output_beam, or max_active_states.
>
> By the way, does CUDA out of memory abort your decoding process? Does it continue to decode after pruning?

Thanks! I will try decoding with your advice.

The CUDA out of memory errors do not abort my decoding process; decoding completes, but the results are very poor.

danpovey commented 2 years ago

What do you mean by very poor? Is this your own data, or Librispeech?

The model quality and data quality can affect the memory used in decoding.


cdxie commented 2 years ago


@danpovey @csukuangfj Thanks for your reply. We are new to icefall and just ran the LibriSpeech recipe; we finished the training steps, and the above errors occurred in the decoding step. The decoding process can finish, but the WER on test-other is 59.41%. The device we used is an NVIDIA V100 GPU (32 GB), and we followed csukuangfj's advice (1) and (3), but the above errors still occur:

##############
2021-10-09 10:38:49,103 INFO [decode.py:387] Decoding started
2021-10-09 10:38:49,241 INFO [decode.py:388] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 15, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 7000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 150, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 10:38:50,467 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-09 10:38:52,312 INFO [decode.py:397] device: cuda
2021-10-09 10:40:48,429 INFO [decode.py:428] Loading pre-compiled G_4_gram.pt
2021-10-09 10:43:25,546 INFO [decode.py:458] averaging ['tdnn_lstm_ctc/exp/epoch-15.pt', 'tdnn_lstm_ctc/exp/epoch-16.pt', 'tdnn_lstm_ctc/exp/epoch-17.pt', 'tdnn_lstm_ctc/exp/epoch-18.pt', 'tdnn_lstm_ctc/exp/epoch-19.pt']
2021-10-09 10:44:14,941 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 4.38 GiB (GPU 0; 31.75 GiB total capacity; 27.41 GiB already allocated; 365.75 MiB free; 30.23 GiB reserved in total by PyTorch)
2021-10-09 10:44:14,942 INFO [decode.py:732] num_arcs before pruning: 2061527
2021-10-09 10:44:14,977 INFO [decode.py:739] num_arcs after pruning: 113145
2021-10-09 10:44:16,184 INFO [decode.py:336] batch 0/?, cuts processed until now is 18
2021-10-09 10:44:16,944 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 21.89 GiB already allocated; 4.36 GiB free; 26.23 GiB reserved in total by PyTorch)
2021-10-09 10:44:16,944 INFO [decode.py:732] num_arcs before pruning: 2814753
2021-10-09 10:44:16,982 INFO [decode.py:739] num_arcs after pruning: 120129
2021-10-09 10:44:18,624 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 21.80 GiB already allocated; 1.54 GiB free; 29.05 GiB reserved in total by PyTorch)
#########################

We reduced search_beam (20 -> 15) and max_active_states (10000 -> 7000) a moment ago, but the error is the same. We suspect the error could be caused by processing G, and we may follow https://github.com/kaldi-asr/kaldi/pull/4594 to prune G. We can't pinpoint the cause of the error right now, so we need help. Thanks.
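As an aside, the "Caught exception ... num_arcs before/after pruning" lines above indicate decode.py already catches the OOM and retries after pruning the lattice. A rough sketch of that pattern, assuming k2.prune_on_arc_post; the rescore_fn callback and threshold values are hypothetical, not the exact icefall code:

```python
import logging
import k2


def rescore_with_retry(lattice: k2.Fsa, rescore_fn,
                       prune_thresholds=(1e-10, 1e-9, 1e-8, 1e-7, 1e-6)):
    """Try rescore_fn(lattice); on CUDA OOM, prune low-posterior arcs and retry."""
    for th in prune_thresholds:
        try:
            return rescore_fn(lattice)
        except RuntimeError as e:
            if "CUDA out of memory" not in str(e):
                raise
            logging.info(f"Caught exception: {e}")
            logging.info(f"num_arcs before pruning: {lattice.num_arcs}")
            # Drop arcs whose posterior falls below `th` so the next attempt
            # intersects a much smaller lattice with G.
            lattice = k2.prune_on_arc_post(lattice, th, use_double_scores=True)
            logging.info(f"num_arcs after pruning: {lattice.num_arcs}")
    return None  # every attempt ran out of memory
```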

danpovey commented 2 years ago

Hm, can you show the last part of one of the training logs or point to the tensorboard log (tensorboard dev upload --logdir blah/log)? I wonder whether the model is OK.


cdxie commented 2 years ago


@danpovey OK, this is the training log of tdnn_lstm_ctc : tdnn-lstm-ctc-log-train.txt

danpovey commented 2 years ago

Your model did not converge; loss should be something like 0.005, not 0.5. I believe when we ran it, we used --bucketing-sampler=True, that could possibly be the reason. Also we used several GPUs, but that should not really affect convergence I think. (Normally this script converges OK).
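For context, a hedged sketch of what --bucketing-sampler toggles in the recipe's data module, assuming the lhotse samplers available at the time; the manifest path and values are illustrative:

```python
from lhotse import load_manifest
from lhotse.dataset import BucketingSampler, SingleCutSampler

train_cuts = load_manifest("data/fbank/cuts_train-clean-100.json.gz")  # hypothetical path
use_bucketing = True  # what --bucketing-sampler=True requests

if use_bucketing:
    # Buckets group cuts of similar duration, so batches carry little padding
    # and their size statistics stay uniform across training.
    sampler = BucketingSampler(train_cuts, max_duration=200.0, shuffle=True, num_buckets=30)
else:
    sampler = SingleCutSampler(train_cuts, max_duration=200.0, shuffle=True)
```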


cdxie commented 2 years ago


Thanks, I will modify the parameters and run again.

The GPU I used is an NVIDIA A100 (40 GB), single GPU on a single machine. The parameters of the script "./tdnn_lstm_ctc/train.py" were not modified.

danpovey commented 2 years ago

And please show us some sample decoding output; it is written somewhere (aligned output vs. the ref text). I want to see how the model failed. Getting 59% WER is unusual; it would normally be either 100% or close to 0, I'd expect.


cdxie commented 2 years ago


OK, I chose the best results file (lm_scale_0.7) of the tdnn-lstm-ctc model, with decoding parameters { 'search_beam': 15, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 7000 }:

errs-test-clean-lm_scale_0.7.txt
errs-test-other-lm_scale_0.7.txt
recogs-test-clean-lm_scale_0.7.txt
recogs-test-other-lm_scale_0.7.txt
wer-summary-test-clean.txt
wer-summary-test-other.txt

cdxie commented 2 years ago


@danpovey I also ran the LibriSpeech Conformer CTC recipe, using an NVIDIA A100 GPU (40 GB), single GPU on a single machine, with no parameters modified during training. According to your comments, my model may not have converged. What should I change to make the loss converge?

The last part of the Conformer CTC training log:
###########
2021-10-02 04:08:09,315 INFO [train.py:506] Epoch 34, batch 53900, batch avg ctc loss 0.0350, batch avg att loss 0.0249, batch avg loss 0.0279, total avg ctc loss: 0.0481, total avg att loss: 0.0344, total avg loss: 0.0385, batch size: 15
2021-10-02 04:08:14,123 INFO [train.py:506] Epoch 34, batch 53910, batch avg ctc loss 0.0332, batch avg att loss 0.0225, batch avg loss 0.0257, total avg ctc loss: 0.0476, total avg att loss: 0.0339, total avg loss: 0.0380, batch size: 17
2021-10-02 04:08:19,109 INFO [train.py:506] Epoch 34, batch 53920, batch avg ctc loss 0.0688, batch avg att loss 0.0451, batch avg loss 0.0522, total avg ctc loss: 0.0472, total avg att loss: 0.0334, total avg loss: 0.0375, batch size: 18
2021-10-02 04:08:24,021 INFO [train.py:506] Epoch 34, batch 53930, batch avg ctc loss 0.0530, batch avg att loss 0.0363, batch avg loss 0.0413, total avg ctc loss: 0.0476, total avg att loss: 0.0335, total avg loss: 0.0377, batch size: 15
2021-10-02 04:08:29,207 INFO [train.py:506] Epoch 34, batch 53940, batch avg ctc loss 0.0420, batch avg att loss 0.0304, batch avg loss 0.0339, total avg ctc loss: 0.0473, total avg att loss: 0.0339, total avg loss: 0.0379, batch size: 16
2021-10-02 04:08:34,475 INFO [train.py:506] Epoch 34, batch 53950, batch avg ctc loss 0.0518, batch avg att loss 0.0357, batch avg loss 0.0405, total avg ctc loss: 0.0469, total avg att loss: 0.0335, total avg loss: 0.0376, batch size: 16
2021-10-02 04:08:39,350 INFO [train.py:506] Epoch 34, batch 53960, batch avg ctc loss 0.0602, batch avg att loss 0.0414, batch avg loss 0.0471, total avg ctc loss: 0.0465, total avg att loss: 0.0333, total avg loss: 0.0373, batch size: 13
2021-10-02 04:08:44,708 INFO [train.py:506] Epoch 34, batch 53970, batch avg ctc loss 0.0495, batch avg att loss 0.0328, batch avg loss 0.0378, total avg ctc loss: 0.0462, total avg att loss: 0.0330, total avg loss: 0.0370, batch size: 16
2021-10-02 04:08:49,894 INFO [train.py:506] Epoch 34, batch 53980, batch avg ctc loss 0.0661, batch avg att loss 0.0431, batch avg loss 0.0500, total avg ctc loss: 0.0465, total avg att loss: 0.0331, total avg loss: 0.0371, batch size: 15
2021-10-02 04:08:54,981 INFO [train.py:506] Epoch 34, batch 53990, batch avg ctc loss 0.0351, batch avg att loss 0.0310, batch avg loss 0.0323, total avg ctc loss: 0.0469, total avg att loss: 0.0336, total avg loss: 0.0376, batch size: 17
2021-10-02 04:09:01,103 INFO [train.py:506] Epoch 34, batch 54000, batch avg ctc loss 0.0616, batch avg att loss 0.0432, batch avg loss 0.0487, total avg ctc loss: 0.0466, total avg att loss: 0.0334, total avg loss: 0.0374, batch size: 16
2021-10-02 04:10:01,514 INFO [train.py:565] Epoch 34, valid ctc loss 0.0642, valid att loss 0.0416, valid loss 0.0483, best valid loss: 0.0445, best valid epoch: 22
2021-10-02 04:10:06,173 INFO [train.py:506] Epoch 34, batch 54010, batch avg ctc loss 0.0651, batch avg att loss 0.0448, batch avg loss 0.0509, total avg ctc loss: 0.0551, total avg att loss: 0.0359, total avg loss: 0.0416, batch size: 15
2021-10-02 04:10:12,536 INFO [train.py:506] Epoch 34, batch 54020, batch avg ctc loss 0.0393, batch avg att loss 0.0274, batch avg loss 0.0310, total avg ctc loss: 0.0516, total avg att loss: 0.0342, total avg loss: 0.0394, batch size: 20
2021-10-02 04:10:17,708 INFO [train.py:506] Epoch 34, batch 54030, batch avg ctc loss 0.0668, batch avg att loss 0.0434, batch avg loss 0.0504, total avg ctc loss: 0.0497, total avg att loss: 0.0325, total avg loss: 0.0377, batch size: 15
2021-10-02 04:10:23,342 INFO [train.py:506] Epoch 34, batch 54040, batch avg ctc loss 0.0456, batch avg att loss 0.0283, batch avg loss 0.0335, total avg ctc loss: 0.0484, total avg att loss: 0.0340, total avg loss: 0.0383, batch size: 17
2021-10-02 04:10:28,181 INFO [train.py:506] Epoch 34, batch 54050, batch avg ctc loss 0.0455, batch avg att loss 0.0313, batch avg loss 0.0356, total avg ctc loss: 0.0488, total avg att loss: 0.0339, total avg loss: 0.0384, batch size: 14
2021-10-02 04:10:33,627 INFO [train.py:506] Epoch 34, batch 54060, batch avg ctc loss 0.0549, batch avg att loss 0.1210, batch avg loss 0.1011, total avg ctc loss: 0.0488, total avg att loss: 0.0351, total avg loss: 0.0392, batch size: 18
2021-10-02 04:10:38,395 INFO [train.py:506] Epoch 34, batch 54070, batch avg ctc loss 0.0647, batch avg att loss 0.0357, batch avg loss 0.0444, total avg ctc loss: 0.0486, total avg att loss: 0.0346, total avg loss: 0.0388, batch size: 16
2021-10-02 04:10:43,016 INFO [train.py:506] Epoch 34, batch 54080, batch avg ctc loss 0.0360, batch avg att loss 0.0266, batch avg loss 0.0294, total avg ctc loss: 0.0477, total avg att loss: 0.0338, total avg loss: 0.0380, batch size: 14
2021-10-02 04:10:47,858 INFO [train.py:506] Epoch 34, batch 54090, batch avg ctc loss 0.0496, batch avg att loss 0.0290, batch avg loss 0.0352, total avg ctc loss: 0.0474, total avg att loss: 0.0332, total avg loss: 0.0375, batch size: 15
2021-10-02 04:10:52,855 INFO [train.py:506] Epoch 34, batch 54100, batch avg ctc loss 0.0421, batch avg att loss 0.0288, batch avg loss 0.0328, total avg ctc loss: 0.0477, total avg att loss: 0.0342, total avg loss: 0.0382, batch size: 14
2021-10-02 04:10:53,829 INFO [checkpoint.py:62] Saving checkpoint to conformer_ctc/exp/epoch-34.pt
2021-10-02 04:11:36,298 INFO [train.py:708] Done!
#############

cdxie commented 2 years ago

@danpovey @csukuangfj Setting the loss-convergence question aside, the CUDA out-of-memory problems in decoding are still not solved. Could you give us more advice?

danpovey commented 2 years ago

The conformer model logs look normal. Likely the memory usage in decoding is related to the convergence problems of the model. We will rerun the TDNN+LSTM+CTC script locally to make sure there is no problem. @luomingshuang can you do this?

luomingshuang commented 2 years ago

OK, I will do it.

The conformer model logs look normal. Likely the memory usage in decoding is related to the convergence problems of the model. We will rerun the TDNN+LSTM+CTC script locally to make sure there is no problem. @luomingshuang can you do this?

cdxie commented 2 years ago

The conformer model logs look normal. Likely the memory usage in decoding is related to the convergence problems of the model. We will rerun the TDNN+LSTM+CTC script locally to make sure there is no problem. @luomingshuang can you do this?

@danpovey @csukuangfj Sorry to trouble you again. I just ran the decoding steps of conformer-ctc and the same error occurred again (with reduced search_beam and max_active_states). Is this the same cause as for TDNN+LSTM+CTC, or is something wrong with our machine (we use a docker environment)?

################
2021-10-09 20:50:49,123 INFO [decode.py:538] Decoding started
2021-10-09 20:50:49,123 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 13, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 7000, 'use_double_scores': True, 'epoch': 34, 'avg': 20, 'method': 'attention-decoder', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 20:50:49,620 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe/Linv.pt
2021-10-09 20:50:49,795 INFO [decode.py:549] device: cuda
2021-10-09 20:50:57,672 INFO [decode.py:604] Loading pre-compiled G_4_gram.pt
2021-10-09 20:51:08,755 INFO [decode.py:640] averaging ['conformer_ctc/exp/epoch-15.pt', 'conformer_ctc/exp/epoch-16.pt', 'conformer_ctc/exp/epoch-17.pt', 'conformer_ctc/exp/epoch-18.pt', 'conformer_ctc/exp/epoch-19.pt', 'conformer_ctc/exp/epoch-20.pt', 'conformer_ctc/exp/epoch-21.pt', 'conformer_ctc/exp/epoch-22.pt', 'conformer_ctc/exp/epoch-23.pt', 'conformer_ctc/exp/epoch-24.pt', 'conformer_ctc/exp/epoch-25.pt', 'conformer_ctc/exp/epoch-26.pt', 'conformer_ctc/exp/epoch-27.pt', 'conformer_ctc/exp/epoch-28.pt', 'conformer_ctc/exp/epoch-29.pt', 'conformer_ctc/exp/epoch-30.pt', 'conformer_ctc/exp/epoch-31.pt', 'conformer_ctc/exp/epoch-32.pt', 'conformer_ctc/exp/epoch-33.pt', 'conformer_ctc/exp/epoch-34.pt']
2021-10-09 20:51:27,902 INFO [decode.py:653] Number of model parameters: 116147120
2021-10-09 20:51:30,958 INFO [decode.py:474] batch 0/?, cuts processed until now is 2
2021-10-09 20:51:44,208 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 31.75 GiB total capacity; 26.26 GiB already allocated; 1.46 GiB free; 29.13 GiB reserved in total by PyTorch)

2021-10-09 20:51:44,274 INFO [decode.py:732] num_arcs before pruning: 103742 2021-10-09 20:51:44,288 INFO [decode.py:739] num_arcs after pruning: 45225 2021-10-09 20:51:46,104 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.47 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,104 INFO [decode.py:732] num_arcs before pruning: 233253 2021-10-09 20:51:46,116 INFO [decode.py:739] num_arcs after pruning: 90555 2021-10-09 20:51:46,235 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,235 INFO [decode.py:732] num_arcs before pruning: 90555 2021-10-09 20:51:46,247 INFO [decode.py:739] num_arcs after pruning: 90414 2021-10-09 20:51:46,360 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,360 INFO [decode.py:732] num_arcs before pruning: 90414 2021-10-09 20:51:46,370 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:46,482 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,483 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:46,492 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:46,605 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,605 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:46,615 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:46,728 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,728 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:46,739 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:46,853 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,853 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:46,864 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:46,978 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:46,978 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:46,989 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:47,101 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:47,101 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:47,112 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:47,226 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:47,226 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:47,237 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:47,351 INFO [decode.py:731] Caught exception: CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 19.32 GiB already allocated; 2.63 GiB free; 27.96 GiB reserved in total by PyTorch)

2021-10-09 20:51:47,351 INFO [decode.py:732] num_arcs before pruning: 90366 2021-10-09 20:51:47,361 INFO [decode.py:739] num_arcs after pruning: 90366 2021-10-09 20:51:47,361 INFO [decode.py:743] Return None as the resulting lattice is too large Traceback (most recent call last): File "./conformer_ctc/decode.py", line 688, in main() File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, **kwargs) File "./conformer_ctc/decode.py", line 664, in main results_dict = decode_dataset( File "./conformer_ctc/decode.py", line 447, in decode_dataset hyps_dict = decode_one_batch( File "./conformer_ctc/decode.py", line 365, in decode_one_batch best_path_dict = rescore_with_attention_decoder( File "/workspace/icefall/icefall/decode.py", line 812, in rescore_with_attention_decoder nbest = Nbest.from_lattice( File "/workspace/icefall/icefall/decode.py", line 209, in from_lattice saved_scores = lattice.scores.clone() AttributeError: 'NoneType' object has no attribute 'scores' ##################
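
For context, the repeated "Caught exception ... num_arcs before pruning / after pruning" lines come from an OOM-recovery loop in icefall's rescoring code: it catches the CUDA OOM, prunes the lattice, retries, and eventually gives up and returns None, which is what leads to the final "'NoneType' object has no attribute 'scores'" error above. A rough sketch of that pattern (not the exact icefall code; rescore_fn and the threshold schedule are made up for illustration):

import logging

import k2


def run_with_pruning_retries(lattice: k2.Fsa, rescore_fn, max_retries: int = 10):
    """rescore_fn is a hypothetical callable doing the memory-hungry rescoring step."""
    threshold = 1e-9
    for _ in range(max_retries):
        try:
            return rescore_fn(lattice)
        except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError
            logging.info(f"Caught exception:\n{e}")
            logging.info(f"num_arcs before pruning: {lattice.arcs.num_elements()}")
            # Drop arcs with low posterior probability and try again.
            lattice = k2.prune_on_arc_post(lattice, threshold, True)
            logging.info(f"num_arcs after pruning: {lattice.arcs.num_elements()}")
            threshold *= 10  # prune more aggressively on each retry (illustrative)
    logging.info("Return None as the resulting lattice is too large")
    return None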

luomingshuang commented 2 years ago

I suggest that you can use ctc decoding to verify your model according to #71. If the results based on ctc decoding are normal, maybe the problem happens to your language model.
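
For reference, ctc-decoding bypasses the word lexicon and the n-gram LM entirely and decodes against a CTC topology only, so it is a good way to separate acoustic-model problems from language-model problems. A minimal sketch of what it does (assuming nnet_output, supervision_segments, and num_classes are prepared as in conformer_ctc/decode.py; this is not the exact icefall code):

import k2

# nnet_output: (N, T, C) log-probs from the model; supervision_segments: (S, 3) int32 tensor.
ctc_topo = k2.ctc_topo(max_token=num_classes - 1, modified=False, device=nnet_output.device)
dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments, allow_truncate=3)

lattice = k2.intersect_dense_pruned(
    ctc_topo,
    dense_fsa_vec,
    search_beam=20,
    output_beam=8,
    min_active_states=30,
    max_active_states=10000,
)
best_path = k2.shortest_path(lattice, use_double_scores=True)
# The token IDs on best_path are then mapped back to word pieces and joined into words.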

danpovey commented 2 years ago

Looks like it was the attention decoder that failed; the FST decoding worked, I think.

On Saturday, October 9, 2021, Mingshuang Luo @.***> wrote:

I suggest that you can use ctc decoding to verify your model according to #71 https://github.com/k2-fsa/icefall/pull/71. If the results based on ctc decoding are normal, maybe the problem happens to your language model.


Lzhang-hub commented 2 years ago

I suggest that you can use ctc decoding to verify your model according to #71. If the results based on ctc decoding are normal, maybe the problem happens to your language model.

According your advice, we use ctc decoding,but the decoding progress is stuck. The logs are as follows and has been in this state for more than 30 hours. Beside, we tested it on both Tesla V100 and A100-SXM4-40GB, We are not sure whether it is related to the machine configuration, could you please provide your machine configuration?

####### 2021-10-09 23:30:54,225 INFO [decode.py:538] Decoding started 2021-10-09 23:30:54,225 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 25, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'full_libri': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 300, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2} 2021-10-09 23:30:54,539 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_5000/Linv.pt 2021-10-09 23:30:54,830 INFO [decode.py:549] device: cuda 2021-10-09 23:31:30,576 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-25.pt 2021-10-09 23:31:45,498 INFO [decode.py:653] Number of model parameters: 116147120 ########

csukuangfj commented 2 years ago

I am using NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8 with torch 1.7.1

but the decoding progress is stuck.

We have never encountered this issue before.


Could you test the decoding script with a pre-trained model, provided by us (see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)?

$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc
$ cd ..
$ ln -s $PWD/tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt conformer_ctc/exp/epoch-99.pt

And you can pass --epoch 99 --avg 1 when running ./conformer_ctc/decode.py.

If it still gets stuck, there is a higher chance that there are some problems with your configuration.

danpovey commented 2 years ago

Perhaps nvidia-smi would show something.

On Mon, Oct 11, 2021 at 12:13 PM Fangjun Kuang @.***> wrote:

I am using NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8 with torch 1.7.1

but the decoding progress is stuck.

We have never encountered this issue before.

Could you test the decoding script with a pre-trained model, provided by us (see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model )?

$ cd egs/librispeech/ASR $ mkdir tmp $ cd tmp $ git lfs install $ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc $ cd .. $ ln -s $PWD/tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt conformer_ctc/exp/epoch-99.pt

And you can pass --epoch 99 --avg 1 when running ./conformer_ctc/decode.py .

If it still gets stuck, there is a higher chance that there are some problems with your configuration.


danpovey commented 2 years ago

You could perhaps debug by doing:
export CUDA_LAUNCH_BLOCKING=1
gdb --args python3 [program and args]
(gdb) r
... and then do ctrl-c when it gets stuck. The backtrace may be useful. More useful if you built k2 in debug mode.

On Mon, Oct 11, 2021 at 12:27 PM Daniel Povey @.***> wrote:

Perhaps nvidia-smi would show something.


Lzhang-hub commented 2 years ago

I am using NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8 with torch 1.7.1

but the decoding progress is stuck.

We have never encountered this issue before.

Could you test the decoding script with a pre-trained model, provided by us (see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)?

$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc
$ cd ..
$ ln -s $PWD/tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt conformer_ctc/exp/epoch-99.pt

And you can pass --epoch 99 --avg 1 when running ./conformer_ctc/decode.py.

If it still gets stuck, there is a higher chance that there are some problems with your configuration.

I tested the decoding script with a pre-trained model and got the following error:

######### 2021-10-11 16:05:51,556 INFO [decode.py:538] Decoding started 2021-10-11 16:05:51,557 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 99, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'full_libri': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': False, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2} 2021-10-11 16:05:52,080 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe_5000/Linv.pt 2021-10-11 16:05:52,495 INFO [decode.py:549] device: cuda:0 2021-10-11 16:05:56,562 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt Traceback (most recent call last): File "./conformer_ctc/decode.py", line 688, in main() File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context return func(*args, kwargs) File "./conformer_ctc/decode.py", line 633, in main load_checkpoint(f"{params.exp_dir}/epoch-{params.epoch}.pt", model) File "/workspace/icefall/icefall/checkpoint.py", line 93, in load_checkpoint checkpoint = torch.load(filename, map_location="cpu") File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 595, in load return _legacy_load(opened_file, map_location, pickle_module, pickle_load_args) File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 764, in _legacy_load magic_number = pickle_module.load(f, **pickle_load_args) _pickle.UnpicklingError: invalid load key, 'v'. ########

csukuangfj commented 2 years ago

Please make sure that you have run git lfs install.

Also, you can check the file size of pretrained.pt, which should be 443 MB.
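
As a side note, the "_pickle.UnpicklingError: invalid load key, 'v'" above is typical of loading a git-lfs pointer stub (a small text file starting with "version https://git-lfs...") instead of the real checkpoint. A quick check along these lines may help (the helper below is just an illustration, not part of icefall):

from pathlib import Path

def looks_like_lfs_pointer(path: str) -> bool:
    """Return True if `path` is a git-lfs pointer stub rather than a binary checkpoint."""
    with open(path, "rb") as f:
        return f.read(40).startswith(b"version https://git-lfs")

ckpt = "conformer_ctc/exp/epoch-99.pt"
size_mb = Path(ckpt).stat().st_size / 1e6
if looks_like_lfs_pointer(ckpt):
    print(f"{ckpt} is only an lfs pointer ({size_mb:.3f} MB); run git lfs pull in the cloned repo")
else:
    print(f"{ckpt} looks like a real checkpoint ({size_mb:.1f} MB, expected about 443 MB)")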

cdxie commented 2 years ago

Please make sure that you have run git lfs install.

Also, you can check the file size of pretrained.pt, which should be 443 MB.

@danpovey @csukuangfj

We use the model tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt and run: python -m pdb conformer_ctc/decode.py --epoch 99 --avg 1 --method ctc-decoding --max-duration 50. The script is still stuck; following the code, the hang happens in "lattice = k2.intersect_dense_pruned(",

These are the debug steps: #################### python -m pdb conformer_ctc/decode.py --epoch 99 --avg 1 --method ctc-decoding --max-duration 50

icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(19)() -> import argparse (Pdb) b 443 Breakpoint 1 at icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py:443 (Pdb) c 2021-10-11 16:44:25,291 INFO [decode.py:538] Decoding started 2021-10-11 16:44:25,291 INFO [decode.py:539] {'lm_dir': PosixPath('data/lm'), 'subsampling_factor': 4, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'feature_dim': 80, 'nhead': 8, 'attention_dim': 512, 'num_decoder_layers': 6, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 99, 'avg': 1, 'method': 'ctc-decoding', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe'), 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 50, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2} 2021-10-11 16:44:25,971 INFO [lexicon.py:113] Loading pre-compiled data/lang_bpe/Linv.pt 2021-10-11 16:44:26,111 INFO [decode.py:549] device: cuda 2021-10-11 16:44:31,376 INFO [checkpoint.py:92] Loading checkpoint from conformer_ctc/exp/epoch-99.pt 2021-10-11 16:44:32,270 INFO [decode.py:653] Number of model parameters: 116147120 icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(443)decode_dataset() -> results = defaultdict(list) (Pdb) n icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(444)decode_dataset() -> for batch_idx, batch in enumerate(dl): (Pdb) icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(445)decode_dataset() -> texts = batch["supervisions"]["text"] (Pdb) icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(447)decode_dataset() -> hyps_dict = decode_one_batch( (Pdb) p texts ["THE PRESENT CHAPTERS CAN ONLY TOUCH UPON THE MORE SALIENT MOVEMENTS OF THE CIVIL WAR IN KANSAS WHICH HAPPILY WERE NOT SANGUINARY IF HOWEVER THE INDIVIDUAL AND MORE ISOLATED CASES OF BLOODSHED COULD BE DESCRIBED THEY WOULD SHOW A STARTLING AGGREGATE OF BARBARITY AND LOSS OF LIFE FOR OPINION'S SAKE", 'THEN HE RUSHED DOWN STAIRS INTO THE COURTYARD SHOUTING LOUDLY FOR HIS SOLDIERS AND THREATENING TO PATCH EVERYBODY IN HIS DOMINIONS IF THE SAILORMAN WAS NOT RECAPTURED', 'SIR HARRY TOWNE MISTER BARTLEY ALEXANDER THE AMERICAN ENGINEER', 'BUT AT THIS POINT IN THE RAPIDS IT WAS IMPOSSIBLE FOR HIM TO STAY DOWN', 'HAKON THERE SHALL BE YOUR CONSTANT COMPANION FRIEND FARMER'] . . . (Pdb) icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(447)decode_dataset() -> hyps_dict = decode_one_batch( (Pdb) . . . 
-> lattice = get_lattice( (Pdb) s icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(266)decode_one_batch() -> nnet_output=nnet_output, (Pdb) icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(267)decode_one_batch() -> decoding_graph=decoding_graph, (Pdb) icefall/egs/librispeech/ASR_test_1/conformer_ctc/decode.py(268)decode_one_batch() -> supervision_segments=supervision_segments, (Pdb) --Call-- /workspace/icefall/icefall/decode.py(67)get_lattice() -> def get_lattice( (Pdb) s /workspace/icefall/icefall/decode.py(114)get_lattice() -> dense_fsa_vec = k2.DenseFsaVec( (Pdb) n /workspace/icefall/icefall/decode.py(115)get_lattice() -> nnet_output, (Pdb) /workspace/icefall/icefall/decode.py(116)get_lattice() -> supervision_segments, (Pdb) /workspace/icefall/icefall/decode.py(117)get_lattice() -> allow_truncate=subsampling_factor - 1, (Pdb) /workspace/icefall/icefall/decode.py(114)get_lattice() -> dense_fsa_vec = k2.DenseFsaVec( (Pdb) /workspace/icefall/icefall/decode.py(120)get_lattice() -> lattice = k2.intersect_dense_pruned( (Pdb) p dense_fsa_vec <k2.dense_fsa_vec.DenseFsaVec object at 0x7fefd19c9a90> (Pdb) n /workspace/icefall/icefall/decode.py(121)get_lattice() -> decoding_graph, (Pdb) /workspace/icefall/icefall/decode.py(122)get_lattice() -> dense_fsa_vec, (Pdb) /workspace/icefall/icefall/decode.py(123)get_lattice() -> search_beam=search_beam, (Pdb) /workspace/icefall/icefall/decode.py(124)get_lattice() -> output_beam=output_beam, (Pdb) /workspace/icefall/icefall/decode.py(125)get_lattice() -> min_active_states=min_active_states, (Pdb) /workspace/icefall/icefall/decode.py(126)get_lattice() -> max_active_states=max_active_states, (Pdb) /workspace/icefall/icefall/decode.py(120)get_lattice() -> lattice = k2.intersect_dense_pruned( (Pdb)

################

and ctrl-c when it gets stuck: ####### (Pdb)

/workspace/icefall/icefall/decode.py(120)get_lattice() -> lattice = k2.intersect_dense_pruned( (Pdb)

^C Program interrupted. (Use 'cont' to resume). --Call--

/opt/conda/lib/python3.8/bdb.py(321)set_trace() -> def set_trace(self, frame=None): (Pdb) Process Process-1: Traceback (most recent call last): File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 171, in _worker_loop r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get if not self._poll(timeout): File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll return self._poll(timeout) File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll r = wait([self], timeout) File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait ready = selector.select(timeout) File "/opt/conda/lib/python3.8/selectors.py", line 415, in select fd_event_list = self._selector.poll(timeout) File "/opt/conda/lib/python3.8/pdb.py", line 194, in sigint_handler self.set_trace(frame) File "/opt/conda/lib/python3.8/bdb.py", line 321, in set_trace def set_trace(self, frame=None): File "/opt/conda/lib/python3.8/bdb.py", line 90, in trace_dispatch return self.dispatch_call(frame, arg) File "/opt/conda/lib/python3.8/bdb.py", line 135, in dispatch_call if self.quitting: raise BdbQuit bdb.BdbQuit

###########

python3 -m k2.version

Collecting environment information... k2 version: 1.9 Build type: Release Git SHA1: 8694fee66f564cf750792cb30c639d3cc404c18b Git date: Thu Sep 30 15:35:28 2021 Cuda used to build k2: 11.0 cuDNN used to build k2: 8.0.4 Python version used to build k2: 3.8 OS used to build k2: CMake version: 3.18.0 GCC version: 7.5.0 CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow PyTorch version used to build k2: 1.7.1 PyTorch is using Cuda: 11.0 NVTX enabled: True With CUDA: True Disable debug: True Sync kernels : False Disable checks: False

python --version
Python 3.8.5
torch.__version__
'1.7.1'

nvidia-smi

Mon Oct 11 17:38:22 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    40W / 250W |  15620MiB / 16160MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |   1503MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

cdxie commented 2 years ago

@Lzhang-hub can you give the "nvidia-smi" output for the Tesla V100 and the A100-SXM4-40GB?

Lzhang-hub commented 2 years ago

@Lzhang-hub can you give the "nvidia-smi" output for the Tesla V100 and the A100-SXM4-40GB?

Tesla V100 image A100-SXM4-40GB image

csukuangfj commented 2 years ago

Could you follow https://k2-fsa.github.io/k2/installation/for_developers.html to build a debug version of k2?

danpovey commented 2 years ago

Yes, if you do the same with a debug version of k2, and with gdb instead of pdb (and with CUDA_LAUNCH_BLOCKING=1 set in the shell), it may give us more information about the stack.

On Mon, Oct 11, 2021 at 7:07 PM Fangjun Kuang @.***> wrote:

Could you follow https://k2-fsa.github.io/k2/installation/for_developers.html to build a debug version of k2?


cdxie commented 2 years ago

Could you follow https://k2-fsa.github.io/k2/installation/for_developers.html to build a debug version of k2?

OK, I will try the debug version of k2 using gdb. We reduced --max-duration to 10, which lets the decoding finish, so we still cannot reproduce a run with --max-duration=200 or --method attention-decoder.

@Lzhang-hub can you give detailed feedback on the problems we met?

cdxie commented 2 years ago

I suggest that you can use ctc decoding to verify your model according to #71. If the results based on ctc decoding are normal, maybe the problem happens to your language model.

@luomingshuang hi, the language model was downloaded using the given script: image

We reduced --max-duration to 10, which lets decoding finish with --method ctc-decoding. So what could be the problem with the language model?

Lzhang-hub commented 2 years ago

Could you follow https://k2-fsa.github.io/k2/installation/for_developers.html to build a debug version of k2?

Ok, I will try the debug version of k2 using gdb . we reduce the --max-duration to 10, which can finshed the decode. so, we can not reproduce the --max-duration =200 or --method=attention-decoder

@Lzhang-hub you can give detailed feedback on some problem we met

Decoding can be done with --method ctc-decoding and --max-duration 40, but we get an error when increasing --max-duration to 50. With --max-duration=50, test-clean decoding can be done with 4.22% WER, but test-other gets the error:

2021-10-11 20:26:16,792 INFO [decode.py:474] batch 0/?, cuts processed until now is 11 terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError' what(): CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 31.75 GiB total capacity; 26.44 GiB already allocated; 1.75 MiB free; 30.60 GiB reserved in total by PyTorch) Exception raised from malloc at /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f0fcb02a8b2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: + 0x2024b (0x7f0fcb28424b in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: + 0x21064 (0x7f0fcb285064 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #3: + 0x216ad (0x7f0fcb2856ad in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #4: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5d (0x7f0f905256dd in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #5: k2::NewRegion(std::shared_ptr, unsigned long) + 0x175 (0x7f0f90244e65 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #6: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0xea3 (0x7f0f903a1933 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #7: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f0f903a43de in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #8: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f0f904ea0cd in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #9: + 0xbd6df (0x7f10560f56df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #10: + 0x76db (0x7f10734bc6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #11: clone + 0x3f (0x7f10731e571f in /lib/x86_64-linux-gnu/libc.so.6) #######

luomingshuang commented 2 years ago

This may not be an error, just a warning. You can wait for the end of decoding if it does not break; after some minutes you will see the results.

Could you follow https://k2-fsa.github.io/k2/installation/for_developers.html to build a debug version of k2?

Ok, I will try the debug version of k2 using gdb . we reduce the --max-duration to 10, which can finshed the decode. so, we can not reproduce the --max-duration =200 or --method=attention-decoder @Lzhang-hub you can give detailed feedback on some problem we met

the decoding can be done with --method ctc-decoding and --max-duration 40, but get error when increase --max-duration to 50. when --max-duration =50, test-clean decoding can be done with %WER 4.22% ,test-other get the error:

2021-10-11 20:26:16,792 INFO [decode.py:474] batch 0/?, cuts processed until now is 11 terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError' what(): CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 31.75 GiB total capacity; 26.44 GiB already allocated; 1.75 MiB free; 30.60 GiB reserved in total by PyTorch) Exception raised from malloc at /opt/conda/conda-bld/pytorch_1607370172916/work/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f0fcb02a8b2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so) frame #1: + 0x2024b (0x7f0fcb28424b in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #2: + 0x21064 (0x7f0fcb285064 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #3: + 0x216ad (0x7f0fcb2856ad in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so) frame #4: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5d (0x7f0f905256dd in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #5: k2::NewRegion(std::shared_ptrk2::Context, unsigned long) + 0x175 (0x7f0f90244e65 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #6: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0xea3 (0x7f0f903a1933 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #7: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f0f903a43de in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #8: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f0f904ea0cd in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20211005+cuda11.0.torch1.7.1-py3.8-linux-x86_64.egg/libk2context.so) frame #9: + 0xbd6df (0x7f10560f56df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #10: + 0x76db (0x7f10734bc6db in /lib/x86_64-linux-gnu/libpthread.so.0) frame #11: clone + 0x3f (0x7f10731e571f in /lib/x86_64-linux-gnu/libc.so.6) #######

danpovey commented 2 years ago

It might be an issue with the learning rate, interacting with num-gpus=1. We are trying a version with 4x lower learning rate.

On Mon, Oct 11, 2021 at 9:38 PM Mingshuang Luo @.***> wrote:

This may be not a error, just a warning. You can wait for the end of decoding if it not breaks. And wait some minutes, you will see the results.


cdxie commented 2 years ago

This may be not a error, just a warning. You can wait for the end of decoding if it not breaks. And wait some minutes, you will see the results.


@luomingshuang The main problem is that we cannot decode with --max-duration=200, --method attention-decoder, or --method whole-lattice-rescoring on an NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8, and torch 1.7.1, the same setup as yours. We often hit CUDA out of memory.

cdxie commented 2 years ago

It might be an issue with the learning rate, interacting with num-gpus=1. We are trying a version with 4x lower learning rate.

@danpovey, do you mean that with num-gpus=1 the training needs a 4x lower learning rate? We currently cannot use NCCL, so for now we train with num-gpus=1.

pzelasko commented 2 years ago

Yes, typically you'd want to scale the learning rate by the number of GPUs. The intuition is that the larger the batch size, the better the gradient estimate so you can take larger steps in optimization.

EDIT: you can also use gradient accumulation of 4 to make the training with 1 GPU equivalent (but 4x longer).
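
For illustration, a minimal sketch of gradient accumulation over 4 batches (not the actual icefall training loop; model, optimizer, train_dl, and compute_loss are assumed to exist):

accum_grad = 4  # emulate 4 GPUs' worth of data per optimizer step on a single GPU

optimizer.zero_grad()
for batch_idx, batch in enumerate(train_dl):
    loss = compute_loss(model, batch)   # hypothetical helper returning a scalar loss
    (loss / accum_grad).backward()      # scale so the accumulated gradient matches a large batch
    if (batch_idx + 1) % accum_grad == 0:
        optimizer.step()
        optimizer.zero_grad()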

luomingshuang commented 2 years ago

Do your results seem normal? You can use the pretrained.pt from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc for ctc decoding and compare the results based on your model and pretrained.pt. Also, if the model does not converge well, it will lead to this issue. We are checking your problem.

I suggest that you can use ctc decoding to verify your model according to #71. If the results based on ctc decoding are normal, maybe the problem happens to your language model.

@luomingshuang hi, language model was download using the given script: image

we reduce the --max-duration to 10, which can finshed the decode using the --method ctc-decoding. so, what is the problem of language model?

csukuangfj commented 2 years ago

Can we focus on the decoding issue first, as it is not related to the number of GPUs?

The issue is that they cannot decode with a pre-trained model provided by us, even using the same type of GPU, i.e. Tesla V100 with 32 GB RAM.

danpovey commented 2 years ago

Yes, the decoding issue with the pretrained model is more worrying for us as it may indicate a bug, but we don't know how to reproduce as we are using the same GPU. If that is hanging, it would be nice if you could reproduce with a debug version of k2, with CUDA_LAUNCH_BLOCKING=1, and break into it with ctrl-c so we can find out what lambda it is in.

csukuangfj commented 2 years ago

I notice that your k2 is

k2 version: 1.9
Build type: Release
Git SHA1: 8694fee66f564cf750792cb30c639d3cc404c18b
Git date: Thu Sep 30 15:35:28 2021
Cuda used to build k2: 11.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow

Looks like you compiled k2 from source by yourself. If everything goes normal, your CMAKE_CUDA_FLAGS should contain only one type of compute arch, i.e, arch=compute_70,code=sm_70, as you are using Tesla V100. Not sure what's wrong with your configuration.

https://github.com/k2-fsa/icefall/issues/70#issuecomment-939877695 shows that you have two machines with different types of GPUs and CUDA versions:

Have you tried to use the pre-trained model to decode some sound file, not the whole test-clean or test-other dataset, in the above two machines? (Please see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)

cdxie commented 2 years ago

Yes, the decoding issue with the pretrained model is more worrying for us as it may indicate a bug, but we don't know how to reproduce as we are using the same GPU. If that is hanging, it would be nice if you could reproduce with a debug version of k2, with CUDA_LAUNCH_BLOCKING=1, and break into it with ctrl-c so we can find out what lambda it is in.

Can we focus on the decoding issue first, as it is not related to the number of GPUs?

The issue is that they cannot decode with a pre-trained model provided by us, even using the same type of GPU, i.e. Tesla V100 with 32 GB RAM.

@danpovey @csukuangfj @luomingshuang OK, I am trying to use a debug version of k2 to run decoding with the pretrained model; if the same problem happens, I will report back to you. 1) We can now finish the decoding steps with the pretrained model on a Tesla V100 with 32 GB RAM when reducing --max-duration to 10:

--method ctc-decoding (epoch-99.pt, --max-duration=10): test-clean 3.11, test-other 7.66
--method whole-lattice-rescoring (epoch-99.pt, --max-duration=10): test-clean 2.77, test-other 6.30

So the main problem is that --method=attention-decoder or --max-duration > 40 causes CUDA out of memory or a hang.

2) When we use our own conformer model, decoding the test-other set with --method whole-lattice-rescoring gets stuck; maybe our own model does not converge well.

3) We are now using the pre-trained model to decode some sound files as you mentioned, on the above two machines (V100 and A100); we will post the results later.

4) I should explain that we do not compile k2 on the physical machine, because we use a cluster. I installed k2, icefall, and lhotse using a dockerfile; the dockerfile I wrote is attached: Dockerfile.txt Would you help us check it?

cdxie commented 2 years ago

Yes, typically you'd want to scale the learning rate by the number of GPUs. The intuition is that the larger the batch size, the better the gradient estimate so you can take larger steps in optimization.

EDIT: you can also use gradient accumulation of 4 to make the training with 1 GPU equivalent (but 4x longer).

@luomingshuang OK, thanks, I will rerun the training with a scaled learning rate and a larger batch size. Another question: why is the learning rate set to 0 in epoch 0? image

csukuangfj commented 2 years ago

The learning rate is computed by the following formula: https://github.com/k2-fsa/icefall/blob/39bc8cae94cb3b5824a93b5033136fba546322b9/egs/librispeech/ASR/conformer_ctc/transformer.py#L767-L772
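
As a small illustration (the parameter values below are made up, not the recipe's defaults), a Noam-style warm-up schedule of that form looks like this, which also shows why the logged learning rate is essentially 0 at the very beginning of training:

# lr(step) = factor * model_size**(-0.5) * min(step**(-0.5), step * warm_step**(-1.5))
def noam_lr(step, model_size=512, factor=1.0, warm_step=25000):
    step = max(step, 1)  # guard against step == 0
    return factor * model_size ** (-0.5) * min(step ** (-0.5), step * warm_step ** (-1.5))

# During warm-up the rate grows linearly from nearly 0; afterwards it decays as 1/sqrt(step).
for step in (1, 100, 1000, 25000, 100000):
    print(step, noam_lr(step))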

cdxie commented 2 years ago

The learning rate is computed by the following formula:

https://github.com/k2-fsa/icefall/blob/39bc8cae94cb3b5824a93b5033136fba546322b9/egs/librispeech/ASR/conformer_ctc/transformer.py#L767-L772

Thanks, I will run the training steps again.

danpovey commented 2 years ago

For the decoder OOM, you may be able to debug it using code similar to the following, used after a try..except. The main thing I am wondering is: is most of the memory use in linear tensors (1 axis) which would be generated by k2 algorithms, or in tensors with more axes which might come from PyTorch stuff? You could perhaps modify it to compute totals of bytes owned in 1-d vs more-than-1-d tensors.

import torch
import gc

# Walk every object tracked by the garbage collector and print the shape of
# each live tensor (or of objects such as Parameters that wrap a tensor).
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except:
        # Some objects raise on attribute access; just skip them.
        pass
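
Following the suggestion above, one possible extension of that snippet (a sketch, not code from the recipe) sums the bytes held by 1-d CUDA tensors versus tensors with more axes, to see which side dominates the memory:

import gc

import torch

bytes_1d = 0  # bytes in 1-axis CUDA tensors (the kind k2 algorithms tend to produce)
bytes_nd = 0  # bytes in CUDA tensors with 2+ axes (more likely PyTorch activations/weights)
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            nbytes = obj.numel() * obj.element_size()
            if obj.dim() == 1:
                bytes_1d += nbytes
            else:
                bytes_nd += nbytes
    except Exception:
        pass
print(f"1-d tensors: {bytes_1d / 1e6:.1f} MB, >1-d tensors: {bytes_nd / 1e6:.1f} MB")
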
Lzhang-hub commented 2 years ago

I notice that your k2 is

k2 version: 1.9
Build type: Release
Git SHA1: 8694fee66f564cf750792cb30c639d3cc404c18b
Git date: Thu Sep 30 15:35:28 2021
Cuda used to build k2: 11.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow

Looks like you compiled k2 from source by yourself. If everything is normal, your CMAKE_CUDA_FLAGS should contain only one compute arch, i.e., arch=compute_70,code=sm_70, as you are using a Tesla V100. Not sure what's wrong with your configuration.

#70 (comment) shows that you have two machines with different types of GPUs and CUDA versions:

  • Tesla V100, 32 GB, CUDA 11.0
  • A100-SXM4, 40GB, CUDA 11.2

Have you tried to use the pre-trained model to decode some sound file, not the whole test-clean or test-other dataset, in the above two machines? (Please see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)

We used the pre-trained model to decode some sound files according to your advice; the results are the following: