kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

A bug in Kaldi multi_cn run.sh #4718

Open liyuhui opened 2 years ago

liyuhui commented 2 years ago

I am using multi_cn/s5/run.sh to train a model. In the last step, local/chain/run_cnn_tdnn.sh, I get the following error:

run.pl: job failed, log is in exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log
2022-03-28 05:58:19,735 [/opt/kaldi/egs/multi_cn/s5/steps/libs/common.py:236 - background_command_waiter - ERROR ] Command exited with status 1: run.pl --mem 4G --gpu 1 exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log                     nnet3-chain-train --use-gpu=wait                      --apply-deriv-weights=False                     --l2-regularize=0.0 --leaky-hmm-coefficient=0.1                     --read-cache=exp/chain_cleaned/tdnn_cnn_1a_sp/cache.1  --xent-regularize=0.1                                          --print-interval=10 --momentum=0.0                     --max-param-change=2.0                     --backstitch-training-scale=0.0                     --backstitch-training-interval=1                     --l2-regularize-factor=0.3333333333333333 --optimization.memory-compression-level=2                     --srand=1                     "nnet3-am-copy --raw=true --learning-rate=0.00044762974304870596 --scale=1.0 exp/chain_cleaned/tdnn_cnn_1a_sp/1.mdl - |nnet3-copy --edits='set-dropout-proportion name=* proportion=0.0' - - |" exp/chain_cleaned/tdnn_cnn_1a_sp/den.fst                     "ark,bg:nnet3-chain-copy-egs                          --frame-shift=0                         ark:exp/chain_cleaned/tdnn_cnn_1a_sp/egs/cegs.6.ark ark:- |                         nnet3-chain-shuffle-egs --buffer-size=5000                         --srand=1 ark:- ark:- | nnet3-chain-merge-egs                         --minibatch-size=128,64 ark:- ark:- |"                     exp/chain_cleaned/tdnn_cnn_1a_sp/2.3.raw

The log file exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log says:

ERROR (nnet3-chain-train[5.5]:ExecuteCommand():nnet-compute.cc:445) Error running command c247: tdnnf17.relu.Propagate(NULL, m126, &m126)

[ Stack-Trace: ]
/opt/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f9787f912aa]
nnet3-chain-train(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x4115d1]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::ExecuteCommand()+0x13d1) [0x7f9789fef00f]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetComputer::Run()+0x18a) [0x7f9789fef22e]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::TrainInternal(kaldi::nnet3::NnetChainExample const&, kaldi::nnet3::NnetComputation const&)+0x5b) [0x7f978a04116d]
/opt/kaldi/src/lib/libkaldi-nnet3.so(kaldi::nnet3::NnetChainTrainer::Train(kaldi::nnet3::NnetChainExample const&)+0x19d) [0x7f978a0415c5]
nnet3-chain-train(main+0x84d) [0x4103f3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f97870e4830]
nnet3-chain-train(_start+0x29) [0x40fad9]
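
When a training log is long, it can help to pull out just the ERROR lines with their source location before reading the stack trace. The following is a minimal sketch; the regular expression is inferred only from the single ERROR line quoted above, and the helper name `find_errors` is hypothetical, not part of Kaldi.

```python
import re

# Pattern inferred from the one ERROR line quoted in this issue; other Kaldi
# messages may be formatted differently and would then simply be skipped.
KALDI_ERROR = re.compile(
    r"^ERROR \((?P<program>[^\[\]]+)\[(?P<version>[^\]]+)\]:"
    r"(?P<function>.+?):(?P<file>[^:()]+):(?P<line>\d+)\) (?P<message>.*)$"
)


def find_errors(log_text):
    """Return one dict per line that looks like a Kaldi ERROR message."""
    errors = []
    for raw in log_text.splitlines():
        match = KALDI_ERROR.match(raw.strip())
        if match:
            entry = match.groupdict()
            entry["line"] = int(entry["line"])  # source line number as int
            errors.append(entry)
    return errors


# The ERROR line from train.1.3.log, copied from the report above.
sample = (
    "ERROR (nnet3-chain-train[5.5]:ExecuteCommand():nnet-compute.cc:445) "
    "Error running command c247: tdnnf17.relu.Propagate(NULL, m126, &m126)"
)

for err in find_errors(sample):
    print(err["file"], err["line"], err["message"])
```

To scan the real log, read the file first, for example `find_errors(open("exp/chain_cleaned/tdnn_cnn_1a_sp/log/train.1.3.log").read())`. Note that this line only names the command that failed; any lines printed just before it in the log are worth reading as well.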

Would someone be able to help debug this issue?

svenha commented 2 years ago

Just guessing from my own experience. Do you have enough GPU memory and have you set the GPU to exclusive mode?
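
One way to check both points is to query `nvidia-smi` for memory usage and compute mode on each GPU. This is a rough sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration, and the compute-mode strings in the comments are what I would expect the tool to print, not something confirmed in this thread.

```python
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY = "index,memory.used,memory.total,compute_mode"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Memory values are assumed to be in MiB, which is what nvidia-smi
    reports when units are suppressed.
    """
    gpus = []
    for row in text.strip().splitlines():
        if not row.strip():
            continue
        index, used, total, mode = [field.strip() for field in row.split(",")]
        gpus.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
            # Expected values include "Default" and "Exclusive_Process".
            "compute_mode": mode,
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed per-GPU status."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on the training machine right before the failing step shows whether another process already holds most of the memory. If the mode is reported as `Default`, exclusive mode can usually be enabled as root with `nvidia-smi -c EXCLUSIVE_PROCESS`.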

jtrmal commented 2 years ago

Yeah, I would also guess GPU memory or system memory or something like that. y.


stale[bot] commented 2 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.