kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Errors in steps/nnet3/chain/get_egs.sh when training TDNN #3486

Closed ben-8878 closed 5 years ago

ben-8878 commented 5 years ago

train errors:
019-07-16 15:52:42,491 [steps/nnet3/chain/train.py:359 - train - INFO ] Generating egs
steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --cmd run.pl --cmvn-opts --norm-means=false --norm-vars=false --online-ivector-dir --left-context 23 --right-context 23 --left-context-initial -1 --right-context-final -1 --left-tolerance 1 --right-tolerance 1 --frame-subsampling-factor 3 --alignment-subsampling-factor 1 --stage 0 --frames-per-iter 1000000 --frames-per-eg 150 --srand 0 data/train_sp exp/chain/tdnn_1c exp/chain_lats_1c exp/chain/tdnn_1c/egs
steps/nnet3/chain/get_egs.sh: creating egs. To ensure they are not deleted later you can do: touch exp/chain/tdnn_1c/egs/.nodelete
steps/nnet3/chain/get_egs.sh: feature type is raw
tree-info exp/chain/tdnn_1c/tree
steps/nnet3/chain/get_egs.sh: working out number of frames of training data
steps/nnet3/chain/get_egs.sh: working out feature dim
steps/nnet3/chain/get_egs.sh: creating 1832 archives, each with 6669 egs, with
steps/nnet3/chain/get_egs.sh: 150 labels per example, and (left,right) context = (23,23)
steps/nnet3/chain/get_egs.sh: Getting validation and training subset examples in background.
steps/nnet3/chain/get_egs.sh: Generating training examples on disk
... Getting subsets of validation examples for diagnostics and combination.
steps/nnet3/chain/get_egs.sh: recombining and shuffling order of archives on disk
run.pl: 142 / 458 failed, log is in exp/chain/tdnn_1c/egs/log/shuffle.*.log
Traceback (most recent call last):
  File "steps/nnet3/chain/train.py", line 622, in main
    train(args, run_opts)
  File "steps/nnet3/chain/train.py", line 385, in train
    stage=args.egs_stage)
  File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 116, in generate_chain_egs
    egs_opts=egs_opts if egs_opts is not None else ''))
  File "steps/libs/common.py", line 157, in execute_command
    p.returncode, command))
Exception: Command exited with status 1: steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --cmd "run.pl" --cmvn-opts "--norm-means=false --norm-vars=false" --online-ivector-dir "" --left-context 23 --right-context 23 --left-context-initial -1 --right-context-final -1 --left-tolerance '1' --right-tolerance '1' --frame-subsampling-factor 3 --alignment-subsampling-factor 1 --stage 0 --frames-per-iter 1000000 --frames-per-eg 150 --srand 0 data/train_sp exp/chain/tdnn_1c exp/chain_lats_1c exp/chain/tdnn_1c/egs

**errors in shuffle.*.log:** `ERROR (nnet3-chain-shuffle-egs[5.5]:ExpectToken():io-funcs.cc:203) Failed to read token [started at file position -1], expected

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::ExpectToken(std::istream&, bool, char const*)
kaldi::chain::Supervision::Read(std::istream&, bool)
kaldi::nnet3::NnetChainSupervision::Read(std::istream&, bool)
kaldi::nnet3::NnetChainExample::Read(std::istream&, bool)
kaldi::KaldiObjectHolder::Read(std::istream&)
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder >::Next()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder >::Next()
main
__libc_start_main
_start

ERROR (nnet3-chain-shuffle-egs[5.5]:ExpectToken():io-funcs.cc:203) Failed to read token [started at file position -1], expected

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::ExpectToken(std::istream&, bool, char const*)
kaldi::chain::Supervision::Read(std::istream&, bool)
kaldi::nnet3::NnetChainSupervision::Read(std::istream&, bool)
kaldi::nnet3::NnetChainExample::Read(std::istream&, bool)
kaldi::KaldiObjectHolder::Read(std::istream&)
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder >::Next()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder >::Next()
main
__libc_start_main
_start

WARNING (nnet3-chain-shuffle-egs[5.5]:Read():util/kaldi-holder-inl.h:84) Exception caught reading Table object.
WARNING (nnet3-chain-shuffle-egs[5.5]:Next():util/kaldi-table-inl.h:574) Object read failed, reading archive standard input
LOG (nnet3-chain-shuffle-egs[5.5]:main():nnet3-chain-shuffle-egs.cc:104) Shuffled order of 3793 neural-network training examples
ERROR (nnet3-chain-shuffle-egs[5.5]:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:678) TableReader: error detected closing archive standard input

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder >::~SequentialTableReader()
main
__libc_start_main
_start

ERROR (nnet3-chain-shuffle-egs[5.5]:~SequentialTableReaderArchiveImpl():util/kaldi-table-inl.h:678) TableReader: error detected closing archive standard input

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder >::~SequentialTableReaderArchiveImpl()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder >::~SequentialTableReader()
main
__libc_start_main
_start

terminate called after throwing an instance of 'std::runtime_error' what(): `
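With 142 of 458 shuffle jobs failing, it can help to check whether every failed log shows the same errors before reading them one by one. The following is a minimal sketch, not part of Kaldi: the function names `summarize_errors` and `summarize_logs` are hypothetical, and the regular expression is an assumption based only on the `ERROR (program[version]:function():file:line) message` lines quoted above.

```python
import glob
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this issue:
#   ERROR (nnet3-chain-shuffle-egs[5.5]:ExpectToken():io-funcs.cc:203) Failed to read token ...
ERROR_RE = re.compile(
    r"ERROR \(([^\[\]:]+)\[[^\]]*\]:([^:]+):([^:]+:\d+)\)\s*(.*)"
)


def summarize_errors(log_text):
    """Count (program, source location, message) triples for ERROR lines in one log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            program, _function, location, message = match.groups()
            counts[(program, location, message.strip())] += 1
    return counts


def summarize_logs(pattern):
    """Return (number of logs containing an ERROR line, combined error counts)."""
    total = Counter()
    failed_logs = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            counts = summarize_errors(handle.read())
        if counts:
            failed_logs += 1
            total.update(counts)
    return failed_logs, total


if __name__ == "__main__":
    # Log location as reported by run.pl in the output above.
    failed, totals = summarize_logs("exp/chain/tdnn_1c/egs/log/shuffle.*.log")
    print(f"{failed} log(s) contain ERROR lines")
    for (program, location, message), n in totals.most_common():
        print(f"{n:6d}  {program}  {location}  {message}")
```

If all failing logs report the same read failure on the input archive, that is consistent with a resource problem on the machine (truncated input) rather than a defect in one particular archive, though it does not prove it.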

danpovey commented 5 years ago

Likely out of memory or out of disk space. Not a bug, and you shouldn't use github for this kind of thing, use kaldi-help, or preferably do a web search.

ben-8878 commented 5 years ago

@danpovey Sorry, I tried web searches many times, but they did not solve it.

danpovey commented 5 years ago

You are running too many jobs in parallel for one machine. By default it will run at most 50 (see the option --max-shuffle-jobs-run to get_egs.sh; you can change it in the script), but that may still be too much. I.e. you are possibly out of memory, or out of disk space (check with df).
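The memory and disk checks suggested here can be sketched in Python as a rough pre-flight check. This is only an illustration under stated assumptions: the helper names are hypothetical, `available_memory_bytes` reads `MemAvailable` from `/proc/meminfo` and so only works on Linux, and the per-job memory figure is a placeholder that would have to be measured on the actual data; it is not a Kaldi default.

```python
import os
import shutil


def free_disk_bytes(path):
    """Free bytes on the filesystem holding `path` (the same figure `df` reports)."""
    return shutil.disk_usage(path).free


def available_memory_bytes():
    """MemAvailable from /proc/meminfo in bytes, or None if it cannot be read (Linux only)."""
    try:
        with open("/proc/meminfo") as handle:
            for line in handle:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # value is reported in kB
    except OSError:
        pass
    return None


def suggest_shuffle_jobs(avail_mem_bytes, mem_per_job_bytes, cpu_count, cap=50):
    """Pick a parallel job count limited by memory, CPU count, and the default cap of 50."""
    by_memory = avail_mem_bytes // mem_per_job_bytes
    return max(1, min(cap, cpu_count, by_memory))


if __name__ == "__main__":
    egs_dir = "exp/chain/tdnn_1c/egs"  # egs directory from the log above
    check_dir = egs_dir if os.path.isdir(egs_dir) else "."
    print(f"free disk: {free_disk_bytes(check_dir) / 2**30:.1f} GiB")

    mem = available_memory_bytes()
    if mem is None:
        print("could not read /proc/meminfo; check memory by hand")
    else:
        # ASSUMPTION: 2 GiB per shuffle job is a placeholder, not a measured value.
        jobs = suggest_shuffle_jobs(mem, 2 * 2**30, os.cpu_count() or 1)
        print(f"available memory: {mem / 2**30:.1f} GiB, suggested jobs: {jobs}")
```

The suggested number is only a starting point for --max-shuffle-jobs-run; the real per-job footprint depends on the archive size, so it is worth watching memory use during the first few jobs.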

ben-8878 commented 5 years ago

@danpovey thank you very much

ben-8878 commented 5 years ago

@danpovey I changed --max-shuffle-jobs-run to 10; it works, and it is faster than with "--max-shuffle-jobs-run=50". Maybe users need to adjust the --max-shuffle-jobs-run parameter according to their own machine. Thank you very much!