kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Failed to write matrix to stream when run DNN training #2850

Closed tingweiwu closed 5 years ago

tingweiwu commented 5 years ago

I ran the thchs30 recipe and everything went well until the DNN training stage, where I got a "Failed to write matrix" error. I don't know why.

#train dnn model
local/nnet/run_dnn.sh --stage 0 --nj $gpu_n  $H/exp/tri4b $H/exp/tri4b_ali $H/exp/tri4b_ali_cv $out || exit 1;
DNN training: stage 0: feature generation
producing fbank for train
steps/make_fbank.sh --nj 1 --cmd run.pl data/fbank/train /model/a7126a64292747a7bd98cfbfb0f21f61//exp/make_fbank/train fbank/train
utils/validate_data_dir.sh: Successfully validated data-directory data/fbank/train
steps/make_fbank.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating filterbank features for train
steps/compute_cmvn_stats.sh data/fbank/train /model/a7126a64292747a7bd98cfbfb0f21f61//exp/fbank_cmvn/train fbank/train
Succeeded creating CMVN stats for train
producing fbank for dev
steps/make_fbank.sh --nj 1 --cmd run.pl data/fbank/dev /model/a7126a64292747a7bd98cfbfb0f21f61//exp/make_fbank/dev fbank/dev
utils/validate_data_dir.sh: Successfully validated data-directory data/fbank/dev
steps/make_fbank.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating filterbank features for dev
steps/compute_cmvn_stats.sh data/fbank/dev /model/a7126a64292747a7bd98cfbfb0f21f61//exp/fbank_cmvn/dev fbank/dev
Succeeded creating CMVN stats for dev
producing fbank for test
steps/make_fbank.sh --nj 1 --cmd run.pl data/fbank/test /model/a7126a64292747a7bd98cfbfb0f21f61//exp/make_fbank/test fbank/test
utils/validate_data_dir.sh: Successfully validated data-directory data/fbank/test
steps/make_fbank.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating filterbank features for test
steps/compute_cmvn_stats.sh data/fbank/test /model/a7126a64292747a7bd98cfbfb0f21f61//exp/fbank_cmvn/test fbank/test
Succeeded creating CMVN stats for test
producing test_fbank_phone
# steps/nnet/train.sh --copy_feats false --cmvn-opts "--norm-means=true --norm-vars=false" --hid-layers 4 --hid-dim 1024 --learn-rate 0.008 data/fbank/train data/fbank/dev data/lang /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali_cv /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn 
# Started at Mon Nov 19 11:44:52 UTC 2018
#
steps/nnet/train.sh --copy_feats false --cmvn-opts --norm-means=true --norm-vars=false --hid-layers 4 --hid-dim 1024 --learn-rate 0.008 data/fbank/train data/fbank/dev data/lang /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali_cv /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn

# INFO
steps/nnet/train.sh : Training Neural Network
     dir       : /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn 
     Train-set : data/fbank/train 10000, /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali 
     CV-set    : data/fbank/dev 893 /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali_cv 

LOG ([5.5.0~1-5b23]:main():cuda-gpu-available.cc:49) 

### IS CUDA GPU AVAILABLE? 'a67467bbbb01450899b988a93cacbac5-single-0' ###
WARNING ([5.5.0~1-5b23]:SelectGpuId():cu-device.cc:203) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:323) Selecting from 8 GPUs
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(0): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(1): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(2): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(3): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(4): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(5): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(6): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:338) cudaSetDevice(7): Tesla V100-PCIE-16GB  free:15722M, used:430M, total:16152M, free/total:0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:385) Trying to select device: 0 (automatically), mem_ratio: 0.973378
LOG ([5.5.0~1-5b23]:SelectGpuIdAuto():cu-device.cc:404) Success selecting device 0 free mem ratio: 0.973378
LOG ([5.5.0~1-5b23]:FinalizeActiveGpu():cu-device.cc:258) The active GPU is [0]: Tesla V100-PCIE-16GB   free:15620M, used:532M, total:16152M, free/total:0.967064 version 7.0
### HURRAY, WE GOT A CUDA GPU FOR COMPUTATION!!! ##

### Testing CUDA setup with a small computation (setup = cuda-toolkit + gpu-driver + kaldi):
### Test OK!

# PREPARING ALIGNMENTS
Using PDF targets from dirs '/code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali' '/code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali_cv'
hmm-info /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali/final.mdl 
copy-transition-model --binary=false /code/1cdbb915397448d9b1316aa25ecaab8a/exp/tri4b_ali/final.mdl /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/final.mdl 
LOG (copy-transition-model[5.5.0~1-5b23]:main():copy-transition-model.cc:62) Copied transition model.

# PREPARING FEATURES
# + 'apply-cmvn' with '--norm-means=true --norm-vars=false' using statistics : data/fbank/train/cmvn.scp, data/fbank/dev/cmvn.scp
feat-to-dim 'ark:copy-feats scp:/model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/train.scp.10k ark:- | apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/fbank/train/utt2spk scp:data/fbank/train/cmvn.scp ark:- ark:- |' - 
copy-feats scp:/model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/train.scp.10k ark:- 
apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/fbank/train/utt2spk scp:data/fbank/train/cmvn.scp ark:- ark:- 
ERROR (apply-cmvn[5.5.0~1-5b23]:Write():kaldi-matrix.cc:1403) Failed to write matrix to stream

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::Matrix<float> const&)
main
__libc_start_main
_start

ERROR (apply-cmvn[5.5.0~1-5b23]:Write():kaldi-matrix.cc:1403) Failed to write matrix to stream

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::Matrix<float> const&)
main
__libc_start_main
_start

WARNING (apply-cmvn[5.5.0~1-5b23]:Write():util/kaldi-holder-inl.h:57) Exception caught writing Table object. 
WARNING (apply-cmvn[5.5.0~1-5b23]:Write():util/kaldi-table-inl.h:1057) Write failure to standard output
ERROR (apply-cmvn[5.5.0~1-5b23]:Write():util/kaldi-table-inl.h:1515) Error in TableWriter::Write

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
main
__libc_start_main
_start

ERROR (apply-cmvn[5.5.0~1-5b23]:Write():util/kaldi-table-inl.h:1515) Error in TableWriter::Write

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
main
__libc_start_main
_start

WARNING (apply-cmvn[5.5.0~1-5b23]:Close():util/kaldi-table-inl.h:1089) Error closing stream: wspecifier is ark:-
ERROR (apply-cmvn[5.5.0~1-5b23]:~TableWriter():util/kaldi-table-inl.h:1539) Error closing TableWriter [in destructor].

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::TableWriter<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~TableWriter()
main
__libc_start_main
_start

ERROR (copy-feats[5.5.0~1-5b23]:Write():kaldi-matrix.cc:1403) Failed to write matrix to stream

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::Matrix<float> const&)
main
__libc_start_main
_start

ERROR (copy-feats[5.5.0~1-5b23]:Write():kaldi-matrix.cc:1403) Failed to write matrix to stream

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::MatrixBase<float>::Write(std::ostream&, bool) const
kaldi::KaldiObjectHolder<kaldi::Matrix<float> >::Write(std::ostream&, bool, kaldi::Matrix<float> const&)
kaldi::TableWriterArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::Matrix<float> const&)
main
__libc_start_main
_start

WARNING (copy-feats[5.5.0~1-5b23]:Write():util/kaldi-holder-inl.h:57) Exception caught writing Table object. 
WARNING (copy-feats[5.5.0~1-5b23]:Write():util/kaldi-table-inl.h:1057) Write failure to standard output
ERROR (copy-feats[5.5.0~1-5b23]:Write():util/kaldi-table-inl.h:1515) Error in TableWriter::Write

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
main
__libc_start_main
_start

ERROR (copy-feats[5.5.0~1-5b23]:Write():util/kaldi-table-inl.h:1515) Error in TableWriter::Write

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
main
__libc_start_main
_start

WARNING (copy-feats[5.5.0~1-5b23]:Close():util/kaldi-table-inl.h:1089) Error closing stream: wspecifier is ark:-
ERROR (copy-feats[5.5.0~1-5b23]:~TableWriter():util/kaldi-table-inl.h:1539) Error closing TableWriter [in destructor].

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::TableWriter<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::~TableWriter()
main
__libc_start_main
_start

sh: line 1:   547 Aborted                 (core dumped) copy-feats scp:/model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/train.scp.10k ark:-
       548                       (core dumped) | apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/fbank/train/utt2spk scp:data/fbank/train/cmvn.scp ark:- ark:-
WARNING (feat-to-dim[5.5.0~1-5b23]:Close():kaldi-io.cc:515) Pipe copy-feats scp:/model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/train.scp.10k ark:- | apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/fbank/train/utt2spk scp:data/fbank/train/cmvn.scp ark:- ark:- | had nonzero return status 34304
# feature dim : 40 (input of 'feature_transform')
# + default 'feature_transform_proto' with splice +/-5 frames,
nnet-initialize --binary=false /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/splice5.proto /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/tr_splice5.nnet 
VLOG[1] (nnet-initialize[5.5.0~1-5b23]:Init():nnet-nnet.cc:314) <Splice> <InputDim> 40 <OutputDim> 440 <BuildVector> -5:5 </BuildVector>
LOG (nnet-initialize[5.5.0~1-5b23]:main():nnet-initialize.cc:63) Written initialized model to /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/tr_splice5.nnet
# feature type : plain
# compute normalization stats from 10k sentences
compute-cmvn-stats ark:- /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/cmvn-g.stats 
nnet-forward --print-args=true --use-gpu=yes /model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/tr_splice5.nnet 'ark:copy-feats scp:/model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/train.scp.10k ark:- | apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/fbank/train/utt2spk scp:data/fbank/train/cmvn.scp ark:- ark:- |' ark:- 
danpovey commented 5 years ago

Possibly disk full.
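One quick way to rule that out (a rough sketch, reusing the paths from the log above) is to check free space on every filesystem the pipeline actually writes to, not just the main data volume:

# check free space on the model/output dir, the alignment dir and the feature dir
df -h /model/a7126a64292747a7bd98cfbfb0f21f61 /code/1cdbb915397448d9b1316aa25ecaab8a data/fbank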

tingweiwu commented 5 years ago

@danpovey The reason may not be a full disk, because we use NFS and still have 40 TB free. Is there any other possible reason? Thanks.

danpovey commented 5 years ago

Check whether /tmp is full; the nnet1 scripts sometimes make use of /tmp. That setup is super out-of-date, though. You'd get better results with the nnet3 chain scripts.
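A quick sanity check (just a sketch; whether your particular run touches /tmp is an assumption here) is to look at both free space and free inodes on /tmp, since either can make writes fail even when the NFS volume has plenty of room:

# free space and free inodes on /tmp
df -h /tmp
df -i /tmp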

poor1017 commented 4 years ago

It's a pipe error, not a space issue.
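One way to confirm where the pipe actually breaks (a debugging sketch; the commands and paths are copied from the log above, and ark:/dev/null is just a stand-in sink so the downstream feat-to-dim reader is out of the picture) is to run the two stages by hand under bash and print each stage's exit status:

# re-run the failing pipe in isolation, discarding the output archive
copy-feats scp:/model/a7126a64292747a7bd98cfbfb0f21f61//exp/tri4b_dnn/train.scp.10k ark:- | \
  apply-cmvn --norm-means=true --norm-vars=false \
    --utt2spk=ark:data/fbank/train/utt2spk \
    scp:data/fbank/train/cmvn.scp ark:- ark:/dev/null
# PIPESTATUS (bash only) holds the exit status of every stage in the pipe
echo "exit codes: ${PIPESTATUS[@]}"

That way you can see which of the two binaries actually fails on its own, instead of both reporting write errors once the downstream reader goes away.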