wentaoxandry closed this issue 5 years ago.
Hello @wentaoxandry
Could you please verify that the paths specified in the test section of your cfg file exist? (especially lab_data_folder)
John
Hallo @Johe-cqu ,
Thanks for your reply. I have verified that the paths exist; lab_data_folder also exists.
I think my decode logfiles contain something weird; here is one of them.
```
latgen-faster-mapped --min-active=200 --max-active=7000 --max-mem=50000000 --beam=20.0 --lattice-beam=12.0 --acoustic-scale=0.10 --allow-partial=true --word-symbol-table=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/words.txt /home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali_eval/final.mdl /home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/HCLG.fst 'ark,s,cs: cat /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/exp_files/forward_eval_ep23_ck0_out_dnn2_to_decode.ark |' 'ark:|gzip -c > /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/lat.1.gz'
WARNING (latgen-faster-mapped[5.5.205~1-403c]:ProcessNonemitting():lattice-faster-decoder.cc:850) Error, no surviving tokens: frame is 1
WARNING (latgen-faster-mapped[5.5.205~1-403c]:PruneTokensForFrame():lattice-faster-decoder.cc:496) No tokens alive [doing pruning]
(the warning above repeats many more times)
ASSERTION_FAILED (latgen-faster-mapped[5.5.205~1-403c]:PruneForwardLinks():lattice-faster-decoder.cc:351) : 'link_extra_cost == link_extra_cost'
[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl
```
Hi, @wentaoxandry
Sorry, I have not encountered this situation. "No tokens alive [doing pruning]" looks like a problem with generating lattices, but I suspect that you have handled the data set incorrectly. Maybe someone else can help you.
John
Hallo @Johe-cqu
OK, but thank you for your help. By the way, in the log.log file I found this problem:
ls: cannot access '/media/wentao/wentaodisk/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/exp_files/forward_eval_ep_ck_out_dnn2_to_decode.ark': No such file or directory.
I think there is something wrong with the decoder.
```python
cmd_decode = (cmd + config['decoding']['decoding_script_folder'] + '/'
              + config['decoding']['decoding_script'] + ' '
              + os.path.abspath(config_dec_file) + ' '
              + out_dec_folder + ' \"' + files_dec + '\"')
run_shell(cmd_decode, log_file)
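For what it's worth, the broken filename in the ls error (`forward_eval_ep_ck_out_dnn2_to_decode.ark`, with nothing after `ep` and `ck`) is exactly what you would get if the epoch and chunk fields were empty when the name was built. The sketch below is a hypothetical reconstruction of the naming pattern; the real variable names in run_exp.py may differ:

```python
# Hypothetical reconstruction of how the forward ark filename could be built;
# the actual code in run_exp.py may differ.
def forward_ark_name(dataset, ep, ck, out_name):
    return "forward_%s_ep%s_ck%s_out_%s_to_decode.ark" % (dataset, ep, ck, out_name)

# With the fields filled in, the name matches the healthy decode log above:
print(forward_ark_name("eval", 23, 0, "dnn2"))   # forward_eval_ep23_ck0_out_dnn2_to_decode.ark
# With empty fields, it degrades to exactly the missing file from log.log:
print(forward_ark_name("eval", "", "", "dnn2"))  # forward_eval_ep_ck_out_dnn2_to_decode.ark
```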
Hi, @wentaoxandry
"WER is nan" should not be caused by this problem. That ls error appears when, after a completed run of run_exp.py, you resubmit python run_exp.py cfg without modifying the cfg file.
You should find the location of the last hmm-info in the log.log file and then send a few lines of text from above it.
John
Hi, @Johe-cqu
Here is some info from just before hmm-info:
```
ali-to-pdf /media/wentao/wentaodisk/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali//final.mdl 'ark:gunzip -c /media/wentao/wentaodisk/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali//ali.*.gz |' ark:-
LOG (ali-to-pdf[5.5.205~1419-403c]:main():ali-to-pdf.cc:68) Converted 141636 alignments to pdf sequences.
WARNING (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:144) Zero count for label 1, this is suspicious.
WARNING (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:144) Zero count for label 2, this is suspicious.
LOG (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:194) Summed 141636 int32 vectors to counts, skipped 0 vectors.
LOG (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:196) Counts written to exp/lrs2_liGRU_fmllr/exp_files/forward_out_dnn2_lab_cd.count
3824
hmm-info /media/wentao/wentaodisk/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali/final.mdl
```
So what should I do now? Rerun the program, or is there somewhere I can fix it?
Thanks
Hi, @wentaoxandry
The fastest way is to add 1 to n_epochs_tr and resubmit python run_exp.py cfg.
Finally, upload your log.log.
BTW, you'd better upload your cfg file too.
John
Hi, @Johe-cqu
Thank you so much!
Here is the cfg file:

```ini
[cfg_proto]
cfg_proto=proto/global.proto
cfg_proto_chunk=proto/global_chunk.proto

[exp]
cmd=
run_nn_script=run_nn
out_folder=exp/lrs2_liGRU_fmllr
seed=1234
use_cuda=True
multi_gpu=False
save_gpumem=False
N_epochs_tr=24

[dataset1]
data_name=pretrain_train
fea = fea_name=fmllr
    fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/feats.scp
    fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/data/cmvn_pretrain_train.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=0
    cw_right=0
lab = lab_name=lab_cd
    lab_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali/
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/
    lab_graph=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/
N_chunks=200

[dataset2]
data_name=dev
fea = fea_name=fmllr
    fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/feats.scp
    fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/data/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=0
    cw_right=0
lab = lab_name=lab_cd
    lab_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali_dev
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/
    lab_graph=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/
N_chunks=10

[dataset3]
data_name=eval
fea = fea_name=fmllr
    fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/feats.scp
    fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/data/cmvn_eval.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
    cw_left=0
    cw_right=0
lab = lab_name=lab_cd
    lab_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali_eval
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/
    lab_graph=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/
N_chunks=8

[data_use]
train_with=pretrain_train
valid_with=dev
forward_with=eval

[batches]
batch_size_train=16
max_seq_length_train=500
increase_seq_length_train=True
start_seq_len_train=100
multply_factor_seq_len_train=2
batch_size_valid=8
max_seq_length_valid=1000

[architecture1]
arch_name = liGRU_layers
arch_proto = proto/liGRU.proto
arch_library = neural_networks
arch_class = liGRU
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = True
ligru_lay = 550,550,550,550,550
ligru_drop = 0.2,0.2,0.2,0.2,0.2
ligru_use_laynorm_inp = False
ligru_use_batchnorm_inp = False
ligru_use_laynorm = False,False,False,False,False
ligru_use_batchnorm = True,True,True,True,True
ligru_bidir = True
ligru_act = relu,relu,relu,relu,relu
ligru_orthinit=True
arch_lr = 0.0002
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = rmsprop
opt_momentum = 0.0
opt_alpha = 0.95
opt_eps = 1e-8
opt_centered = False
opt_weight_decay = 0.0

[architecture2]
arch_name=MLP_layers
arch_proto=proto/MLP.proto
arch_library=neural_networks
arch_class=MLP
arch_pretrain_file=none
arch_freeze=False
arch_seq_model=False
dnn_lay=N_out_lab_cd
dnn_drop=0.0
dnn_use_laynorm_inp=False
dnn_use_batchnorm_inp=False
dnn_use_batchnorm=False
dnn_use_laynorm=False
dnn_act=softmax
arch_lr=0.0002
arch_halving_factor=0.5
arch_improvement_threshold=0.001
arch_opt=rmsprop
opt_momentum=0.0
opt_alpha=0.95
opt_eps=1e-8
opt_centered=False
opt_weight_decay=0.0

[model]
model_proto=proto/model.proto
model = out_dnn1=compute(liGRU_layers,fmllr)
    out_dnn2=compute(MLP_layers,out_dnn1)
    loss_final=cost_nll(out_dnn2,lab_cd)
    err_final=cost_err(out_dnn2,lab_cd)

[forward]
forward_out=out_dnn2
normalize_posteriors=True
normalize_with_counts_from=lab_cd
save_out_file=False
require_decoding=True

[decoding]
decoding_script_folder=kaldi_decoding_scripts
decoding_script=decode_dnn.sh
decoding_proto=proto/decoding.proto
min_active=200
max_active=7000
max_mem=50000000
beam=20.0
latbeam=12.0
acwt=0.10
max_arcs=-1
skip_scoring=false
scoring_script=/home/wentao/pytorch_kaldi/kaldi_prepare/local/score.sh
scoring_opts="--min-lmwt 4 --max-lmwt 23"
norm_vars=False
```
Hi, @wentaoxandry
Why do you set cw_left=0, cw_right=0, and add-deltas --delta-order=0? I think that may cause "No tokens alive [doing pruning]". Can you set cw_left=5, cw_right=5, and add-deltas --delta-order=2, and try again?
John
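Concretely, the suggested change to each dataset section of the cfg would look like this (paths abbreviated here to "..."; only fea_opts, cw_left, and cw_right change, everything else stays as in the cfg above):

```ini
fea_opts=apply-cmvn --utt2spk=ark:.../utt2spk ark:.../cmvn_eval.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
cw_left=5
cw_right=5
```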
Hallo, @Johe-cqu
Thanks. I'm using fMLLR features, so I thought it was not necessary to add context windows and deltas again, but I will try your advice. Thank you very much; it's very helpful.
Wentao
Hi, @wentaoxandry
Please, let me know if everything is ok.
John
Hallo, @Johe-cqu,
I will first try to decode with cw_left=0, cw_right=0, add-deltas=0; I think I will get the results tomorrow. If it doesn't work, I will try your advice. I will let you know whether everything is OK as soon as possible.
Wentao
Hi, I don't think cw_left=0, cw_right=0, add-deltas=0 could be a problem. I think a possible cause is that training didn't go well. Could you please post the res.res file from the output_folder?
Hallo @mravanelli ,
Thank you for your reply; here is the res.res file:
```
ep=00 tr=['pretrain_train'] loss=2.532 err=0.573 valid=dev loss=1.701 err=0.432 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=27674
ep=01 tr=['pretrain_train'] loss=1.980 err=0.476 valid=dev loss=1.467 err=0.381 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=27512
ep=02 tr=['pretrain_train'] loss=1.779 err=0.437 valid=dev loss=1.372 err=0.360 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=36300
ep=03 tr=['pretrain_train'] loss=1.686 err=0.418 valid=dev loss=1.331 err=0.351 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40374
ep=04 tr=['pretrain_train'] loss=1.640 err=0.409 valid=dev loss=1.295 err=0.342 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=44608
ep=05 tr=['pretrain_train'] loss=1.608 err=0.402 valid=dev loss=1.281 err=0.340 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=54788
ep=06 tr=['pretrain_train'] loss=1.582 err=0.397 valid=dev loss=1.269 err=0.337 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=51903
ep=07 tr=['pretrain_train'] loss=1.561 err=0.393 valid=dev loss=1.245 err=0.333 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=43224
ep=08 tr=['pretrain_train'] loss=1.544 err=0.389 valid=dev loss=1.241 err=0.332 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=43300
ep=09 tr=['pretrain_train'] loss=1.529 err=0.386 valid=dev loss=1.232 err=0.330 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40543
ep=10 tr=['pretrain_train'] loss=1.516 err=0.384 valid=dev loss=1.231 err=0.330 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40526
ep=11 tr=['pretrain_train'] loss=1.480 err=0.376 valid=dev loss=1.206 err=0.323 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=40495
ep=12 tr=['pretrain_train'] loss=1.496 err=0.380 valid=dev loss=1.218 err=0.326 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40535
ep=13 tr=['pretrain_train'] loss=1.464 err=0.373 valid=dev loss=1.202 err=0.322 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=40550
ep=14 tr=['pretrain_train'] loss=1.481 err=0.376 valid=dev loss=1.209 err=0.323 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40512
ep=15 tr=['pretrain_train'] loss=1.451 err=0.370 valid=dev loss=1.193 err=0.320 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=40586
ep=16 tr=['pretrain_train'] loss=1.469 err=0.374 valid=dev loss=1.204 err=0.322 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=53597
ep=17 tr=['pretrain_train'] loss=1.440 err=0.368 valid=dev loss=1.196 err=0.320 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=47637
ep=18 tr=['pretrain_train'] loss=1.459 err=0.372 valid=dev loss=1.201 err=0.320 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=47428
ep=19 tr=['pretrain_train'] loss=1.431 err=0.366 valid=dev loss=1.187 err=0.318 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=48562
ep=20 tr=['pretrain_train'] loss=1.451 err=0.370 valid=dev loss=1.203 err=0.321 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=50639
ep=21 tr=['pretrain_train'] loss=1.423 err=0.364 valid=dev loss=1.181 err=0.317 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=54642
ep=22 tr=['pretrain_train'] loss=1.443 err=0.369 valid=dev loss=1.194 err=0.319 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=48711
ep=23 tr=['pretrain_train'] loss=nan err=0.863 valid=dev loss=nan err=0.907 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=46011
%WER -nan [ 0 / 0, 0 ins, 0 del, 0 sub ] [PARTIAL] /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/wer_10_0.0
```
From the res.res file we can see that the last losses are nan. The problem comes from the training. You don't need cw_left and cw_right (you are using RNNs, so they are not mandatory); the same goes for the derivatives (deltas), which are not needed for fMLLR feats. I suggest you restart from scratch with a clean exp directory and see whether the problem still arises at the end.
Please also consider updating your version of pytorch-kaldi; we recently fixed a bug with the learning rates.
Yes, everything went well except the final epoch. Also, I would suggest you update your pytorch-kaldi version. We recently fixed a learning rate issue that could be the cause of the problem. Please keep us updated!
Mirco
Hallo @mravanelli @TParcollet ,
I have tried again, but I still got loss = nan. Unlike last time, when nan appeared only at the final epoch, this time it already appears at epoch 18. Here is res.res:

```
ep=00 tr=['pretrain_train'] loss=2.533 err=0.573 valid=dev loss=1.686 err=0.427 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=16923
ep=01 tr=['pretrain_train'] loss=1.980 err=0.476 valid=dev loss=1.464 err=0.380 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=15644
ep=02 tr=['pretrain_train'] loss=1.778 err=0.436 valid=dev loss=1.370 err=0.359 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=21928
ep=03 tr=['pretrain_train'] loss=1.685 err=0.418 valid=dev loss=1.329 err=0.351 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=25619
ep=04 tr=['pretrain_train'] loss=1.640 err=0.409 valid=dev loss=1.300 err=0.344 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=25889
ep=05 tr=['pretrain_train'] loss=1.607 err=0.402 valid=dev loss=1.280 err=0.340 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=26456
ep=06 tr=['pretrain_train'] loss=1.581 err=0.397 valid=dev loss=1.273 err=0.337 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28385
ep=07 tr=['pretrain_train'] loss=1.561 err=0.393 valid=dev loss=1.250 err=0.334 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28423
ep=08 tr=['pretrain_train'] loss=1.543 err=0.389 valid=dev loss=1.241 err=0.331 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28375
ep=09 tr=['pretrain_train'] loss=1.529 err=0.386 valid=dev loss=1.223 err=0.327 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28502
ep=10 tr=['pretrain_train'] loss=1.516 err=0.384 valid=dev loss=1.213 err=0.325 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=29509
ep=11 tr=['pretrain_train'] loss=1.506 err=0.382 valid=dev loss=1.200 err=0.321 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28789
ep=12 tr=['pretrain_train'] loss=1.495 err=0.380 valid=dev loss=1.198 err=0.320 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=24812
ep=13 tr=['pretrain_train'] loss=1.487 err=0.378 valid=dev loss=1.184 err=0.318 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=24025
ep=14 tr=['pretrain_train'] loss=1.478 err=0.376 valid=dev loss=1.197 err=0.319 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=23997
ep=15 tr=['pretrain_train'] loss=1.445 err=0.369 valid=dev loss=1.177 err=0.314 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=24035
ep=16 tr=['pretrain_train'] loss=231381451.427 err=0.367 valid=dev loss=1.179 err=0.315 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=24003
ep=17 tr=['pretrain_train'] loss=1.448 err=0.364 valid=dev loss=1.185 err=0.315 lr_architecture1=5e-05 lr_architecture2=5e-05 time(s)=24128
ep=18 tr=['pretrain_train'] loss=nan err=0.514 valid=dev loss=nan err=0.904 lr_architecture1=2.5e-05 lr_architecture2=2.5e-05 time(s)=24035
ep=19 tr=['pretrain_train'] loss=nan err=0.956 valid=dev loss=nan err=0.904 lr_architecture1=1.25e-05 lr_architecture2=1.25e-05 time(s)=23997
```
OK, the really weird stuff actually happens at ep 16: the train loss just exploded, causing the oscillation and the final nan. But why... @mravanelli, do you think we could have a division by zero somewhere? Like during the normalisation or something? Or a problem with batch_norm and ReLU?
Could you try to run this experiment with a very simple MLP (that should be way faster)? If it also explodes, it is because of the data. If it does not, it might be because of the architecture.
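To illustrate the kind of float32 blow-up being hypothesized above (a generic numerical example, not the actual pytorch-kaldi code): an unguarded softmax overflows in single precision, and the log of the result turns into nan, while the standard log-sum-exp form stays finite:

```python
import numpy as np

np.seterr(all="ignore")  # silence the expected overflow/invalid warnings in the naive version

def naive_log_softmax(x):
    # No max-subtraction: exp() overflows to inf in float32 for large inputs,
    # and inf/inf then produces nan.
    e = np.exp(x)
    return np.log(e / e.sum())

def stable_log_softmax(x):
    # Standard log-sum-exp trick: subtracting the max keeps everything finite.
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

x = np.array([10.0, 1000.0], dtype=np.float32)
print(naive_log_softmax(x))   # contains nan
print(stable_log_softmax(x))  # finite values
```

A single oversized activation (like the 231381451.427 train loss at ep 16) is enough to push an unguarded exp/log chain into this regime.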
@TParcollet ok, I will try that
Hi, you are probably experiencing some numerical instabilities. You are using the Li-GRU model, right? In all my past experiments I was able to get rid of these numerical issues by coupling the ReLU activation with batch normalization (this is the model you are currently using, according to the config file you sent us). This trick helps a lot, but cannot give a 100% guarantee of numerical stability. As suggested by Titouan, you can try with an MLP model and make sure that everything works with that. Then, you can try to run a recurrent neural network with the standard GRU model, which is numerically very stable. If you still want to try the Li-GRU model, you might try to uncomment the lines in "core.py" related to gradient clipping. Gradient clipping is normally not needed, but can potentially help in your case.
Please, keep us updated!
Best,
Mirco
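For reference, the effect of gradient clipping by global norm can be sketched as follows (a plain-Python illustration of the idea; in PyTorch code it would typically be a `torch.nn.utils.clip_grad_norm_` call, and the exact lines in core.py may differ):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Return the gradient vectors rescaled so that their global L2 norm
    does not exceed max_norm; gradients below the threshold pass through
    unchanged (the same idea as torch.nn.utils.clip_grad_norm_)."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads

# A gradient of norm 5.0 gets scaled down to (approximately) norm 1.0:
clipped = clip_grad_norm([[3.0, 4.0]], max_norm=1.0)
```

This caps the size of a single bad update, which is exactly why it can mask the kind of loss explosion seen at ep 16.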
@mravanelli Thank you for your help, I will try it.
Hallo,
I'm trying to run the program with the LRS2 dataset, but at the end I got the following result:
Decoding eval output out_dnn2
%WER -nan [ 0 / 0, 0 ins, 0 del, 0 sub ] [PARTIAL] /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/wer_10_0.0
I want to ask: did I do something wrong somewhere, and how can I fix it?
Thank you,
Wentao