mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

WER is nan #66

Closed wentaoxandry closed 5 years ago

wentaoxandry commented 5 years ago

Hello,

I'm trying to run the program with the LRS2 dataset, but at the end I got the following result:

    Decoding eval output out_dnn2
    %WER -nan [ 0 / 0, 0 ins, 0 del, 0 sub ] [PARTIAL] /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/wer_10_0.0

Did I do something wrong, and how can I fix it?

Thank you Wentao

Johe-cqu commented 5 years ago

Hello @wentaoxandry,

Could you please verify that the paths specified in the test section of your cfg file exist (especially lab_data_folder)?

John

wentaoxandry commented 5 years ago

Hello @Johe-cqu,

Thanks for your reply. I have verified that the paths exist; lab_data_folder exists as well.

I think there is something weird in my decoding log files; here is one of them.

    latgen-faster-mapped --min-active=200 --max-active=7000 --max-mem=50000000 --beam=20.0 --lattice-beam=12.0 --acoustic-scale=0.10 --allow-partial=true --word-symbol-table=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/words.txt /home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali_eval/final.mdl /home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/HCLG.fst 'ark,s,cs: cat /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/exp_files/forward_eval_ep23_ck0_out_dnn2_to_decode.ark |' 'ark:|gzip -c > /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/lat.1.gz'
    WARNING (latgen-faster-mapped[5.5.205~1-403c]:ProcessNonemitting():lattice-faster-decoder.cc:850) Error, no surviving tokens: frame is 1
    WARNING (latgen-faster-mapped[5.5.205~1-403c]:PruneTokensForFrame():lattice-faster-decoder.cc:496) No tokens alive [doing pruning]
    WARNING (latgen-faster-mapped[5.5.205~1-403c]:PruneTokensForFrame():lattice-faster-decoder.cc:496) No tokens alive [doing pruning]
    [the warning above is repeated ~20 more times]
    ASSERTION_FAILED (latgen-faster-mapped[5.5.205~1-403c]:PruneForwardLinks():lattice-faster-decoder.cc:351) : 'link_extra_cost == link_extra_cost'

    [ Stack-Trace: ]
    kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
    kaldi::FatalMessageLogger::~FatalMessageLogger()
    kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
    kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl > >, kaldi::decoder::StdToken>::PruneForwardLinks(int, bool*, bool*, float)
    kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl > >, kaldi::decoder::StdToken>::PruneActiveTokens(float)
    kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl > >, kaldi::decoder::StdToken>::Decode(kaldi::DecodableInterface*)
    bool kaldi::DecodeUtteranceLatticeFaster<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl > > >(kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl > >, kaldi::decoder::StdToken>&, kaldi::DecodableInterface&, kaldi::TransitionModel const&, fst::SymbolTable const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, double, bool, bool, kaldi::TableWriter<kaldi::BasicVectorHolder >*, kaldi::TableWriter<kaldi::BasicVectorHolder >*, kaldi::TableWriter*, kaldi::TableWriter*, double*)
    main
    __libc_start_main
    _start

Johe-cqu commented 5 years ago

Hi, @wentaoxandry

Sorry, I have not encountered this situation. “No tokens alive [doing pruning]” looks like a problem with generating the lattices, but I suspect you have prepared the dataset incorrectly. Maybe someone else can help you.

John

wentaoxandry commented 5 years ago

Hello @Johe-cqu,

OK, thank you anyway for your help. By the way, in the log.log file I found this problem:

ls: cannot access '/media/wentao/wentaodisk/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/exp_files/forward_eval_ep_ck_out_dnn2_to_decode.ark': No such file or directory.

I think there is something wrong with the decoder.

Run the decoder

            cmd_decode=cmd+config['decoding']['decoding_script_folder'] +'/'+ config['decoding']['decoding_script']+ ' '+os.path.abspath(config_dec_file)+' '+ out_dec_folder + ' \"'+ files_dec + '\"' 
            run_shell(cmd_decode,log_file)
Johe-cqu commented 5 years ago

Hi, @wentaoxandry

“WER is nan” should not be caused by this problem. That log message appears when, after a run of run_exp.py has already completed, you re-submit python run_exp.py cfg without modifying the cfg file.

You should find the location of the last hmm-info call in the log.log file and then post the few lines of text right above it.

John

wentaoxandry commented 5 years ago

Hi, @Johe-cqu

Here is some info from just before hmm-info:

    ali-to-pdf /media/wentao/wentaodisk/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali//final.mdl 'ark:gunzip -c /media/wentao/wentaodisk/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali//ali.*.gz |' ark:-
    LOG (ali-to-pdf[5.5.205~1419-403c]:main():ali-to-pdf.cc:68) Converted 141636 alignments to pdf sequences.
    WARNING (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:144) Zero count for label 1, this is suspicious.
    WARNING (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:144) Zero count for label 2, this is suspicious.
    LOG (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:194) Summed 141636 int32 vectors to counts, skipped 0 vectors.
    LOG (analyze-counts[5.5.205~1419-403c]:main():analyze-counts.cc:196) Counts written to exp/lrs2_liGRU_fmllr/exp_files/forward_out_dnn2_lab_cd.count

3824

hmm-info /media/wentao/wentaodisk/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali/final.mdl

So what should I do now? Should I rerun the program, or can I fix it some other way?

Thanks

Johe-cqu commented 5 years ago

Hi, @wentaoxandry

The fastest way is to add 1 to N_epochs_tr and re-submit python run_exp.py cfg. Afterwards, upload your log.log.
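
For example, something like this (just a sketch; the cfg path below is a placeholder for whatever config file you passed to run_exp.py in the first place):

    # in the [exp] section of the cfg, increase the epoch count by one,
    # e.g. if it was N_epochs_tr=24:
    N_epochs_tr=25

    # then re-submit with the same (edited) config file
    python run_exp.py cfg/your_experiment.cfg   # placeholder path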

BTW, you'd better upload your cfg file too.

John

wentaoxandry commented 5 years ago

Hi, @Johe-cqu

Thank you so much!

Here is the cfg file:

    [cfg_proto]
    cfg_proto=proto/global.proto
    cfg_proto_chunk=proto/global_chunk.proto

    [exp]
    cmd=
    run_nn_script=run_nn
    out_folder=exp/lrs2_liGRU_fmllr
    seed=1234
    use_cuda=True
    multi_gpu=False
    save_gpumem=False
    N_epochs_tr=24

    [dataset1]
    data_name=pretrain_train
    fea:fea_name=fmllr
        fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/data/cmvn_pretrain_train.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
        cw_left=0
        cw_right=0
    lab:lab_name=lab_cd
        lab_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali/
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/pretrain_train/
        lab_graph=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/
    N_chunks=200

    [dataset2]
    data_name=dev
    fea:fea_name=fmllr
        fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/data/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
        cw_left=0
        cw_right=0
    lab:lab_name=lab_cd
        lab_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali_dev
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/dev/
        lab_graph=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/
    N_chunks=10

    [dataset3]
    data_name=eval
    fea:fea_name=fmllr
        fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/data/cmvn_eval.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
        cw_left=0
        cw_right=0
    lab:lab_name=lab_cd
        lab_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt_ali_eval
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/
        lab_graph=/home/wentao/pytorch_kaldi/kaldi_prepare/exp/tri4a_pt/graph/
    N_chunks=8

    [data_use]
    train_with=pretrain_train
    valid_with=dev
    forward_with=eval

    [batches]
    batch_size_train=16
    max_seq_length_train=500
    increase_seq_length_train=True
    start_seq_len_train=100
    multply_factor_seq_len_train=2
    batch_size_valid=8
    max_seq_length_valid=1000

    [architecture1]
    arch_name = liGRU_layers
    arch_proto = proto/liGRU.proto
    arch_library = neural_networks
    arch_class = liGRU
    arch_pretrain_file = none
    arch_freeze = False
    arch_seq_model = True
    ligru_lay = 550,550,550,550,550
    ligru_drop = 0.2,0.2,0.2,0.2,0.2
    ligru_use_laynorm_inp = False
    ligru_use_batchnorm_inp = False
    ligru_use_laynorm = False,False,False,False,False
    ligru_use_batchnorm = True,True,True,True,True
    ligru_bidir = True
    ligru_act = relu,relu,relu,relu,relu
    ligru_orthinit=True
    arch_lr = 0.0002
    arch_halving_factor = 0.5
    arch_improvement_threshold = 0.001
    arch_opt = rmsprop
    opt_momentum = 0.0
    opt_alpha = 0.95
    opt_eps = 1e-8
    opt_centered = False
    opt_weight_decay = 0.0

    [architecture2]
    arch_name=MLP_layers
    arch_proto=proto/MLP.proto
    arch_library=neural_networks
    arch_class=MLP
    arch_pretrain_file=none
    arch_freeze=False
    arch_seq_model=False
    dnn_lay=N_out_lab_cd
    dnn_drop=0.0
    dnn_use_laynorm_inp=False
    dnn_use_batchnorm_inp=False
    dnn_use_batchnorm=False
    dnn_use_laynorm=False
    dnn_act=softmax
    arch_lr=0.0002
    arch_halving_factor=0.5
    arch_improvement_threshold=0.001
    arch_opt=rmsprop
    opt_momentum=0.0
    opt_alpha=0.95
    opt_eps=1e-8
    opt_centered=False
    opt_weight_decay=0.0

    [model]
    model_proto=proto/model.proto
    model:out_dnn1=compute(liGRU_layers,fmllr)
        out_dnn2=compute(MLP_layers,out_dnn1)
        loss_final=cost_nll(out_dnn2,lab_cd)
        err_final=cost_err(out_dnn2,lab_cd)

    [forward]
    forward_out=out_dnn2
    normalize_posteriors=True
    normalize_with_counts_from=lab_cd
    save_out_file=False
    require_decoding=True

    [decoding]
    decoding_script_folder=kaldi_decoding_scripts
    decoding_script=decode_dnn.sh
    decoding_proto=proto/decoding.proto
    min_active=200
    max_active=7000
    max_mem=50000000
    beam=20.0
    latbeam=12.0
    acwt=0.10
    max_arcs=-1
    skip_scoring=false
    scoring_script=/home/wentao/pytorch_kaldi/kaldi_prepare/local/score.sh
    scoring_opts="--min-lmwt 4 --max-lmwt 23"
    norm_vars=False

Johe-cqu commented 5 years ago

Hi, @wentaoxandry

Why did you set cw_left=0, cw_right=0 and add-deltas --delta-order=0? I think this may cause the “No tokens alive [doing pruning]” warnings. Can you set cw_left=5, cw_right=5 and add-deltas --delta-order=2, and try again?
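
Concretely, that would mean changing the feature part of each [dataset*] block along these lines (a sketch of the suggested edit, shown for the eval set with the paths from your cfg):

    fea:fea_name=fmllr
        fea_lst=/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/utt2spk ark:/home/wentao/pytorch_kaldi/kaldi_prepare/data-fmllr-tri4/eval/data/cmvn_eval.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
        cw_left=5
        cw_right=5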

John

wentaoxandry commented 5 years ago

Hello @Johe-cqu,

Thanks. I'm using fMLLR features, so I thought it was not necessary to use a context window and to add deltas again. But I will try your advice. Thank you very much, it's very helpful.

Wentao

Johe-cqu commented 5 years ago

Hi, @wentaoxandry

Please, let me know if everything is ok.

John

wentaoxandry commented 5 years ago

Hello @Johe-cqu,

I will first try decoding with cw_left=0, cw_right=0, add-deltas --delta-order=0; I think I will get the results tomorrow. If that does not work, I will try your advice. I will let you know as soon as possible whether everything is OK.

Wentao

mravanelli commented 5 years ago

Hi, I don't think cw_left=0, cw_right=0, add-deltas --delta-order=0 is the problem. A more likely cause is that training didn't go well. Could you please post the res.res file in the output_folder?


wentaoxandry commented 5 years ago

Hello @mravanelli,

Thank you for your reply, here is the res.res file:

    ep=00 tr=['pretrain_train'] loss=2.532 err=0.573 valid=dev loss=1.701 err=0.432 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=27674
    ep=01 tr=['pretrain_train'] loss=1.980 err=0.476 valid=dev loss=1.467 err=0.381 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=27512
    ep=02 tr=['pretrain_train'] loss=1.779 err=0.437 valid=dev loss=1.372 err=0.360 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=36300
    ep=03 tr=['pretrain_train'] loss=1.686 err=0.418 valid=dev loss=1.331 err=0.351 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40374
    ep=04 tr=['pretrain_train'] loss=1.640 err=0.409 valid=dev loss=1.295 err=0.342 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=44608
    ep=05 tr=['pretrain_train'] loss=1.608 err=0.402 valid=dev loss=1.281 err=0.340 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=54788
    ep=06 tr=['pretrain_train'] loss=1.582 err=0.397 valid=dev loss=1.269 err=0.337 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=51903
    ep=07 tr=['pretrain_train'] loss=1.561 err=0.393 valid=dev loss=1.245 err=0.333 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=43224
    ep=08 tr=['pretrain_train'] loss=1.544 err=0.389 valid=dev loss=1.241 err=0.332 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=43300
    ep=09 tr=['pretrain_train'] loss=1.529 err=0.386 valid=dev loss=1.232 err=0.330 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40543
    ep=10 tr=['pretrain_train'] loss=1.516 err=0.384 valid=dev loss=1.231 err=0.330 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40526
    ep=11 tr=['pretrain_train'] loss=1.480 err=0.376 valid=dev loss=1.206 err=0.323 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=40495
    ep=12 tr=['pretrain_train'] loss=1.496 err=0.380 valid=dev loss=1.218 err=0.326 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40535
    ep=13 tr=['pretrain_train'] loss=1.464 err=0.373 valid=dev loss=1.202 err=0.322 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=40550
    ep=14 tr=['pretrain_train'] loss=1.481 err=0.376 valid=dev loss=1.209 err=0.323 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=40512
    ep=15 tr=['pretrain_train'] loss=1.451 err=0.370 valid=dev loss=1.193 err=0.320 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=40586
    ep=16 tr=['pretrain_train'] loss=1.469 err=0.374 valid=dev loss=1.204 err=0.322 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=53597
    ep=17 tr=['pretrain_train'] loss=1.440 err=0.368 valid=dev loss=1.196 err=0.320 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=47637
    ep=18 tr=['pretrain_train'] loss=1.459 err=0.372 valid=dev loss=1.201 err=0.320 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=47428
    ep=19 tr=['pretrain_train'] loss=1.431 err=0.366 valid=dev loss=1.187 err=0.318 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=48562
    ep=20 tr=['pretrain_train'] loss=1.451 err=0.370 valid=dev loss=1.203 err=0.321 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=50639
    ep=21 tr=['pretrain_train'] loss=1.423 err=0.364 valid=dev loss=1.181 err=0.317 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=54642
    ep=22 tr=['pretrain_train'] loss=1.443 err=0.369 valid=dev loss=1.194 err=0.319 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=48711
    ep=23 tr=['pretrain_train'] loss=nan err=0.863 valid=dev loss=nan err=0.907 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=46011
    %WER -nan [ 0 / 0, 0 ins, 0 del, 0 sub ] [PARTIAL] /home/wentao/pytorch_kaldi/pytorch-kaldi/exp/lrs2_liGRU_fmllr/decode_eval_out_dnn2/wer_10_0.0

TParcollet commented 5 years ago

From the res.res file we can see that the last losses are nan, so the problem comes from the training. You don't need cw_left and cw_right (you are using RNNs, so they are not mandatory), and the same goes for the derivatives (deltas), which are not needed for fMLLR features. I suggest you restart from scratch with a clean exp directory and see if the problem still arises at the end.
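
For example (just a sketch; out_folder is exp/lrs2_liGRU_fmllr in the cfg you posted, and the cfg path below is a placeholder):

    # move the old experiment folder aside (or delete it) so run_exp.py starts clean
    mv exp/lrs2_liGRU_fmllr exp/lrs2_liGRU_fmllr_old
    python run_exp.py cfg/your_experiment.cfg   # placeholder path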

TParcollet commented 5 years ago

Please also consider updating your version of pytorch-kaldi; we recently fixed a bug with the learning rates.

mravanelli commented 5 years ago

Yes, everything went well except for the final epoch. Also, I would suggest you update your pytorch-kaldi version: we recently fixed a learning rate issue that could be the cause of the problem. Please keep us updated!

Mirco


wentaoxandry commented 5 years ago

Hello @mravanelli @TParcollet,

I have tried again, but I still got loss=nan. Unlike last time, where nan appeared only at the final epoch, this time nan shows up at epoch 18. Here is res.res:

    ep=00 tr=['pretrain_train'] loss=2.533 err=0.573 valid=dev loss=1.686 err=0.427 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=16923
    ep=01 tr=['pretrain_train'] loss=1.980 err=0.476 valid=dev loss=1.464 err=0.380 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=15644
    ep=02 tr=['pretrain_train'] loss=1.778 err=0.436 valid=dev loss=1.370 err=0.359 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=21928
    ep=03 tr=['pretrain_train'] loss=1.685 err=0.418 valid=dev loss=1.329 err=0.351 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=25619
    ep=04 tr=['pretrain_train'] loss=1.640 err=0.409 valid=dev loss=1.300 err=0.344 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=25889
    ep=05 tr=['pretrain_train'] loss=1.607 err=0.402 valid=dev loss=1.280 err=0.340 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=26456
    ep=06 tr=['pretrain_train'] loss=1.581 err=0.397 valid=dev loss=1.273 err=0.337 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28385
    ep=07 tr=['pretrain_train'] loss=1.561 err=0.393 valid=dev loss=1.250 err=0.334 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28423
    ep=08 tr=['pretrain_train'] loss=1.543 err=0.389 valid=dev loss=1.241 err=0.331 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28375
    ep=09 tr=['pretrain_train'] loss=1.529 err=0.386 valid=dev loss=1.223 err=0.327 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28502
    ep=10 tr=['pretrain_train'] loss=1.516 err=0.384 valid=dev loss=1.213 err=0.325 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=29509
    ep=11 tr=['pretrain_train'] loss=1.506 err=0.382 valid=dev loss=1.200 err=0.321 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=28789
    ep=12 tr=['pretrain_train'] loss=1.495 err=0.380 valid=dev loss=1.198 err=0.320 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=24812
    ep=13 tr=['pretrain_train'] loss=1.487 err=0.378 valid=dev loss=1.184 err=0.318 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=24025
    ep=14 tr=['pretrain_train'] loss=1.478 err=0.376 valid=dev loss=1.197 err=0.319 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=23997
    ep=15 tr=['pretrain_train'] loss=1.445 err=0.369 valid=dev loss=1.177 err=0.314 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=24035
    ep=16 tr=['pretrain_train'] loss=231381451.427 err=0.367 valid=dev loss=1.179 err=0.315 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=24003
    ep=17 tr=['pretrain_train'] loss=1.448 err=0.364 valid=dev loss=1.185 err=0.315 lr_architecture1=5e-05 lr_architecture2=5e-05 time(s)=24128
    ep=18 tr=['pretrain_train'] loss=nan err=0.514 valid=dev loss=nan err=0.904 lr_architecture1=2.5e-05 lr_architecture2=2.5e-05 time(s)=24035
    ep=19 tr=['pretrain_train'] loss=nan err=0.956 valid=dev loss=nan err=0.904 lr_architecture1=1.25e-05 lr_architecture2=1.25e-05 time(s)=23997

TParcollet commented 5 years ago

OK, the really weird stuff is actually happening at ep 16: the train loss just exploded, causing the oscillation and the final nan. But why ... @mravanelli do you think we could have a division by zero somewhere? Like during the normalisation or something? Or a problem with batch_norm and ReLU?

Could you try to run this experiment with a very simple MLP (that should be much faster)? If the loss also explodes, the problem is the data. If it does not, it might be the architecture.
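
To give an idea of the kind of change meant here, a rough sketch reusing the MLP_layers architecture that is already in your cfg and feeding it the fmllr features directly (the layer sizes and dropout values below are illustrative only; the MLP example cfgs shipped with pytorch-kaldi are a safer starting point):

    [architecture2]
    arch_name=MLP_layers
    dnn_lay=1024,1024,1024,N_out_lab_cd
    dnn_drop=0.15,0.15,0.15,0.0
    dnn_use_laynorm_inp=False
    dnn_use_batchnorm_inp=False
    dnn_use_batchnorm=True,True,True,False
    dnn_use_laynorm=False,False,False,False
    dnn_act=relu,relu,relu,softmax

    [model]
    model_proto=proto/model.proto
    model:out_dnn1=compute(MLP_layers,fmllr)
        loss_final=cost_nll(out_dnn1,lab_cd)
        err_final=cost_err(out_dnn1,lab_cd)

    [forward]
    forward_out=out_dnn1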

wentaoxandry commented 5 years ago

@TParcollet ok, I will try that

mravanelli commented 5 years ago

Hi, you are probably experiencing some numerical instabilities. You are using the Li-GRU model, right? In all my past experiments I was able to get rid of these numerical issues by coupling the ReLU activation with batch normalization (this is the model you are currently using, according to the config file you sent us). This trick helps a lot, but it cannot give a 100% guarantee of numerical stability. As suggested by Titouan, you can try with an MLP model and make sure that everything works with that. Then you can try to run a recurrent neural network with the standard GRU model, which is numerically very stable. If you still want to try the Li-GRU model, you might try to uncomment the lines in "core.py" related to gradient clipping. Gradient clipping is normally not needed, but it can potentially help in your case.
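
For reference, this is roughly what gradient clipping looks like in PyTorch (a generic, self-contained sketch, not the exact code in core.py; max_norm=4.0 is an arbitrary illustrative value):

    import torch

    # Toy example: one gradient step with gradient clipping.
    model = torch.nn.Linear(10, 5)
    optimizer = torch.optim.RMSprop(model.parameters(), lr=2e-4)

    x = torch.randn(8, 10)
    target = torch.randint(0, 5, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), target)

    optimizer.zero_grad()
    loss.backward()
    # Rescale all gradients so their global L2 norm does not exceed max_norm,
    # which limits the damage a single exploding batch can do.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=4.0)
    optimizer.step()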

Please, keep us updated!

Best,

Mirco


wentaoxandry commented 5 years ago

@mravanelli Thank you for your help, I will try it.