mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

Problem in decoding new dataset - decode_dnn.sh not executed #169

Closed emirdemirel closed 5 years ago

emirdemirel commented 5 years ago

Hi! First of all, thanks for this great repository. It is well documented and extremely useful for investigating state-of-the-art end-to-end / seq2seq models for ASR.

I am trying to apply the overall ASR pipeline to my own dataset, a karaoke (singing voice) dataset with lyrics, which is formatted so that it can be processed with the Kaldi toolkit. I have tried out a few recipes within the toolkit, including DNN setups, and they worked fine. More recently, I have been trying the end-to-end systems provided in this repository.

Currently, I am facing problems in the decoding procedure, the details of which I share below. To summarize: the training and validation processes complete without any errors in the log files, yet the pipeline skips the decoding part. Below I share a screenshot of the terminal output as well as part of my log file.

Based on my investigation of the issue so far, my intuition is that the 'kaldi_decoding_scripts/decode_dnn.sh' script is not being executed. Do you have any idea what the cause might be?

Kind regards,

Here is the screenshot:

[Screenshot: sc1] https://user-images.githubusercontent.com/23708924/65969801-b0c45f00-e465-11e9-83b4-772b6eddd342.png

Here is part of my log file (similar warnings appear throughout the log file):

WARNING (latgen-faster-mapped[5.5.324~1-f267]:DeterminizeLatticePruned():determinize-lattice-pruned.cc:1280) Effective beam 1.90675 was less than beam 12 cutoff 0.5, pruning raw lattice with new beam 6 and retrying.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
WARNING (latgen-faster-mapped[5.5.324~1-f267]:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (31839168,11136,21878496), after rebuilding, repo size was 20536000, effective beam was 4.01572 vs. requested beam 6
WARNING (latgen-faster-mapped[5.5.324~1-f267]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:277) Determinization finished earlier than the beam for utterance F1363335124-408330101_112743-825492545_1617581314-GB-F-009
LOG (latgen-faster-mapped[5.5.324~1-f267]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) Log-like per frame for utterance F1363335124-408330101_112743-825492545_1617581314-GB-F-009 is 0.0850855 over 1726 frames.
M742835985-366321445_101397-423278228_1606190422-GB-M-014
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
WARNING (latgen-faster-mapped[5.5.324~1-f267]:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (32808320,2112,29604480), after rebuilding, repo size was 18846112, effective beam was 1.18036 vs. requested beam 12
WARNING (latgen-faster-mapped[5.5.324~1-f267]:DeterminizeLatticePruned():determinize-lattice-pruned.cc:1280) Effective beam 1.18036 was less than beam 12 cutoff 0.5, pruning raw lattice with new beam 6 and retrying.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
LOG (latgen-faster-mapped[5.5.324~1-f267]:RebuildRepository():determinize-lattice-pruned.cc:283) Rebuilding repository.
WARNING (latgen-faster-mapped[5.5.324~1-f267]:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (29341920,11136,21124488), after rebuilding, repo size was 20339072, effective beam was 4.28016 vs. requested beam 6
WARNING (latgen-faster-mapped[5.5.324~1-f267]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:277) Determinization finished earlier than the beam for utterance M742835985-366321445_101397-423278228_1606190422-GB-M-014
LOG (latgen-faster-mapped[5.5.324~1-f267]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) Log-like per frame for utterance M742835985-366321445_101397-423278228_1606190422-GB-M-014 is 0.136956 over 2311 frames.
LOG (latgen-faster-mapped[5.5.324~1-f267]:main():latgen-faster-mapped.cc:164) Time taken 2073.1s: real-time factor assuming 100 frames/sec is 5.44893
LOG (latgen-faster-mapped[5.5.324~1-f267]:main():latgen-faster-mapped.cc:167) Done 59 utterances, failed for 0
LOG (latgen-faster-mapped[5.5.324~1-f267]:main():latgen-faster-mapped.cc:169) Overall log-likelihood per frame is 0.0828848 over 38046 frames.

TParcollet commented 5 years ago

Hi! OK, I think you will have to be a bit more precise about what you have implemented so that we can help you. Pytorch-Kaldi does not provide any E2E recipe. Have you implemented a custom CTC loss or something like that?

emirdemirel commented 5 years ago

Many thanks for your timely response. I was mistaken to use the term E2E. I have been trying the hybrid models based on the LiGRU/LSTM/RNN architectures available in the toolkit to get the posterior probabilities of the phone states, and I have a pretrained HMM graph (a triphone model with SAT training) to decode with. Sorry for any misunderstanding.
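For context: in this hybrid setup, the network's log-posteriors are converted into scaled likelihoods before being handed to the Kaldi decoder, by subtracting the log state priors estimated from the alignment counts. A minimal sketch of that normalization (variable names are hypothetical, not the repository's exact code):

import numpy as np

# Hybrid decoding: turn log-posteriors into scaled log-likelihoods by
# removing the log priors of the pdf states (estimated from alignments).
def posteriors_to_pseudo_likelihoods(log_post, state_counts):
    # log_post: (n_frames, n_pdf) log-softmax outputs of the network
    # state_counts: per-pdf occupancy counts collected from the alignments
    log_priors = np.log(state_counts / state_counts.sum())
    return log_post - log_priors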

mravanelli commented 5 years ago

Hi Emir, your task is probably not that standard and might require different decoding hyperparameters (I suggest using the same ones you are using in your Kaldi recipe). Could you post the res.res file? It can help us make sure that training and validation are fine...

Mirco

emirdemirel commented 5 years ago

Thanks for the suggestion. I will try to use the same parameters as in the Kaldi recipe; perhaps lower values for the beam and lat_beam parameters.
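For reference, these knobs live in the [decoding] section of the pytorch-kaldi cfg file and are passed through to decode_dnn.sh. A sketch of that section (the values shown follow the repository's TIMIT example cfgs, so treat them as assumptions to be tuned for this task rather than recommendations):

[decoding]
decoding_script_folder = kaldi_decoding_scripts/
decoding_script = decode_dnn.sh
decoding_proto = proto/decoding.proto
min_active = 200
max_active = 7000
max_mem = 50000000
beam = 13.0
latbeam = 8.0
acwt = 0.2
max_arcs = -1
skip_scoring = false
scoring_script = local/score.sh
scoring_opts = "--min-lmwt 1 --max-lmwt 10"
norm_vars = False

Note that max_mem here matches the "size exceeds maximum 50000000 bytes" warnings in the log above, which is why the effective beam keeps being reduced during determinization.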

Here is what I have in the res.res file:

ep=00 tr=['train30'] loss=0.897 err=0.310 valid=dev loss=0.964 err=0.336 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=15510
ep=01 tr=['train30'] loss=0.762 err=0.269 valid=dev loss=0.829 err=0.294 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=13772
ep=02 tr=['train30'] loss=0.706 err=0.250 valid=dev loss=0.841 err=0.293 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=15380
ep=03 tr=['train30'] loss=0.680 err=0.241 valid=dev loss=0.826 err=0.290 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=15947
ep=04 tr=['train30'] loss=0.669 err=0.237 valid=dev loss=0.991 err=0.306 lr_architecture1=0.0002 lr_architecture2=0.0002 time(s)=15978
ep=05 tr=['train30'] loss=0.642 err=0.229 valid=dev loss=1.112 err=0.302 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=15952
ep=06 tr=['train30'] loss=0.635 err=0.226 valid=dev loss=1.088 err=0.297 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=15933
ep=07 tr=['train30'] loss=0.627 err=0.224 valid=dev loss=0.948 err=0.294 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=15952
ep=08 tr=['train30'] loss=0.625 err=0.223 valid=dev loss=0.969 err=0.292 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=15949
ep=09 tr=['train30'] loss=0.621 err=0.222 valid=dev loss=1.086 err=0.295 lr_architecture1=0.0001 lr_architecture2=0.0001 time(s)=15939
ep=10 tr=['train30'] loss=0.612 err=0.218 valid=dev loss=1.060 err=0.297 lr_architecture1=5e-05 lr_architecture2=5e-05 time(s)=15932
ep=11 tr=['train30'] loss=0.609 err=0.215 valid=dev loss=1.149 err=0.303 lr_architecture1=2.5e-05 lr_architecture2=2.5e-05 time(s)=15949
ep=12 tr=['train30'] loss=0.794 err=0.214 valid=dev loss=1.106 err=0.297 lr_architecture1=1.25e-05 lr_architecture2=1.25e-05 time(s)=15965
ep=13 tr=['train30'] loss=0.613 err=0.213 valid=dev loss=1.066 err=0.292 lr_architecture1=1.25e-05 lr_architecture2=1.25e-05 time(s)=15959
ep=14 tr=['train30'] loss=0.594 err=0.213 valid=dev loss=1.059 err=0.293 lr_architecture1=1.25e-05 lr_architecture2=1.25e-05 time(s)=15974
ep=15 tr=['train30'] loss=0.593 err=0.212 valid=dev loss=1.118 err=0.298 lr_architecture1=6.25e-06 lr_architecture2=6.25e-06 time(s)=17503
ep=16 tr=['train30'] loss=0.592 err=0.212 valid=dev loss=1.069 err=0.295 lr_architecture1=3.125e-06 lr_architecture2=3.125e-06 time(s)=20445
ep=17 tr=['train30'] loss=0.592 err=0.212 valid=dev loss=1.133 err=0.296 lr_architecture1=3.125e-06 lr_architecture2=3.125e-06 time(s)=16000
ep=18 tr=['train30'] loss=0.593 err=0.211 valid=dev loss=1.070 err=0.292 lr_architecture1=1.5625e-06 lr_architecture2=1.5625e-06 time(s)=16006
ep=19 tr=['train30'] loss=0.592 err=0.212 valid=dev loss=1.084 err=0.293 lr_architecture1=1.5625e-06 lr_architecture2=1.5625e-06 time(s)=16026
ep=20 tr=['train30'] loss=0.592 err=0.212 valid=dev loss=1.102 err=0.293 lr_architecture1=7.8125e-07 lr_architecture2=7.8125e-07 time(s)=16031
ep=21 tr=['train30'] loss=0.596 err=0.211 valid=dev loss=1.092 err=0.292 lr_architecture1=3.90625e-07 lr_architecture2=3.90625e-07 time(s)=16062
ep=22 tr=['train30'] loss=0.592 err=0.212 valid=dev loss=1.101 err=0.292 lr_architecture1=3.90625e-07 lr_architecture2=3.90625e-07 time(s)=16032
ep=23 tr=['train30'] loss=0.591 err=0.211 valid=dev loss=1.089 err=0.294 lr_architecture1=3.90625e-07 lr_architecture2=3.90625e-07 time(s)=16057

As you can see, after around the 12th epoch the learning curve reaches a plateau, so it might be a good idea to stop training there.

However, the code didn't produce a WER.
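The halving pattern in the lr_architecture columns above comes from a relative-improvement schedule: the learning rate is multiplied by a halving factor whenever the validation error stops improving enough between epochs. A minimal sketch of the idea (names are illustrative, not the exact configuration keys):

def update_lr(lr, prev_valid_err, valid_err, halving_factor=0.5, threshold=0.001):
    # Halve the learning rate when the relative improvement of the
    # validation error falls below the threshold, as in the run above.
    relative_gain = (prev_valid_err - valid_err) / prev_valid_err
    return lr * halving_factor if relative_gain < threshold else lr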

mravanelli commented 5 years ago

Hi, I think you should also try to reduce the learning rate (maybe divide the current one by 10), because it seems that the validation error doesn't go down after epoch 2 (you could be in an overfitting regime).

emirdemirel commented 5 years ago

Thanks for the training suggestion; I agree that training with different hyperparameters is needed. Yet, I don't think the problem with the decoding script was related to this. The problem was that decode_dnn.sh was not being executed, and I was able to solve it. In my system, the problem was with the run_shell function from utils.py. I fixed it by adding the following import to run_exp.py:

import subprocess, shlex

and then executing decode_dnn.sh with the following call (instead of run_shell(cmd_decode,log_file)):

subprocess.call(shlex.split(cmd_decode))
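A slightly fuller variant of this workaround, which also keeps the decoder's output in the log file the way run_shell did (log_file and cmd_decode are the variables already present in run_exp.py; the rest is a sketch and assumes cmd_decode contains no shell redirections or pipes):

import shlex
import subprocess

# Launch the decoding command directly and mirror its output into the log.
with open(log_file, "a") as log:
    ret = subprocess.call(shlex.split(cmd_decode), stdout=log, stderr=subprocess.STDOUT)
if ret != 0:
    print("decoding command failed with exit code %d, see %s" % (ret, log_file))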

With that change, the decoding script is executed. However, I am facing another problem now: for each and every utterance, I get the following error:

/homes/ed308/pytorch-kaldi/kaldi_decoding_scripts/decode_dnn.sh: line 94: 33408 Aborted (core dumped) latgen-faster-mapped$thread_string --min-active=$min_active --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$latbeam --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt $alidir/final.mdl $graphdir/HCLG.fst "$finalfeats" "ark:|gzip -c > $dir/lat.$JOB.gz" &>$dir/log/decode.$JOB.log

Do you think it is related to memory usage?

Many thanks for the answers

emirdemirel commented 5 years ago

Alright, so after a lot of effort, I was able to run the pytorch-kaldi pipeline successfully and achieve initial results, using the fixes described above for the problems I mentioned.

Even with a single epoch of training with MLPs (not the best option, but a fast one to try), I achieved results similar to the triphone SAT GMM-HMM system trained on fMLLR features.

Thanks for all the answers and support. I hope this post will be helpful for those who struggle with similar problems.

Cheers