mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

Librispeech: Adding lattice rescoring #232

Closed omprakashsonie closed 4 years ago

omprakashsonie commented 4 years ago

Hi Ravanelli, the following steps are given on GitHub: "You can improve the performance by adding lattice rescoring in this way (run it from the kaldi_decoding_script folder of Pytorch-Kaldi):"

data_dir=/data/milatmp1/ravanelm/librispeech/s5/data/
dec_dir=/u/ravanelm/pytorch-Kaldi-new/exp/libri_fmllr/decode_test_clean_out_dnn1/
out_dir=/u/ravanelm/pytorch-kaldi-new/exp/libri_fmllr/

steps/lmrescore_const_arpa.sh $data_dir/lang_test_{tgsmall,fglarge} \
    $data_dir/test_clean $dec_dir $out_dir/decode_test_clean_fglarge || exit 1;

But I don't see the following:

  1. exp/libri_fmllr: what should be in this directory? I see fmllr in the folder ~/kaldi/egs/librispeech/s5-960.

  2. There is no steps directory in the kaldi_decoding_scripts folder. It contains the following:

~/pytorch-kaldi/kaldi_decoding_scripts$ ls
cmd.sh conf decode_dnn.sh local parse_options.sh path.sh split_data.sh utils

steps/lmrescore_const_arpa.sh is in the folder ~/kaldi/egs/librispeech/s5-960. Should I give the full path for steps/lmrescore_const_arpa.sh?

  3. I ran the first 13 steps of the Kaldi recipe on the 960 hours of data. Does this mean that the subsequent steps I ran for Librispeech also used the 960 hours of data, and that I should compare my numbers with the 960-hour experiment?

Any help will be appreciated.
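
For readers following along, here is a minimal sketch of how the arguments of the quoted rescoring command are laid out. In bash, a brace expression such as {tgsmall,fglarge} expands into two separate words, so the script ends up with five positional arguments: the language directory used for decoding, the language directory of the larger model used for rescoring, the test data directory, the existing decode directory, and the output directory. The function name rescore_args is hypothetical, and the directory names lang_test_tgsmall and lang_test_fglarge are assumed from the error messages later in this thread.

```shell
#!/bin/sh
# Sketch only: print the five arguments that would be passed to
# steps/lmrescore_const_arpa.sh, one per line, without running Kaldi.
# rescore_args is a hypothetical helper, not part of Kaldi or pytorch-kaldi.

rescore_args() {
    data_dir=${1%/}   # strip one trailing slash, if present
    dec_dir=${2%/}
    out_dir=${3%/}
    printf '%s\n' \
        "$data_dir/lang_test_tgsmall" \
        "$data_dir/lang_test_fglarge" \
        "$data_dir/test_clean" \
        "$dec_dir" \
        "$out_dir/decode_test_clean_fglarge"
}

# Example call with placeholder paths (replace them with your own):
# rescore_args /path/to/kaldi/recipe/data /path/to/exp/decode_dir /path/to/exp
```

Printing the arguments first makes it easier to see which of the five paths points at a directory that does not exist yet.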

omprakashsonie commented 4 years ago

Hi Ravanelli, Any help will be appreciated.

TParcollet commented 4 years ago

Hi,

Sorry for the delay. Did you change the paths in

data_dir=/data/milatmp1/ravanelm/librispeech/s5/data/
dec_dir=/u/ravanelm/pytorch-Kaldi-new/exp/libri_fmllr/decode_test_clean_out_dnn1/
out_dir=/u/ravanelm/pytorch-kaldi-new/exp/libri_fmllr/

according to your setup?

omprakashsonie commented 4 years ago

I am just starting.

  1. I will change data_dir to: data_dir=librispeech/s5/data/

  2. My question is about dec_dir=/exp/libri_fmllr/decode_test_clean_out_dnn1/. There is no 'libri_fmllr' in the 'exp' directory, and so no 'decode_test_clean_out_dnn1' sub-directory either. How are these created, and what is stored in them?

  3. I will change the out_dir path once I get through step 2: out_dir=/u/ravanelm/pytorch-kaldi-new/exp/libri_fmllr/

TParcollet commented 4 years ago

I am sorry, but I don't understand where you currently are.

Did you do step one (run the Kaldi recipe), and can you see all the correct directories created in exp/, such as mono, tri2, tri3, etc.?

omprakashsonie commented 4 years ago

I have completed the following:

image

Now I am trying to follow these steps:

image

TParcollet commented 4 years ago

I'm pretty sure that steps/lmrescore_const_arpa.sh is meant to be called from the Kaldi directory. You can either call it from there or specify the full path to the Kaldi copy of steps/lmrescore_const_arpa.sh.
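
A minimal sketch of the first option, running the script from inside the Kaldi recipe directory. It assumes that the script looks up helper files such as path.sh relative to the current working directory, which is how Kaldi recipe scripts are commonly written. The helper name run_in_recipe and the variable KALDI_RECIPE are hypothetical.

```shell
#!/bin/sh
# Sketch only: run a command from inside a Kaldi recipe directory.
# run_in_recipe and KALDI_RECIPE are hypothetical names for illustration.

run_in_recipe() {
    recipe_dir=$1
    shift
    # Refuse to run if the directory does not look like a recipe folder.
    if [ ! -d "$recipe_dir/steps" ]; then
        echo "no steps/ directory under $recipe_dir" >&2
        return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$recipe_dir" && "$@" )
}

# Example (not executed here); every path is a placeholder:
# run_in_recipe "$KALDI_RECIPE" steps/lmrescore_const_arpa.sh \
#     "$old_lang" "$new_lang" "$data_dir/test_clean" \
#     "$dec_dir" "$out_dir/decode_test_clean_fglarge"
```

When the command runs from a different directory, the data, decode, and output directories are best given as absolute paths, since relative ones would be resolved against the recipe folder.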

omprakashsonie commented 4 years ago

OK, I will provide the full path for steps/lmrescore...

As the 'libri_fmllr' and 'decode_test_clean_out_dnn1' directories don't exist, should I create them and then run lmrescore..?

TParcollet commented 4 years ago

No, they should be created by step 3.

omprakashsonie commented 4 years ago

Thanks a lot TParcollet for your inputs.

I was looking for the directory 'libri_fmllr' in the pytorch-kaldi 'exp' directory. It looks like the name has changed to 'libri_MLP_fmllr'.

After correcting the directories:

data_dir=/home/omprakash.s/pytorch-kaldi/exp/libri_MLP_fmllr
dec_dir=/home/omprakash.s/pytorch-kaldi/exp/libri_MLP_fmllr/decode_test_clean_out_dnn1
out_dir=/home/omprakash.s/pytorch-kaldi/exp/libri_MLP_fmllr/

/home/omprakash.s/kaldi/egs/librispeech/s5-960/steps/lmrescore_const_arpa.sh $data_dir/lang_test_{tgsmall,fglarge} $data_dir/test_clean $dec_dir $out_dir/decode_test_clean_fglarge || exit 1;

I get errors for the following directories and files:

  1. libri_MLP_fmllr/lang_test_tgsmall/words.txt: No such file or directory
  2. Missing file /home/omprakash.s/pytorch-kaldi/exp/libri_MLP_fmllr/lang_test_tgsmall/G.fst
  3. libri_MLP_fmllr/lang_test_fglarge/words.txt: No such file or directory
  4. Missing file /home/omprakash.s/pytorch-kaldi/exp/libri_MLP_fmllr/lang_test_fglarge/G.carpa

After copying these from Kaldi, I got this error: score.sh: no such file /home/omprakash.s/pytorch-kaldi/exp/libri_MLP_fmllr/test_clean/text

omprakashsonie commented 4 years ago

Any help will be appreciated.

TParcollet commented 4 years ago

You are having trouble with your paths on the Kaldi side. I am sorry that I cannot help here, but you have to fix your paths. Does exp/libri_MLP_fmllr/lang_test_tgsmall/G.fst exist? Or is it exp/libri_fmllr/lang_test_tgsmall/G.fst? The best solution for you is to look at https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/lmrescore_const_arpa.sh and understand exactly what arguments this script requires. Then you can reorder your paths accordingly.
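
One way to check the paths before launching the rescoring is a small pre-flight test, sketched below. It only looks for the files named in the error messages above (G.fst, G.carpa, words.txt, and text); the real script may need more inputs than these. The function name check_rescoring_inputs is hypothetical.

```shell
#!/bin/sh
# Sketch only: report which of the files named in the error messages are
# missing. check_rescoring_inputs is a hypothetical helper, not a Kaldi tool.
# Arguments: <old-lang-dir> <new-lang-dir> <data-dir>

check_rescoring_inputs() {
    old_lang=$1
    new_lang=$2
    data=$3
    missing=0
    for f in "$old_lang/G.fst" "$old_lang/words.txt" \
             "$new_lang/G.carpa" "$new_lang/words.txt" \
             "$data/text"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Returns 0 when every file is present, 1 otherwise.
    return "$missing"
}

# Example (not executed here); the paths are placeholders:
# check_rescoring_inputs "$data_dir/lang_test_tgsmall" \
#     "$data_dir/lang_test_fglarge" "$data_dir/test_clean"
```

If the check reports missing files, that suggests data_dir is pointing at the wrong folder. Copying individual files into place would only hide that.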

omprakashsonie commented 4 years ago

I think the issue is with step 3. Maybe there is a different step 3 script for lmrescore, or there is some issue with the current step 3.

It is not creating 'lang_test_tgsmall', G.fst, or the other required files that lmrescore_const_arpa.sh reports as missing.
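
In the command quoted at the top of the thread, data_dir points at a Kaldi recipe data folder (librispeech/s5/data/) rather than at a pytorch-kaldi exp folder, so the lang_test_* directories may already exist on the Kaldi side. A small sketch for locating them, assuming they sit somewhere under a directory you choose to search; the function name find_lang_dirs is hypothetical.

```shell
#!/bin/sh
# Sketch only: list every directory named lang_test_* below a search root.
# find_lang_dirs is a hypothetical helper for illustration.

find_lang_dirs() {
    # Errors from unreadable sub-directories are hidden; output is sorted.
    find "$1" -type d -name 'lang_test_*' 2>/dev/null | sort
}

# Example (not executed here); the search root is a placeholder:
# find_lang_dirs "$HOME/kaldi/egs/librispeech"
```

The parent directory of the listed folders is then a candidate value for data_dir.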