senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0
Support for rescoring lattices #5

Closed senarvi closed 8 years ago

senarvi commented 8 years ago

Enable the user to rescore word lattices in addition to n-best lists.

sameerkhurana10 commented 8 years ago

Hi,

Is there any update on this? Any direction on how to use TheanoLM for rescoring Kaldi lattices would be a life saver!

Thanks

senarvi commented 8 years ago

Currently I don't have time to implement this. I've used TheanoLM only for rescoring n-best lists created from lattices. I don't know how much rescoring lattices directly would improve the results.

You can either 1) convert Kaldi lattices to SLF using convert_slf_parallel.sh from Kaldi, and then use lattice-tool from SRILM to create n-best lists, or 2) extract n-best lists from Kaldi lattices using lattice-to-nbest, rescore them using TheanoLM, and convert back to Kaldi format using nbest-to-lattice. I have a script for this. I can put it in the repository.
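
As a rough illustration of what rescoring an n-best list amounts to in option 2, here is a minimal Python sketch. It assumes each hypothesis carries an acoustic cost, the original n-gram LM cost, and a neural LM cost (all negative log probabilities), and that the two LM costs are linearly interpolated with a weight before the LM scale is applied. The class, the field names, and the numbers are hypothetical; this is not the code of the TheanoLM Kaldi scripts, and the interpolation scheme is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: str
    acoustic_cost: float  # negative log-likelihood from the acoustic model
    ngram_cost: float     # negative log-probability from the original LM
    nnlm_cost: float      # negative log-probability from the neural LM


def total_cost(hyp, nnlm_weight, lm_scale):
    """Combine the costs; linear interpolation is an assumption of this sketch."""
    lm_cost = (1.0 - nnlm_weight) * hyp.ngram_cost + nnlm_weight * hyp.nnlm_cost
    return hyp.acoustic_cost + lm_scale * lm_cost


def rescore(nbest, nnlm_weight, lm_scale):
    """Return the hypotheses sorted from lowest to highest total cost."""
    return sorted(nbest, key=lambda h: total_cost(h, nnlm_weight, lm_scale))


# Made-up 3-best list for one utterance.
nbest = [
    Hypothesis("recognize speech", 100.0, 12.0, 9.0),
    Hypothesis("wreck a nice beach", 98.0, 13.0, 16.0),
    Hypothesis("recognise speech", 101.0, 12.5, 11.0),
]

best_without_nnlm = rescore(nbest, nnlm_weight=0.0, lm_scale=1.0)[0]
best_with_nnlm = rescore(nbest, nnlm_weight=1.0, lm_scale=1.0)[0]
print(best_without_nnlm.words)
print(best_with_nnlm.words)
```

With a weight of 0 only the original LM cost is used, and with a weight of 1 only the neural LM cost is used, so the top hypothesis can change between the two settings.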

senarvi commented 8 years ago

Here are two scripts that I would place in the directory hierarchy of a Kaldi recipe:

https://github.com/senarvi/theanolm/blob/master/kaldi/steps/lmrescore_theanolm_nbest.sh
https://github.com/senarvi/theanolm/blob/master/kaldi/utils/theanolm_compute_scores.sh

They can be used in a similar way to lmrescore_rnnlm_lat.sh. Let me know if this works out for you.

sameerkhurana10 commented 8 years ago

Thank you, I will try it out.

sameerkhurana10 commented 8 years ago

Hi,

I was trying to rescore the lattices using the script provided here. I was able to generate the rescored lattices without any problems or errors in the log. But when I run my score.sh on the lattices produced by the LSTM rescoring script, I get the following in the log:

mkdir -p exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/nnlm_weight_0.3/score_13/ && lattice-1best --lm-scale=13 "ark:gunzip -c exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/nnlm_weight_0.3/lat.*.gz |" ark:- | lattice-align-words exp/mer80/chain/tdnn_6z_sp/graph/phones/word_boundary.int exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/final.mdl ark:- ark:- | nbest-to-ctm ark:- - | utils/int2sym.pl -f 5 exp/mer80/chain/tdnn_6z_sp/graph/words.txt | utils/convert_ctm.pl data/dev_non_overlap_hires/segments data/dev_non_overlap_hires/reco2file_channel > exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/nnlm_weight_0.3/score_13/dev_non_overlap_hires.ctm
lattice-1best --lm-scale=13 'ark:gunzip -c exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/nnlm_weight_0.3/lat.*.gz|' ark:-
nbest-to-ctm ark:- -
lattice-align-words exp/mer80/chain/tdnn_6z_sp/graph/phones/word_boundary.int exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/final.mdl ark:- ark:-
WARNING (lattice-align-words:OutputArcForce():word-align-lattice.cc:591) Discarding word-ids at the end of a sentence, that don't have alignments.
WARNING (lattice-align-words:main():lattice-align-words.cc:105) Outputting partial lattice for 01BE8E7B-C179-42E3-8521-109C2C732334_spk-0001_seg-0001345:0001695
WARNING (lattice-align-words:OutputArcForce():word-align-lattice.cc:591) Discarding word-ids at the end of a sentence, that don't have alignments.
WARNING (lattice-align-words:main():lattice-align-words.cc:105) Outputting partial lattice for 01BE8E7B-C179-42E3-8521-109C2C732334_spk-0001_seg-0001695:0002182
WARNING (lattice-align-words:OutputArcForce():word-align-lattice.cc:591) Discarding word-ids at the end of a sentence, that don't have alignments.
WARNING (lattice-align-words:main():lattice-align-words.cc:105) Outputting partial lattice for 01BE8E7B-C179-42E3-8521-109C2C732334_spk-0001_seg-0002182:0002565
WARNING (lattice-align-words:OutputArcForce():word-align-lattice.cc:591) Discarding word-ids at the end of a sentence, that don't have alignments.
WARNING (lattice-align-words:main():lattice-align-words.cc:105) Outputting partial lattice for 01BE8E7B-C179-42E3-8521-109C2C732334_spk-0001_seg-0003154:0003720
WARNING (lattice-align-words:LatticeWordAligner():word-align-lattice.cc:263) [Lattice has input epsilons and/or is not input-deterministic (in Mohri sense)]-\
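
A log like this is easier to judge once the warnings are grouped by where they come from. The following Python sketch counts warnings per function and source line. It assumes the `WARNING (program:function():file:line)` prefix seen in the log above; the regular expression, the function name, and the shortened utterance ID in the sample are illustrative, and other Kaldi versions may format the prefix differently.

```python
import re
from collections import Counter

# Matches the prefix "WARNING (program:function():file:line)".
WARNING_RE = re.compile(
    r"WARNING \((?P<program>[^:()\s]+):(?P<function>\w+)\(\):"
    r"(?P<source>[^:()\s]+):(?P<line>\d+)\)"
)


def count_warnings(log_text):
    """Count warnings by (function, source file, line number)."""
    counts = Counter()
    for match in WARNING_RE.finditer(log_text):
        key = (match["function"], match["source"], int(match["line"]))
        counts[key] += 1
    return counts


# Short excerpt in the style of the log above; "utt-0001" is a placeholder ID.
sample_log = (
    "WARNING (lattice-align-words:OutputArcForce():word-align-lattice.cc:591) "
    "Discarding word-ids at the end of a sentence, that don't have alignments. "
    "WARNING (lattice-align-words:main():lattice-align-words.cc:105) "
    "Outputting partial lattice for utt-0001 "
    "WARNING (lattice-align-words:OutputArcForce():word-align-lattice.cc:591) "
    "Discarding word-ids at the end of a sentence, that don't have alignments."
)

warning_counts = count_warnings(sample_log)
for (function, source, line), count in warning_counts.most_common():
    print(f"{count:4d}  {function}  {source}:{line}")
```

Because the pattern is searched across the whole text rather than line by line, it also works when several warnings have been run together on one line.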

My score.sh works by first generating ctm files from the lattices and then scoring them against a reference stm file using sclite. The log above is the output of the ctm file generation step:

$cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring/log/best_path.LMWT.log \
  mkdir -p $dir/score_LMWT/ '&&' \
  lattice-1best --lm-scale=LMWT "ark:gunzip -c $dir/lat.*.gz|" ark:- | \
  lattice-align-words $lang_or_graph/phones/word_boundary.int $srcdir/final.mdl ark:- ark:- | \
  nbest-to-ctm ark:- - | \
  utils/int2sym.pl -f 5 $symtab | \
  utils/convert_ctm.pl $data/segments $data/reco2file_channel \
  '>' $dir/score_LMWT/${name}.ctm || exit 1;

Here $dir is exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k/nnlm_weight_0.3, which contains the lattices rescored with the LSTM model.

Any ideas or comments on what might be going wrong?

sameerkhurana10 commented 8 years ago

BTW, I ran this command before score.sh:

steps/lmrescore_theanolm_nbest.sh --N 10 --cmd "$decode_cmd" --lm-scale 9 \
  --nnlm-weights "0.3 0.75 1.0" --use-phi true --vocab-format words \
  data/lang_fg_all \
  /data/sls/qcri-scratch/sameer/language_modelling/theanoLM/model.h5 \
  /data/sls/qcri-scratch/sameer/language_modelling/theanoLM/wlist/input.vocab \
  exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_fg \
  exp/mer80/chain/tdnn_6z_sp/decode_dev_non_overlap_lstmlm.h300.voc20k

senarvi commented 8 years ago

I have so little experience with Kaldi that I'm not even sure if the warnings you presented are something one should be worried about. I can only guess that a lattice-determinize might help, but I don't know if that's the right thing to do.

I was able to rescore Kaldi lattices and compute the scores with sclite using this script, which I have now put in the repository too. It doesn't create .ctm files. It converts the transcripts to .trn and normalizes them using csrfilt.sh (from SCTK).

Unfortunately I can't find that experiment anymore, so I don't have the exact commands. I'm currently rebuilding my test system, but I can try that again in a couple of weeks.

One suspicious thing I noticed is that you use --use-phi true, but it looks like your input directory might come from decoding. In that case you shouldn't use it (and I didn't use it).

Another thing I noticed is that lmrescore_theanolm_nbest.sh still expected the vocabulary argument, even though the vocabulary is not needed anymore for evaluation. I removed that from the scripts, though that shouldn't have caused any harm.

sameerkhurana10 commented 8 years ago

Thanks for replying.

I found that the cause is the lattice-rmali command in the lmrescore_theanolm_nbest.sh script. It is not really a bug: normal scoring works without the alignments, but generating a ctm file requires them. I am now testing the script with that piece of code removed. I will let you know if it works.
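
To show why the alignments matter only for ctm output, here is a minimal Python sketch. Each ctm line carries a start time and a duration for a word, which have to come from frame-level alignment information, while a trn line holds only the words and an utterance ID. The functions, the (word, start_frame, num_frames) tuples, the 10 ms frame shift, and the utterance ID are assumptions made for illustration; this is not Kaldi code.

```python
def to_ctm_lines(utt_id, word_alignment, frame_shift=0.01, channel="1"):
    """Format (word, start_frame, num_frames) tuples as ctm lines.

    Each line is: <utterance> <channel> <start-sec> <duration-sec> <word>.
    """
    lines = []
    for word, start_frame, num_frames in word_alignment:
        start = start_frame * frame_shift
        duration = num_frames * frame_shift
        lines.append(f"{utt_id} {channel} {start:.2f} {duration:.2f} {word}")
    return lines


def to_trn_line(utt_id, words):
    """Format a transcript as a trn line; no timing information is needed."""
    return f"{' '.join(words)} ({utt_id})"


# Hypothetical alignment for one utterance, in frames.
alignment = [("hello", 12, 30), ("world", 45, 41)]

ctm_lines = to_ctm_lines("utt-0001", alignment)
trn_line = to_trn_line("utt-0001", [word for word, _, _ in alignment])

for line in ctm_lines:
    print(line)
print(trn_line)
```

The word ends up in the fifth field of each ctm line, which is the field that `utils/int2sym.pl -f 5` maps back to symbols in the scoring pipeline above.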

senarvi commented 8 years ago

Got it. If your solution works with both .ctm and .trn files, feel free to make a pull request, or send it to me and I'll add it to the repository.