Building Language Model with low amount of data

prashantserai commented 7 years ago

I'm looking to build a language model with a small amount of text, and for experimental purposes I'm also trying with a small amount of example_txt.

So when I was earlier using the instructions under the heading "Adapting your own Language Model for EESEN-tedlium" here

cd ~/eesen/asr_egs/tedlium/v2-30ms/lm_build
./train_lms.sh example_txt local_lm
cd ..
lm_build/utils/decode_graph_newlm.sh data/lang_phn_test

When I tried with an example text of merely 145 words, it was successful in building a language model but the results were pretty bad. Most of the words in the transcript were from outside the example_txt. So, I tried modifying wordlist.txt to only include about ~90 words which were only the words I expected in the transcript.

I got an error like: _computeperplexity: no unigram-state weight for predicted word "BA" (I think it was something other than "BA", "BH" or something... I can find out if it's important)

I played around and realized that the wordlist.txt had to be at least about 47k odd words and that would get rid of the error. So I padded with fake words full of symbols and things were working. (although not as optimally as I'd like)

Since run_adapt.sh seemed to be a better recipe as I wrote about in another discussion, i tried that. Even keeping the original large dictionary, if I just reduced the example_txt to be a small piece of 145 words, it repeatedly gave the message:

_"compute_perplexity: for history-state "", no total-count % is seen (perhaps you didn't put the training n-grams through interpolatengrams?)"

and eventually ended with:

_"Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/vagrant/eesen/tools/kaldi_lm/optimize_alpha.pl line 23. Expecting files adapt_lm/3gram-mincount//ngrams_disc and adapt_lm/3gram-mincount//../word_map to exist E.g. see egs/wsj/s3/local/wsj_train_lm.sh for examples. Finding OOV words in ARPA LM but not in our words.txt gzip: adapt_lm/3gram-mincount/lm_pr6.0.gz: No such file or directory Composing the decoding graph using our own ARPA LM No such file adapt_lm/3gram-mincount/lmpr6.0.gz"

Any thoughts on why these errors are occurring? (more interested in the run_adapt.sh recipe now)

I guess in our target application, we will have much more data than this, but the example_txt could still be much smaller compared to the original example_txt file (specific and narrow domain). So I guess it's worthwhile for me to understand the problem occurring above beyond the scope of the toy experiment too.

riebling commented 7 years ago

I'm not familiar enough with the lower-level scripts and programs to truly understand these errors. In general we know it does fail without larege enough sized training data. Your experiment with smaller wordlist and 'fake' words is interesting, and helped define 'enough'! Clearly one of the steps failed to produce some files needed by later steps ("Expecting ngrams_disc and word_map to exist" and lm_pr6.0.gz: No such file or directory")

I see an earlier message from you where we suggested using the deterministic "tinylm" recipe (make_tinylm_graph.sh), and a set of example sentences with every expected permutation of sentence as the training text. Does that work for this application? This kind of thing helps if you are building a dialog system with a small set of recognized words and spoken commands.

One other TIP that needs to be documented more prominently (I'll get right on that) - when decoding with a small, deterministic LM built in this way, it helps to use a larger beam setting in the decoder. We found that setting the beam to 23.0 as opposed to the default 15.0 helped a lot using our system to do word-level alignments of text+audio.

prashantserai commented 7 years ago

I think that 'enough' was defined for a recipe that might now also be buggy given that it was failing at recreating the original LM or something close to it.

More interested in the run_adapt.sh recipe now. It's surprising that even though using the full set of words, it gives an error even when just the example_txt is small. Maybe I should look deeper into it and find out why.

The "tinylm" was primarily for experimental purposes again, it won't work for our final application. It's speech recognition for a specific defined task, but the phrases spoken per se would be naturally generated.

prashantserai commented 7 years ago

I had gotten this to work, so thought I'd share my solution here.

In train_lm.sh (~/eesen/tools/kaldi_lm/train_lm.sh) the first line was heldout_sent=10000, I changed it to a lower value (in my case, heldout_sent=100 worked!)

As I understand this roughly (exactly?) relates to the number of lines in the example_txt file that are held out during training, to tune the discounting parameters of the Kneser-Ney language model. In my case the most recent example_txt file I was trying was merely about 5000 lines.

Here's the relevant comments from the file:

# We assume that we want to use the first n sentences of the
# training set as a validation and tuning set (they'll be used,
# for example, to estimate discounting factors). For now
# its size is hardwired at 10k sentences.

Someone had a similar problem with KALDI and had posted about it here. Dan Povey had responded to it and amongst other things hinted that, heldout_sent should be substantially less than the total number of sentences (e.g. one tenth of it). (His main suggestion was to use SRILM instead of kaldi_lm)

In my case, I tried heldout_sent=500 and that didn't work, but when I tried heldout_sent=100, that worked for me.

My understanding is, one would want choose the largest value of heldout_sent that works for them without crashing, and possibly also randomly shuffle the sequence of sentences in their example_txt file.

riebling commented 7 years ago

This is very helpful, vs. the hard-coded 10000 being rather UN-helpful (although to be fair, it could be overridden as a command line option to train_lm.sh). I had faced similar struggles working with a 'tiny' LM, and had tried a variation of train_lm.sh as a work-around: for certain parts of the training text, user-added sentences were never making their way into the final LM because they were held out. My first clue this was happening was discovering that adding sentences to beginning of the training text worked differently than than at the end. This eventually led to discovering how the 10k sentences were just being "chopped off" and disregarded. This also explained why training texts much smaller than 10k sentences resulted in errors.

A work-around recipe computes ngram and discounted ngram data as unheldout_ngrams.gz and unheldout_ngrams_disc.gz for the WHOLE of the training text, including heldout sentences: lm_build/utils/train_lm_10k.sh Diff:

diff kaldi_lm/train_lm.sh utils/train_lm_10k.sh 
96a97,103
>
>     # er1k experiment
>     gunzip -c $dir/train.gz | head -n $heldout_sent | \
>         get_raw_ngrams 3 | sort | uniq -c | uniq_to_ngrams | \
>      sort | discount_ngrams $subdir/config.get_ngrams | \
>      sort | merge_ngrams | gzip -c > $subdir/unheldout_ngrams.gz
>
371a379,383
>   gunzip -c $subdir/unheldout_ngrams.gz | \
>    discount_ngrams $subdir/config.$num_configs | sort | merge_ngrams | \
>    gzip -c > $subdir/heldout_ngrams_disc.gz
>
>
384c396
<   gunzip -c $subdir/ngrams_disc.gz | \
---
>   gunzip -c $subdir/ngrams_disc.gz | cat - <(gunzip -c $subdir/unheldout_ngrams_disc.gz) | \

Eventually I abandoned this approach (for adding custom sentences to an existing general-English training text), in favor of the deterministic 'tiny lm' approach, for when there are really only a handful of sentences (or dialog system commands) to be recognized. Which reminds me that someone should put that approach back into our VM for dialog systems :-)

Thanks for your input, this discussion has really helped.

prashantserai commented 7 years ago

Just wanted to add a few more things: -In our specific case, we're looking to recognize limited kinds of commands, sure, but the user is free to express those commands in their own language. A deterministic LM doesn't quite work for us. -Even after changing the heldout_sent to a small number such as 50, when my file is 5000 sentences, it crashes on certain random splits of the data (or certain random shuffles of the data, rather). The errors that I get are of a similar kind as described above. I am currently making do with a (non-deterministic) LM built on the shuffle for which the code executed successfully, but a better solution would be good.

riebling commented 7 years ago

I have also observed crashes of the lower level kaldi_lm tools, and didn't know at the time whether it was a fault of my configuration, or the tools themselves. Curious if you have observed, was it one of:

interpolate_ngrams
discount_ngrams
get_raw_ngrams
merge_ngrams
compute_perplexity
get_perplexity

prashantserai commented 7 years ago

The error is very much like

compute_perplexity: no unigram-state weight for predicted word "BA"

any clues?

riebling commented 7 years ago

What's interesting is Dan Povey's suggestion to not even use KaldiLM tools, but to favor SRILM here.
If we open up the possibility to try other tools, I think Florian is fond of getting people to use KenLM. I don't have experience with these tools to provide further support, though.

srvk / lm_build

Building Language Model with low amount of data #1