srvk / eesen-transcriber

EESEN-based offline transcriber VM using models trained on TEDLIUM and Cantab Research
Apache License 2.0

Tiny Language Model Building #19

Closed - prashantserai closed this issue 6 years ago

prashantserai commented 7 years ago

I'm working on a project where we're trying to recognize spoken sentences in a very specific technical domain and context using the Virtual Machine-based EESEN Offline Transcriber. As one of the experiments, we wanted to try a deterministic language model for a fixed set of sentences and words. I noticed on http://speechkitchen.org/kaldi-language-model-building/ that there's a recipe to build a Tiny Language Model, but I couldn't find the requisite files in the Virtual Machine. Any suggestions on how I could go about building it?

riebling commented 7 years ago

The files may have been added since you downloaded the VM; I think a 'git pull' from the srvk/lm_build repository should fetch the newest version. Or have a look and copy it directly from: https://github.com/srvk/lm_build/blob/master/make_tinylm_graph.sh
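
For example (a sketch; the checkout location inside the VM is an assumption - adjust the paths to match your setup):

```bash
# Option 1: update an existing checkout of srvk/lm_build
cd ~/tools/lm_build        # hypothetical location of your lm_build checkout
git pull

# Option 2: fetch just the script from GitHub
wget https://raw.githubusercontent.com/srvk/lm_build/master/make_tinylm_graph.sh
chmod +x make_tinylm_graph.sh
```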

prashantserai commented 7 years ago

Thanks for your response!

I downloaded that file into ~/eesen/asr_egs/tedlium/v2-30ms/lm_build and copied training_trans_fst.py from ~/tools/eesen-offline-transcriber/local/ to the same place.

My example_txt was a sequence of words like this: "a afternoon alex all am and any application applications apply are at autumn award be between by closes conferences covers day deadline deleted doing dot each edu eleven email everyone feel fifty finds first for free funding good great have hope i if including into is january know let march materials me message nine note november now occurring of one open osu out outside period please pm questions ray reach recommendation recovered responses saw semester sixteen submitted that the third thirty this three to today travel tuesday twenty unable we well when will window writing you"

which are all the different words used in my audio, separated by spaces.

I'm confused, though: what exactly should I expect the created deterministic language model to do? Select the most probable word out of these 97 possible words, or something else?

I ask because the words.txt created in the tinylm folder lists roughly 150k words. It seems to have pulled words from ../data/lang_phn_test_test_newlm as well.

What exactly does make_tinylm_graph.sh take and what does it create?

riebling commented 7 years ago

It's geared more toward example sentences than a bag of words. So you'd want example text containing every permutation of permissible sentences you'd like the system to recognize.
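
A hypothetical example_txt for your domain might look like this (invented sentences built from your word list, one sentence per line):

```
the application window closes january thirty first
please email me the recommendation materials by tuesday
feel free to reach out if you have any questions
```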

Training the 'tiny' LM still uses a general-purpose dictionary to create a lexicon "L"; the example_txt sentences create a grammar "G" (every possible sequence of words, from the first word to the last word of a sentence); and the already-provided graph of tokens "T" is reused (this version of EESEN is trained to use phonemes as the tokens, though characters are another way it can work). These are all composed together into a decoding graph, TLG.fst.
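
For a sense of what that composition looks like under the hood, the typical EESEN-style recipe is roughly the following (a sketch, not necessarily the exact contents of make_tinylm_graph.sh; the lang/ paths are placeholders):

```bash
# Compose lexicon (L) with grammar (G), then determinize, minimize,
# and sort arcs so the result can be composed with T
fsttablecompose lang/L.fst lang/G.fst | \
  fstdeterminizestar --use-log=true | \
  fstminimizeencoded | \
  fstarcsort --sort_type=ilabel > lang/LG.fst

# Compose the token graph (T) with LG to get the final decoding graph
fsttablecompose lang/T.fst lang/LG.fst > lang/TLG.fst
```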

If there are words you want to recognize that aren't in the provided general-purpose dictionary, you'll need to add them to it, as in the earlier instructions - the entry format is sketched below.
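
A dictionary entry is just the word followed by its phoneme pronunciation, one entry per line; for example (hypothetical entries, assuming a CMUdict-style phone set like the one these models use):

```
osu OW EH S Y UW
webinar W EH B AH N AA R
```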

When decoding audio to produce text, the audio is first converted to a sequence of features. These features go into the already-provided trained neural network (the acoustic model), which predicts a sequence of tokens (phonemes). This token sequence is then decoded against the graph TLG.fst to produce the text results, as the final step of the decode process.
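
Schematically, the stages look something like this (a sketch only - the binary names and arguments are from memory of the EESEN/Kaldi tree and may not match your install, so treat them as assumptions and check the transcriber's own decode scripts):

```bash
# 1. Audio -> acoustic features (filterbank features; wav.scp lists the audio files)
compute-fbank-feats scp:wav.scp ark:feats.ark

# 2. Features -> per-frame token (phoneme) posteriors from the trained network
net-output-extract final.nnet ark:feats.ark ark:post.ark

# 3. Posteriors decoded against TLG.fst -> lattice, from which the best text is read
latgen-faster --word-symbol-table=words.txt TLG.fst ark:post.ark "ark:|gzip -c > lat.gz"
```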

In your example, the system is only likely to match the audio well if someone speaks the words in exactly the order they were provided - and even then the output won't make sense, because the words aren't in a sensible order.

riebling commented 6 years ago

closing; aged out