srvk / eesen-transcriber

EESEN based offline transcriber VM using models trained on TEDLIUM and Cantab Research
Apache License 2.0

Adding Technical Words to Dictionary #18

Closed bakerstreetsystems closed 7 years ago

bakerstreetsystems commented 7 years ago

This package is awesome so far! It was WAY simpler to get everything up and going than any of the other methods (e.g. installing CMU Sphinx or Kaldi directly from source). So thank you for a great package.

I'd like to be able to transcribe very technical audio recordings with words like Linux, Laravel, or MySQL, which don't get transcribed very well. How would I go about (easily) adding these words to the transcription software so that they are successfully recognized?

riebling commented 7 years ago

There's some info here that describes how to add new terms to the language model. There's even a feature that tries to guess phonetic pronunciation and generate dictionary entries for new words, though I imagine you could improve on the results if you have a better grasp of the pronunciation(s).

bakerstreetsystems commented 7 years ago

Thank you for your help so far! I've attempted to follow the directions suggested here.

I can successfully run the run_adapt.sh script after adding new vocab to newwords.txt, but when I try to use the updated language model to transcribe the audio file with the new vocab, it doesn't recognize the new vocab.

Here is a video of my attempt to follow the directions on how to adapt the language model:

https://www.youtube.com/watch?edit=vd&v=-Zn9_y56R4c

Any suggestions?

riebling commented 7 years ago

Suggestions? Sure!

I may have left out a step. You not only have to add words to the dictionary (and let the system add phonetic pronunciations), but also add examples of the words "in use" to the example_txt adaptation text. Otherwise the language model being constructed has no statistics linking the new word to the words (or word sequences) that precede or follow it.

In our experience, to get new phrases with new words recognized, we need to repeat the examples in the training text quite a bit, sometimes over a hundred times, to raise the statistical likelihood that the new words and phrases will be predicted during decoding. The examples have to be appended or prepended to example_txt at some point in the process (which end should no longer make a difference, though it used to).
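The repetition step can be scripted rather than done by copy and paste. The following is a minimal sketch, not part of the project's scripts: `add_examples` is a hypothetical helper, and it assumes example_txt is plain text with one sentence per line.

```shell
# Hypothetical helper (not from the eesen-transcriber scripts).
# Appends COUNT copies of every line in SENTENCES to TARGET.
# Usage: add_examples SENTENCES TARGET [COUNT]
add_examples() (
    sentences="$1"
    target="$2"
    count="${3:-100}"   # default: 100 repetitions

    # Read all sentences once, then print the whole set COUNT times.
    # awk also copes with a last line that has no trailing newline.
    awk -v n="$count" '
        { lines[NR] = $0 }
        END {
            for (i = 0; i < n; i++)
                for (j = 1; j <= NR; j++)
                    print lines[j]
        }
    ' "$sentences" >> "$target"
)
```

For example, `add_examples my_sentences.txt example_txt 100` (where my_sentences.txt is a file you create yourself) would be run before rerunning run_adapt.sh.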

Thanks for noticing and trying this out. We should update the documentation to reflect this :)

bakerstreetsystems commented 7 years ago

Thanks for the reply!

Though my video doesn't show me adding any examples of the new vocab being used in sentences, I did try to do this on my own. I added phrases like "i like to program using laravel" and "sometimes laravel is the best tool to use and sometimes it is not" to the example_txt file and then ran the run_adapt.sh script. When that did not work, I tried adding a few more phrases with the word "laravel" in them and then I copied and pasted all the Laravel-related phrases many times to (hopefully) increase their statistical relevance. That didn't work either :-(

Any other suggestions?

riebling commented 7 years ago

Then it's getting to the voodoo stage. I remember trying to verify that new words could be recognized, and seeing different behavior depending on whether I added them to the beginning or the end of example_txt. That was because the scripts were automatically holding out the first 10,000 examples, so new words didn't even take effect until the new word usage examples exceeded 10,000 lines. I believe that is fixed: I'm pretty sure we now train on ALL of example_txt and don't leave out the first 10,000.

There's also a problem if you start trying to REDUCE the size of example_txt, since it is assumed to be much larger than 10,000 lines (I count 183710). So to try an extreme 'crazy' example, what if you included something like 400 repetitions of your word 'in use', both at the beginning and end of example_txt?

If it still isn't recognized, then I'm wondering if something's weird about the pronunciation that gets obtained from the online tool and added to newdict.dct:

laravel L AE R AH V AH L

Maybe you could try modifying the phonetic pronunciation, since this one was just an algorithmic guess by http://www.speech.cs.cmu.edu/tools/lextool.html

Perhaps instead:

laravel L EH R AH V EH L

or:

laravel L AA R AH V EH L

In fact you could include both pronunciations.
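Before rerunning the adaptation, it can help to confirm both conditions mentioned here: how many lines of example_txt use the word, and which pronunciation is currently in the dictionary. This is a rough diagnostic sketch; `check_word` is a hypothetical helper, and it assumes the dictionary has one "word PHONEMES..." entry per line, as in the entries quoted above.

```shell
# Hypothetical diagnostic (not from the eesen-transcriber scripts).
# Usage: check_word WORD TEXT_FILE DICT_FILE
check_word() (
    word="$1"
    text="$2"
    dict="$3"

    # Count lines in the adaptation text that contain WORD as a whole token.
    uses=$(awk -v w="$word" '
        { for (i = 1; i <= NF; i++) if ($i == w) { n++; break } }
        END { print n + 0 }
    ' "$text")
    echo "usage lines: $uses"

    # Print every dictionary entry whose first field is exactly WORD.
    echo "dictionary entries:"
    awk -v w="$word" '$1 == w' "$dict"
)
```

Running `check_word laravel example_txt newdict.dct` would show whether the usage examples and the intended pronunciation are actually in place.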

bakerstreetsystems commented 7 years ago

Still no luck. Here is what I did:

Here is the result:

program_with_laravel 1 0.03 1.56 i 1.00
program_with_laravel 1 1.59 0.03 like 1.00
program_with_laravel 1 1.62 0.27 to 1.00
program_with_laravel 1 1.89 0.42 program 1.00
program_with_laravel 1 2.31 0.19 with 0.76
program_with_laravel 1 2.56 0.19 a 0.27
program_with_laravel 1 2.79 0.60 tell 0.32
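The result above looks like CTM-style output with six columns per word: file, channel, start time, duration, word, and confidence. That reading is an assumption based on the numbers (each start time roughly equals the previous start plus duration). Under that assumption, a small sketch like the following lists the low-confidence words, which is where a misrecognized new word tends to show up. `ctm_low_confidence` is a hypothetical helper name.

```shell
# Hypothetical helper; assumes six whitespace-separated columns:
# file, channel, start, duration, word, confidence.
# Usage: ctm_low_confidence CTM_FILE [THRESHOLD]
ctm_low_confidence() (
    ctm="$1"
    threshold="${2:-0.5}"   # default: flag words below 0.5 confidence

    # Print start time, word, and confidence (tab-separated) for each
    # word whose confidence is below the threshold.
    awk -v t="$threshold" '
        NF >= 6 && ($6 + 0) < (t + 0) { printf "%s\t%s\t%s\n", $3, $5, $6 }
    ' "$ctm"
)
```

On the output above, a threshold of 0.5 would flag "a" (0.27) and "tell" (0.32), the two words produced where "laravel" was spoken.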

riebling commented 7 years ago

Did that include modifying the pronunciation dictionary entry before running run_adapt.sh? (I added that step as a later edit.) README.md now describes the process.

bakerstreetsystems commented 7 years ago

Woohoo!!! It works!!! I edited the newdict.dct file and replaced laravel L AE R AH V AH L with laravel L EH R AH V EH L.

But then the run_adapt.sh script would automatically overwrite my changes (because it was re-querying the CMU Speech tool to get the default pronunciation). So I added a little bit of code (below) to the run_adapt.sh script to allow a pronunciation override. Then I created a file called pronunciation_overrides.txt for, you guessed it, the override pronunciations.

The code below should be added to run_adapt.sh right after the part where it's automatically looking up the pronunciation from CMU Speech tool and right before it says "Constructing the phoneme-based lexicon". As of today, you can paste this code after line 59.

# Added by Jason Jensen to allow for pronunciation overrides.
# If there are any words whose default pronunciation you would like to change,
# enter them in the pronunciation_overrides.txt file in this same directory
# (if the file doesn't exist, create it). The format should be the same as the default dictionary.
# Example for adding Laravel (a great PHP framework) to the dictionary:
# laravel L EH R AH V EH L

if [ -f pronunciation_overrides.txt ]; then
    echo "Looping through pronunciation overrides found in pronunciation_overrides.txt:"
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue   # skip blank lines
        set -- $line                 # $1 is now the word being overridden
        echo "   $line"
        # Replace the existing entry for this word. The "^" anchors the match
        # to the start of the line so that longer words ending in $1 are not replaced.
        sed -i "/^$1 /c $line" newdict.dct
    done < pronunciation_overrides.txt
fi

And here is a sample of the pronunciation_overrides.txt file:

laravel L EH R AH V EH L

This kind of scripting is not at all my expertise, but it works! Woohoo!

riebling commented 7 years ago

And there was much rejoicing! I appreciate your extra scripting (especially knowing it's not your forte), but I had actually updated my reply on GitHub to describe a slightly less inelegant way: add pronunciations directly to the TEDLIUM dictionary file, since it doesn't get rewritten.

Very glad to see this finally worked :)
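The dictionary-file approach could be sketched as follows. This is only an illustration under assumptions: the thread does not name the TEDLIUM dictionary file, so its path is passed in as an argument; the entry format is assumed to match the "word PHONEMES..." lines quoted earlier; and whether the file needs to stay sorted is not stated here. `add_pronunciation` is a hypothetical helper.

```shell
# Hypothetical helper (not from the eesen-transcriber scripts).
# Appends "WORD PHONEMES..." to DICT_FILE unless that exact line exists.
# Usage: add_pronunciation DICT_FILE WORD PHONEME [PHONEME...]
add_pronunciation() (
    dict="$1"
    shift
    entry="$*"   # e.g. "laravel L EH R AH V EH L"

    # -x: match the whole line, -F: treat the entry as a fixed string.
    if grep -qxF -- "$entry" "$dict"; then
        echo "already present: $entry"
    else
        printf '%s\n' "$entry" >> "$dict"
        echo "added: $entry"
    fi
)
```

Because the check is for an exact line, calling it twice with different phoneme strings for the same word leaves both pronunciations in the file.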

On Wed, November 30, 2016 5:25 pm, JJ wrote:

Woohoo!!! It works!!! I edited the newdict.dct file and replaced laravel L AE R AH V AH L with laravel L EH R AH V EH L.
