lowerquality / gentle

gentle forced aligner
https://lowerquality.com/gentle/
MIT License

External dictionary for OOV (out of vocabulary) words #158

Open · rafaelvalle opened this issue 6 years ago

rafaelvalle commented 6 years ago

Is it possible to use an external dictionary, e.g. CMU's ARPABET dictionary, with Gentle? It would be useful to be able to add OOV words...

shivam-mangla commented 4 years ago

Hey @rafaelvalle, did you find a solution for this?

iskunk commented 4 years ago

Documentation is needed for how the current language model is built. I've made some progress scripting parts of the process (especially since the CMU dictionary is quite a bit larger than the one currently bundled with Gentle), but I am not familiar enough with the Kaldi tools or the LM build process in general to suss out everything.

strob commented 4 years ago

Hey @iskunk - this blog post is very helpful.

iskunk commented 4 years ago

Thanks @strob, I will look through it and see what scripting I can put together. (I assume these are largely the steps you followed to build Gentle's existing language model.)

There was one question that left me scratching my head, however: why is the current align_lexicon.txt so much smaller than the CMU dictionary? For example, the former lacks the word "ahoy." The first thing I was thinking of doing was converting the CMU dictionary, but then I figured there must be a reason why you haven't done so already.

(Also: Is it feasible to support pronunciation variants, as the CMU dictionary has?)
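For what it's worth, the conversion itself looks straightforward. Below is a rough, untested sketch of what I had in mind: it assumes a cmudict-style input ("WORD  PH1 PH2 ..."), treats "(2)"-style variant markers as extra lines for the same word, and guesses that align_lexicon.txt wants the word repeated twice per line. Whether the word should be lowercased, and whether the phones need word-position suffixes (_B/_I/_E/_S), depends on the bundled model, so the output would have to be checked against the existing file.

```python
#!/usr/bin/env python3
"""Sketch: convert a cmudict-style file into align_lexicon.txt-style lines.

Assumptions (not verified against Gentle's actual build): the target format
repeats the word twice per line ("word word PH1 PH2 ..."), pronunciation
variants like "WORD(2)" simply become additional lines for the same word,
and no word-position suffixes are required.
"""
import re
import sys

def convert(cmudict_path, out_path):
    variant_re = re.compile(r"\(\d+\)$")  # strips "(2)", "(3)", ...
    with open(cmudict_path, encoding="latin-1") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line or line.startswith(";;;"):  # cmudict comment lines
                continue
            word, phones = line.split(None, 1)
            word = variant_re.sub("", word).lower()
            fout.write("{} {} {}\n".format(word, word, phones))

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```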

iskunk commented 4 years ago

All right, I've dug through this a bit. I see where Gentle's dictionary comes from (under data/lang_chain/phones/ in 0001_aspire_chain_model.tar.gz), and that it is probably limited to words that appeared in the original corpus used to train this model.

FYI, the workflow I'm looking to support here is adding a custom list of new/OOV words with hand-crafted ARPABET-format pronunciations. That's what I want, and I think what @rafaelvalle and others would want. (I'm specifically avoiding a G2P step, partly because figuring out phonetic transcriptions for a small set of words is relatively easy, and partly because the G2P tools look like a PITA to deal with.)
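To make that concrete, the kind of list I mean is just a handful of cmudict-style lines, e.g. (the entries below are made up for illustration):

```
AHOY   AH0 HH OY1
KALDI  K AA1 L D IY0
```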

The instructions at that blog, however, appear oriented towards building a language model from a proper corpus; i.e. processing reams of prose text (complete sentences) to build up a language grammar, with n-grams of up to size 3. All I have is a small list of words, taken out of context!

When we're doing forced alignment, isn't the language grammar effectively unused? My understanding is that, in "normal" speech-to-text, when an utterance could match several candidate words, the grammar is used to determine which one is most likely. ("Credit card" is more likely correct than "credit cord," in most cases.) But in alignment, you already know the specific word that should be matched to the utterance; you're just fitting the utterance to the known acoustic features of that word.

So my first thought is, shouldn't Gentle be using a form of the language model that leaves out the grammar? According to the diagram on that page, it should only need the acoustic model and the word lexicon.

My second thought, on a possible workaround: would it work to build a "degenerate" version of the language model, one that has only 1-grams, all with the same probability? If the grammar is not used, this should not hurt anything. But then, I'm not sure whether Gentle relies on the grammar in some way.
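If that idea holds, the flat unigram grammar itself should be easy to produce. Here is a sketch of how I'd generate it in ARPA format from a plain word list; whether Kaldi's graph-building steps (and Gentle) would then accept it as-is is exactly the part I don't know.

```python
#!/usr/bin/env python3
"""Sketch: emit a flat unigram LM in ARPA format, one entry per word,
all with the same probability."""
import math
import sys

def write_flat_arpa(words, out_path):
    # Include the sentence-boundary/unknown symbols ARPA files normally carry.
    vocab = ["<s>", "</s>", "<unk>"] + sorted(set(words))
    logprob = math.log10(1.0 / len(vocab))  # uniform log10 probability
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\\data\\\n")
        f.write("ngram 1={}\n\n".format(len(vocab)))
        f.write("\\1-grams:\n")
        for w in vocab:
            f.write("{:.6f}\t{}\n".format(logprob, w))
        f.write("\n\\end\\\n")

if __name__ == "__main__":
    # First whitespace-separated field of each input line is taken as the word.
    words = [line.split()[0] for line in open(sys.argv[1]) if line.strip()]
    write_flat_arpa(words, sys.argv[2])
```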

@strob, what are your thoughts? I haven't worked with Kaldi a great deal, and could use some guidance.

tscizzle commented 2 years ago

What is gentle's dictionary / vocabulary / lexicon?

I ran install_language_model.sh (as sudo, so sudo ./install_language_model.sh), and it generated an exp/ directory. Note this is not the same as ext/.

Inside exp/, there is langdir/phones/align_lexicon.txt, which appears to contain all the words gentle knows, each mapped to the phones it consists of. (I haven't confirmed this is what gentle uses, but it seems likely.) langdir/words.txt and langdir/phones.txt may also be helpful for you.
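Side note, for anyone else poking at this: here is a quick check I put together to see which words of a transcript are missing from that file. It assumes the first column of align_lexicon.txt is the word and that gentle lowercases transcript words before lookup; I have not verified either.

```python
#!/usr/bin/env python3
"""Quick check: which words in a transcript are absent from align_lexicon.txt?"""
import re
import sys

def oov_words(lexicon_path, transcript_path):
    with open(lexicon_path, encoding="utf-8") as f:
        known = {line.split()[0] for line in f if line.strip()}
    text = open(transcript_path, encoding="utf-8").read().lower()
    words = set(re.findall(r"[a-z']+", text))
    return sorted(words - known)

if __name__ == "__main__":
    # e.g. python check_oov.py exp/langdir/phones/align_lexicon.txt transcript.txt
    for w in oov_words(sys.argv[1], sys.argv[2]):
        print(w)
```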

The reason this was hard for me to find is that I did not need to run install_language_model.sh to use gentle, so I never even saw it. I only ended up needing it to locate these files / documentation.

Can we extend these for our use with gentle?

The fact that I didn't seem to need install_language_model.sh to use gentle leads me to believe that editing any of these files in exp/ would do nothing. So I'm still not sure whether we can extend the vocabulary (without diving deep into the source code and/or working with Kaldi directly).