gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

Adding new word in the vocabulary #79

Open ckobus opened 4 years ago

ckobus commented 4 years ago

Hi,

I would like to use the pretrained acoustic model for English but use it in combination with a new in-domain language model, for which I have to generate pronunciations.

I am used to the Kaldi toolkit and the CMU dictionary, which uses the ARPA alphabet. I saw in your repo the script to convert the CMU dictionary to the IPA format, but when I look at the phones.txt file associated with the acoustic model, I do not recognize the IPA format. For example, which phoneme in the ARPA alphabet does tS correspond to?

I hope my question is clear enough.

Thank you for your answer!

CK

joazoa commented 4 years ago

Hello,

I'm quite new to Kaldi and not qualified to answer, but does this help? https://github.com/kaldi-asr/kaldi/blob/1ff668adbec7987a8b9f91932a786ad8c4b75d86/src/doc/data_prep.dox (search for words.txt), https://white.ucc.asn.au/Kaldi-Notes/fst-example/ and https://kaldi-asr.org/doc/graph_recipe_test.html

I hope it helps at least with the format question.

ckobus commented 4 years ago

Thanks for the answer, but what I want to do is add a set of new in-domain words to the vocabulary (I do not want them to be considered as OOV). To do that, I need to generate pronunciations for them, coming either from the CMU dictionary or from a G2P system. The problem is that the CMU dict uses one set of phonemes (the ARPA alphabet, I think), while the set of phonemes in the zamia-speech phones.txt file is not the same (IPA-based?). I would like to know how to easily map the new pronunciations onto the phoneme labels handled by the pre-trained acoustic model.
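For anyone hitting the same mapping question: conceptually this is a table lookup from ARPABET phone names to IPA symbols. A minimal, illustrative sketch (the entries below are standard correspondences, but the exact inventory should always be checked against the model's phones.txt — tS there is X-SAMPA for IPA tʃ, i.e. ARPABET CH):

```python
# Partial ARPABET -> IPA mapping sketch; extend as needed.
ARPABET_TO_IPA = {
    'CH': 'tʃ', 'JH': 'dʒ', 'SH': 'ʃ', 'ZH': 'ʒ',
    'TH': 'θ',  'DH': 'ð',  'NG': 'ŋ',
    'AA': 'ɑ',  'IY': 'i',  'UW': 'u', 'EH': 'ɛ',
    'HH': 'h',  'Y': 'j',   'R': 'ɹ',
}

def arpabet_word_to_ipa(phones):
    """Convert a list of ARPABET phones to an IPA string."""
    out = []
    for p in phones:
        base = p.rstrip('012')  # drop ARPABET stress digits (IY1 -> IY)
        # fall back to the lowercased phone for simple 1:1 cases like Z -> z
        out.append(ARPABET_TO_IPA.get(base, base.lower()))
    return ''.join(out)

# "cheese" CH IY1 Z -> tʃiz
print(arpabet_word_to_ipa(['CH', 'IY1', 'Z']))
```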

ckobus commented 4 years ago

Any hint?

joazoa commented 4 years ago

Would it work to add the words to the original dict.ipa, use the scripts to generate the new phones.txt and the graph, and use those for rescoring during decoding?

svenha commented 4 years ago

@joazoa If the new word is not in the language model, you have to extend the language model too. An approach is provided by this repo: https://github.com/gooofy/kaldi-adapt-lm

joazoa commented 4 years ago

Yes, sorry, I forgot to mention that part; you'd also have to run KenLM again.

ckobus commented 4 years ago

Yes, but with kaldi-adapt-lm it seems you are restricted to the words the model is already able to recognise (i.e. words that are part of the lexicon), cf. "we also want to limit our language model to the vocabulary the audio model supports, so let's extract the vocabulary next". In my case, I want to use an in-domain language model with a lot of new words that are OOV for the current model. My question is how to generate pronunciations that are compliant with the phoneme set of the model. So far, with Kaldi, I have worked with pronunciations in ARPABET symbols, which do not match the ones in the English model. Has anyone already tried to do this?
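Before generating pronunciations, it can help to quantify how many in-domain words are actually OOV. A minimal sketch, assuming the standard Kaldi-style lexicon.txt layout (one "word phone phone ..." entry per line; the file names are hypothetical):

```python
# Find which in-domain words are OOV with respect to the model's lexicon.
def load_lexicon_words(path):
    """Read the word column of a Kaldi-style lexicon.txt."""
    with open(path, encoding='utf-8') as f:
        return {line.split()[0] for line in f if line.strip()}

def find_oov(domain_words, lexicon_words):
    """Return the sorted list of words missing from the lexicon."""
    return sorted(w for w in domain_words if w not in lexicon_words)

# Example with in-memory data instead of a real lexicon file:
lexicon = {'hello', 'world'}
print(find_oov(['hello', 'covid', 'world', 'kaldi'], lexicon))
```

Only the words reported by `find_oov` then need new pronunciations and an extended language model.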

gooofy commented 4 years ago

You can use the speech_lex_edit.py script to add new words to the dictionary. The original dict uses IPA phoneme symbols; for the Kaldi models those get converted to X-SAMPA, AFAIR. You can find translation tables as well as mapping helper functions here:

https://github.com/gooofy/py-nltools/blob/master/nltools/phonetics.py
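The tables in that file are plain dictionaries, and the conversion is essentially a longest-match substitution over the IPA string. A simplified, self-contained sketch of the idea (the entries here are illustrative only; the real table in phonetics.py is much larger):

```python
# Simplified IPA -> X-SAMPA substitution, mirroring the table-driven
# approach used in nltools/phonetics.py.
IPA_TO_XSAMPA = {
    'tʃ': 'tS', 'dʒ': 'dZ', 'ʃ': 'S', 'ʒ': 'Z',
    'ŋ': 'N', 'θ': 'T', 'ð': 'D',
    'ˈ': "'",   # note: zamia uses ' for stress, not the official X-SAMPA "
}

def ipa_to_xsampa(ipa):
    out, i = [], 0
    while i < len(ipa):
        # try two-character symbols (affricates) before single characters
        for length in (2, 1):
            chunk = ipa[i:i + length]
            if chunk in IPA_TO_XSAMPA:
                out.append(IPA_TO_XSAMPA[chunk])
                i += length
                break
        else:
            out.append(ipa[i])  # pass unknown characters through unchanged
            i += 1
    return ''.join(out)

# IPA ˈtʃiz -> 'tSiz (zamia-style stress mark)
print(ipa_to_xsampa('ˈtʃiz'))
```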

besimali commented 4 years ago

Did you manage to do this? @ckobus

ckobus commented 4 years ago

Sorry, I just noticed your message. Yes, I finally succeeded: I had to adapt the script to convert pronunciations from the ARPABET alphabet to IPA, and then I adapted the Kaldi script prepare_lang.sh to create a new L.fst. In the end, the engine works quite well on my domain. Thanks for the quality of the acoustic models!!

ammyt commented 4 years ago

Hi @ckobus, which scripts did you use, and from where, after converting to IPA? Can you please clarify?

fquirin commented 3 years ago

@gooofy, @ckobus, @ammyt I'm pretty confused about the phoneme set as well right now. When I have an IPA result, do I use SAMPA, X-SAMPA, Conlang X-SAMPA (the " doesn't really exist in lexicon.txt), X-ARPABET, or some variation of these? :sweat_smile: Did anyone figure this out?

abdullah-tayeh commented 3 years ago

Hi @fquirin, there is a script in the package that does the conversion automatically (at least for German); I think it was speech_lex_edit. You basically use speech_lex_edit, type the word in German, and it does the conversion for you automatically.

fquirin commented 3 years ago

Hi @abdullah-tayeh, thanks for the note :-) I followed the breadcrumbs and I think they lead to ipa2xsampa, but looking at the translation table, it differs in at least one point from the official X-SAMPA standard, using a different apostrophe for "primary stress": ' instead of ". I wonder what else is different :thinking:

gooofy commented 3 years ago

@fquirin, please check out the tables in https://github.com/gooofy/py-nltools/blob/master/nltools/phonetics.py which should contain all the phonemes used in zamia-speech

fquirin commented 3 years ago

hey @gooofy, yes, that's where I found ipa2xsampa, but when I compared it to the Gruut-IPA SAMPA conversion I realized it's using the wrong apostrophe for "primary stress". So far this is the only difference I've found, but I didn't check all the phonemes.

I'm building a new version of kaldi-adapt-lm and wanted to add an espeak-to-zamia feature (espeak IPA) for new lexicon entries :slightly_smiling_face: . Btw, the 2019 Zamia Kaldi models still rock :sunglasses: :+1:

gooofy commented 3 years ago

AFAIR I decided against the concept of "primary stress" vs. "secondary stress" when designing the zamia phoneme set; instead I went with a general "stress" mark, which can appear multiple times within one word. The main reason was dealing with German compound words, but also practicality: zamia's phoneme set is geared towards dealing with TTS results, which can contain arbitrary numbers of stress marks depending on the tool used. In fact, I don't recall any TTS engine distinguishing primary and secondary stress.
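A minimal sketch of that unification, mapping both IPA stress marks to a single generic mark. (This is one reading of the description above; note that the IPA_normalization table shipped in phonetics.py actually drops secondary stress entirely.)

```python
# Collapse IPA primary (ˈ, U+02C8) and secondary (ˌ, U+02CC) stress
# into a single generic stress mark, as in the zamia phoneme set.
def unify_stress(ipa, mark="'"):
    return ipa.replace('\u02c8', mark).replace('\u02cc', mark)

# ˌækəˈdɛmɪk -> 'ækə'dɛmɪk : the mark may now appear multiple times
print(unify_stress('ˌækəˈdɛmɪk'))
```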

fquirin commented 3 years ago

Thanks for the explanation @gooofy! I tried to search for info about "AFAIR" before but couldn't find anything ^^. I can't say that I fully understand how to work with "primary stress" and "secondary stress", but according to your explanation I should be safe if I convert IPA to X-SAMPA and then replace the apostrophe? Or, maybe even better, use the normalization given in the file?

IPA_normalization = {
        u':' : u'ː',
        u'?' : u'ʔ',
        u'ɾ' : u'ʁ',
        u'ɡ' : u'g',
        u'ŋ' : u'ɳ',
        u' ' : None,
        u'(' : None,
        u')' : None,
        u'\u02c8' : u'\'',
        u'\u032f' : None,
        u'\u0329' : None,
        u'\u02cc' : None,
        u'\u200d' : None,
        u'\u0279' : None,
        u'\u0361' : None,
}
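For reference, applying such a table is a simple per-character substitution where None means "delete". A sketch using a trimmed copy of the entries above:

```python
# Apply an IPA normalization table (trimmed copy of the one above);
# a value of None means the character is deleted.
IPA_normalization = {
    ':': 'ː',
    '?': 'ʔ',
    ' ': None,
    '\u02c8': "'",   # primary stress -> zamia stress mark
    '\u02cc': None,  # secondary stress dropped
}

def normalize_ipa(ipa):
    out = []
    for ch in ipa:
        if ch in IPA_normalization:
            rep = IPA_normalization[ch]
            if rep is not None:
                out.append(rep)
        else:
            out.append(ch)
    return ''.join(out)

# ASCII colon becomes the proper IPA length mark: ha:t -> haːt
print(normalize_ipa('ha:t'))
```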

gooofy commented 3 years ago

https://en.wiktionary.org/wiki/AFAIR

gooofy commented 3 years ago

From my experience, converting from IPA can always be difficult, depending on the source. That IPA normalization table grew when I started extracting IPA from Wiktionary and is certainly by no means complete (or correct, for that matter).

fquirin commented 3 years ago

Ok, weird; shouldn't there be a clear set of characters and conversion rules for IPA to X-SAMPA? :confused: I was planning on using espeak-ng IPA (espeak-ng -v de -x -q --sep=" " --ipa "test") as the main source :thinking:

fquirin commented 3 years ago

To be honest, I don't understand this IPA normalization table entirely :thinking: . For example, these characters:

u'ɾ' : u'ʁ',
...
u'ŋ' : u'ɳ',

All four of them exist in the IPA table and have different purposes. Why would you convert one into another?

[EDIT] And I think u'\u0279' : None, should actually be u'\u0279' : u'r', :thinking:

gooofy commented 3 years ago

I am by no means an expert here, maybe you should discuss these questions with someone more proficient in the field of (computer-)linguistics.

That said, here is my take: IPA is typically written by humans, for humans, to convey some idea of how a written word could be pronounced. I came across dozens of Wiktionary IPA entries that looked very sensible to me until I fed them into a TTS system and listened to what that system produced out of them. IPA defines a huge number of phonemes and lots of additional symbols; all of that helps convey pronunciations to humans and supports lots of different languages.

Designing a phoneme set for machines to produce mathematical models of human speech is a very different affair: typically you want a small set of phonemes, especially when you start with a relatively small set of samples. The larger your phoneme set, the more phonemes will occur in very few samples (or none at all), causing instabilities in your model.

But even if you have a large sample base, there is still the question of what good additional phonemes will do for your model: will those additional phonemes really improve recognition performance or the quality of the speech produced? At some point you will also face the question of which phonemes actually exist in nature and which of them you want to model; after all, speech is a natural, analog phenomenon which you model using discrete phonemes. In fact, even amongst linguists these questions seem debatable:

https://en.wikipedia.org/wiki/Phoneme#The_non-uniqueness_of_phonemic_solutions

One of my favorite examples in the German language is r vs. ʀ vs. ʁ: which one is used differs by region/dialect, so in this case it comes down to the question of whether you want to model dialects in your pronunciation dictionary. In zamia, I definitely decided against that, but of course other designers may decide otherwise for their phoneme sets.
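That design decision again reduces to a mapping; a tiny sketch collapsing the German r variants to a single model phoneme (which variant to keep as the canonical one is a design choice, not something specified in the repo):

```python
# Collapse regional German r variants (r, ʀ) to one canonical phoneme (ʁ),
# so the pronunciation dictionary does not encode dialect differences.
R_VARIANTS = {'r': 'ʁ', 'ʀ': 'ʁ'}

def collapse_r(ipa):
    return ''.join(R_VARIANTS.get(ch, ch) for ch in ipa)

# ʀoːt ("rot") -> ʁoːt
print(collapse_r('ʀoːt'))
```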

fquirin commented 3 years ago

Thanks again for the background info. I see now, it's not a trivial problem to solve :grin: .

So, back at the drawing board: what's actually the best way to generate new words for the Zamia lexicon.txt files? :man_shrugging: Is there a chance to use espeak (IPA or "normal") and get the correct set of supported phonemes? Or do we need to use the G2P models? Or do we need to implement a manual procedure (generate automatically, check whether the phonemes are OK, adapt by hand)?

NOTE: The reason why I would like to use espeak is because I can create the phoneme set by actually listening to it (looking at the original 'speech_lex_edit.py' file I think you had the same intention).

gooofy commented 3 years ago

In my experience, if you want high-quality lexicon entries there is no way around checking them manually. In general, I would use speech_lex_edit to add new entries to the dictionary (either directly or through speech_editor while reviewing samples). Inside that tool you have options to generate pronunciations via espeak, MaryTTS and Sequitur G2P. Usually I would listen to all three and pick the best one, sometimes with manual improvements (like fixing stress marks, etc.).