ckobus opened this issue 5 years ago
Hello,
I'm quite new to Kaldi and not qualified to answer, but does this help? https://github.com/kaldi-asr/kaldi/blob/1ff668adbec7987a8b9f91932a786ad8c4b75d86/src/doc/data_prep.dox (search for words.txt), https://white.ucc.asn.au/Kaldi-Notes/fst-example/, and https://kaldi-asr.org/doc/graph_recipe_test.html
I hope it helps at least with the format question.
Thanks for the answer, but what I want to do is add a set of new in-domain words to the vocabulary (I do not want them to be treated as OOV). To do that, I need to generate pronunciations for them, coming either from the CMU dictionary or from a G2P system. The problem is that the CMU dictionary uses one set of phonemes (the ARPAbet, I think), but in the zamia-speech model the set of phonemes found in phones.txt is not the same (IPA?), and I would like to know how to easily map the new pronunciations onto the phoneme labels handled by the pre-trained acoustic model.
Any hint?
Would it work to add the words to the original dict.ipa, use the scripts to generate the new phones.txt and the graph, and use those for rescoring during decoding?
@joazoa If the new word is not in the language model, you have to extend the language model too. An approach is provided by this repo: https://github.com/gooofy/kaldi-adapt-lm
Yes, sorry, I forgot to mention that part; you'd have to run KenLM again as well.
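Something like this minimal sketch should cover that step, assuming lmplz is on the PATH, a trigram model is enough, and corpus.txt already contains sentences with the new in-domain words (kaldi-adapt-lm then automates the rest of the graph building, if I understand the repo correctly):

```python
import subprocess

# Rough sketch: rebuild the ARPA language model with KenLM's lmplz.
# Assumes lmplz is installed and corpus.txt covers the new in-domain words.
with open("corpus.txt") as corpus, open("lm.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "3"], stdin=corpus, stdout=arpa, check=True)
```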
Yes, but with kaldi-adapt-lm it seems you are restricted to the words the model can already recognise (i.e. words that are part of the lexicon); cf. "we also want to limit our language model to the vocabulary the audio model supports, so let's extract the vocabulary next". In my case, I want to use an in-domain language model with a lot of new words that are OOV for the current model. My question is how to generate pronunciations that are compliant with the phoneme set of the model. So far with Kaldi I have worked with ARPAbet pronunciations, which do not match the symbols in the English model. Has anyone already tried to do this?
You can use the speech_lex_edit.py script to add new words to the dictionary. The original dict uses IPA phoneme symbols; for the Kaldi models those get converted to X-SAMPA, AFAIR. You can find translation tables as well as mapping helper functions here:
https://github.com/gooofy/py-nltools/blob/master/nltools/phonetics.py
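For example, something like this (untested sketch; I'm assuming ipa2xsampa takes the written word plus the IPA string, with the word apparently only used for error messages):

```python
# Untested sketch: map an IPA pronunciation onto the X-SAMPA-style symbols
# used by the zamia Kaldi models, using the helper referenced above.
# Assumes py-nltools is installed (pip install py-nltools).
from nltools.phonetics import ipa2xsampa

ipa = u"həˈloʊ"                    # e.g. from a CMUdict conversion or a G2P tool
print(ipa2xsampa("hello", ipa))    # the word argument seems to be used only for error messages
```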
Did you manage to do this? @ckobus
Sorry, I just noticed your message. Yes, I finally succeeded; I had to adapt the script to convert pronunciations from the ARPAbet alphabet to IPA, and then I adapted the Kaldi script prepare_lang.sh to create a new L.fst. In the end, the engine works quite well on my domain. Thanks for the quality of the acoustic models!!
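In case it helps, the ARPAbet-to-IPA step looked roughly like this simplified sketch (from memory; only part of the table is shown, and the stress digits are handled very crudely here):

```python
# Illustrative, partial ARPAbet -> IPA table (a real script covers all phones).
ARPABET2IPA = {
    "AA": u"ɑ", "AE": u"æ", "AH": u"ʌ", "AO": u"ɔ", "AW": u"aʊ",
    "AY": u"aɪ", "CH": u"tʃ", "DH": u"ð", "EH": u"ɛ", "ER": u"ɝ",
    "EY": u"eɪ", "HH": u"h", "IH": u"ɪ", "IY": u"i", "JH": u"dʒ",
    "NG": u"ŋ", "OW": u"oʊ", "OY": u"ɔɪ", "R": u"ɹ", "SH": u"ʃ",
    "TH": u"θ", "UH": u"ʊ", "UW": u"u", "Y": u"j", "ZH": u"ʒ",
    # B D F G K L M N P S T V W Z map onto their lowercase IPA equivalents
}

def cmu_pron_to_ipa(pron):
    """Convert one CMUdict pronunciation like 'HH AH0 L OW1' into an IPA string."""
    out = []
    for phone in pron.split():
        has_stress = phone[-1] in "012"
        base = phone[:-1] if has_stress else phone
        if phone.endswith("1"):
            out.append(u"ˈ")                     # crude: primary stress right before the vowel
        out.append(ARPABET2IPA.get(base, base.lower()))
    return u"".join(out)

print(cmu_pron_to_ipa("HH AH0 L OW1"))           # hʌlˈoʊ (AH0 should really be reduced to ə)
```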
Hi @ckobus, which scripts did you use, and from where, after converting to IPA? Can you please clarify?
@gooofy, @ckobus, @ammyt
I'm pretty confused about the phoneme set as well right now. When I have an IPA result, do I use SAMPA, X-SAMPA, Conlang X-SAMPA (the `"` doesn't really exist in lexicon.txt), X-ARPABET, or any variation of these? :sweat_smile:
Did anyone figure this out?
Hi @fquirin, there is a script in the package that does the conversion automatically (at least for German); I think it was speech_lex_edit. You basically run speech_lex_edit, type the word in German, and it does the conversion for you.
Hi @abdullah-tayeh, thanks for the note :-)
I followed the breadcrumbs and I think they lead to ipa2xsampa, but looking at the translation table it differs in at least one point from the official X-SAMPA standard: it uses a different apostrophe for "primary stress", `'` instead of `"`. I wonder what else is different :thinking:
@fquirin, please check out the tables in https://github.com/gooofy/py-nltools/blob/master/nltools/phonetics.py which should contain all the phonemes used in zamia-speech
Hey @gooofy, yes, that's where I found ipa2xsampa, but when I compared it to the Gruut-IPA SAMPA conversion I realized it's using the wrong apostrophe for "primary stress". So far this is the only difference I've found, but I didn't check all the phonemes.
I'm building a new version of kaldi-adapt-lm and wanted to add an espeak-to-zamia feature (espeak IPA) for new lexicon entries :slightly_smiling_face:. Btw, the 2019 Zamia Kaldi models still rock :sunglasses: :+1:
AFAIR I decided against the concept of "primary stress" vs. "secondary stress" when designing the zamia phoneme set; instead I went with a general "stress" mark which can appear multiple times within one word. The main reason was dealing with German compound words, but also practicality: zamia's phoneme set is geared towards dealing with TTS results, which can contain arbitrary numbers of stress marks depending on the tool used. In fact, I don't recall any TTS engine distinguishing primary and secondary stress.
Thanks for the explanation, @gooofy! I tried to search for info about "AFAIR" before but couldn't find anything ^^. I can't say that I fully understand how to work with "primary stress" and "secondary stress", but according to your explanation I should be safe if I convert IPA to X-SAMPA and then replace the apostrophe? Or, maybe even better, use the normalization given in the file?
```python
IPA_normalization = {
    u':'      : u'ː',
    u'?'      : u'ʔ',
    u'ɾ'      : u'ʁ',
    u'ɡ'      : u'g',
    u'ŋ'      : u'ɳ',
    u' '      : None,
    u'('      : None,
    u')'      : None,
    u'\u02c8' : u'\'',
    u'\u032f' : None,
    u'\u0329' : None,
    u'\u02cc' : None,
    u'\u200d' : None,
    u'\u0279' : None,
    u'\u0361' : None,
}
```
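To make that idea concrete, here is a rough, untested sketch of applying the normalization before the conversion, assuming both the IPA_normalization dict and ipa2xsampa can be imported from nltools.phonetics (ipa2xsampa may well do some of this internally already):

```python
from nltools.phonetics import IPA_normalization, ipa2xsampa

def normalize_ipa(ipa):
    """Drop or substitute characters according to the table above."""
    out = []
    for ch in ipa:
        if ch in IPA_normalization:
            repl = IPA_normalization[ch]
            if repl is not None:
                out.append(repl)
        else:
            out.append(ch)
    return u"".join(out)

word, ipa = u"Beispiel", u"ˈbaɪ̯ʃpiːl"        # IPA as produced by e.g. espeak-ng or wiktionary
norm = normalize_ipa(ipa)                     # stress mark should come out as ' rather than "
print(norm)
print(ipa2xsampa(word, norm))
```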
In my experience, converting from IPA can be difficult, depending on the source. That IPA-normalization table grew when I started extracting IPA from wiktionary and is certainly by no means complete (or correct, for that matter).
Ok, weird, shouldn't there be a clear set of characters and conversion rules for IPA to X-SAMPA? :confused:
I was planning on using espeak-ng IPA (`espeak-ng -v de -x -q --sep=" " --ipa "test"`) as the main source :thinking:
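Concretely, I was thinking of something like this untested sketch, calling espeak-ng via subprocess with exactly the flags above:

```python
import subprocess

def espeak_ipa(word, voice="de"):
    """Ask espeak-ng for an IPA transcription (same flags as the command above)."""
    result = subprocess.run(
        ["espeak-ng", "-v", voice, "-x", "-q", "--sep= ", "--ipa", word],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(espeak_ipa("test"))   # IPA candidate that would still need mapping to the zamia phoneme set
```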
To be honest, I don't entirely understand this IPA normalization table :thinking:. For example these entries: `u'ɾ' : u'ʁ'` and `u'ŋ' : u'ɳ'`. All four of those characters exist in the IPA table and have a different purpose. Why would you convert one into another?
[EDIT]
And I think `u'\u0279' : None,` should actually be `u'\u0279' : u'r',` :thinking:
I am by no means an expert here; maybe you should discuss these questions with someone more proficient in the field of (computational) linguistics.
That said, here is my take: IPA is typically written by humans for humans, to convey some idea of how a written word could be pronounced. I came across dozens of wiktionary IPA entries that looked very sensible to me until I fed them into a TTS system and listened to what that system produced out of them. IPA defines a huge number of phonemes and lots of additional symbols; all of that helps convey pronunciations to humans and support lots of different languages.
Designing a phoneme set for machines to produce mathematical models of human speech is a very different affair: typically you want a small set of phonemes, especially when you start with a relatively small set of samples. The larger your phoneme set, the more phonemes will have very few samples (or none at all) in which they occur, causing instabilities in your model.
But even if you have a large sample base, there is still the question of what good additional phonemes will do to your model: will they really improve recognition performance or the quality of the speech produced? At some point you will also face the question of which phonemes actually exist in nature and which of them you want to model; after all, speech is a natural phenomenon in the analog world which you model using discrete phonemes. In fact, even amongst linguists these questions seem debatable:
https://en.wikipedia.org/wiki/Phoneme#The_non-uniqueness_of_phonemic_solutions
One of my favorite examples in the German language is r vs. ʀ vs. ʁ: which one is used differs by region/dialect, so in this case it comes down to the question of whether you want to model dialects in your pronunciation dictionary. In zamia I definitely decided against that, but of course other designers may decide otherwise for their phoneme set.
Thanks again for the background info. I see now it's not a trivial problem to solve :grin:.
So, back to the drawing board: what's actually the best way to generate new words for the Zamia lexicon.txt files? :man_shrugging: Is there a chance to use espeak (IPA or "normal") and get the correct set of supported phonemes? Or do we need to use the G2P models? Or do we need to implement a manual procedure (generate automatically, check whether the phonemes are OK, adapt by hand)?
NOTE: The reason I would like to use espeak is that I can check the phoneme sequence by actually listening to it (looking at the original speech_lex_edit.py file, I think you had the same intention).
In my experience, if you want high-quality lexicon entries there is no way around checking them manually. In general I would use speech_lex_edit to add new entries to the dictionary (either directly or through speech_editor while reviewing samples). Inside that tool you have options to generate pronunciations via espeak, MaryTTS, and Sequitur G2P. Usually I would listen to all three and pick the best one, sometimes with manual improvements (like fixing stress marks, etc.).
Hi,
I would like to use the pretrained acoustic model for English but use it in combination with a new in-domain language model, for which I have to generate pronunciations.
I am used to the Kaldi toolkit and the CMU dictionary, which uses the ARPAbet. I saw in your repo the script to convert the CMU dictionary to IPA, but when I look at the phones.txt file associated with the acoustic model, I do not recognize the IPA format. For example, which ARPAbet phoneme does tS correspond to?
I hope my question is clear enough.
Thank you for your answer!
CK