rhasspy / gruut

A tokenizer, text cleaner, and phonemizer for many human languages.
MIT License

Adding Catalan language #8

Open ccoreilly opened 3 years ago

ccoreilly commented 3 years ago

I would like to contribute by adding support for the Catalan language to gruut (and gruut-ipa / ipa2kaldi), but I am not sure about the g2p model.

I have a phonetisaurus g2p model which outputs CMU phonemes and the corresponding dictionary, would that suffice or should the model output IPA phonemes? I could maybe manually map the CMU phonemes to IPA and retrain the model.

I have also seen you have extracted g2p models from espeak-ng, how could I do so? Or have you converted a lexicon to its IPA phonetic representation with espeak and then trained a g2p model based on that?

synesthesiam commented 3 years ago

Hi @ccoreilly, thanks for offering to volunteer!

When adding a new language, my first step is to add the phonemes to gruut-ipa. These should be IPA, and I usually just use a Wikipedia page.

If you can manually map the CMU phonemes to IPA, that would be great. If you follow the convention here for English, it will be possible for gruut-ipa to convert between the CMU and IPA phonemes automatically.
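As a rough illustration of the manual mapping step, here is a minimal sketch that converts CMU-style phonemes in a lexicon line to IPA. The table is deliberately partial and only covers a few well-known ARPABET symbols; the function names and the choice to put the stress mark directly before the vowel are assumptions for this example, not gruut-ipa's actual convention or API.

```python
import re

# Partial, illustrative ARPABET-to-IPA table. A real mapping must cover every
# phoneme in your model's inventory and be checked against the target language.
CMU_TO_IPA = {
    "AA": "ɑ",
    "IY": "i",
    "UW": "u",
    "EH": "ɛ",
    "K": "k",
    "S": "s",
    "T": "t",
    "SH": "ʃ",
    "NG": "ŋ",
}

# Stress digits used on CMU-style vowels, mapped to IPA stress marks.
STRESS_TO_IPA = {"0": "", "1": "ˈ", "2": "ˌ"}


def cmu_to_ipa(phonemes):
    """Convert a list of CMU phonemes (e.g. ["K", "IY1"]) to IPA strings."""
    result = []
    for phoneme in phonemes:
        match = re.fullmatch(r"([A-Z]+)([012]?)", phoneme)
        if match is None or match.group(1) not in CMU_TO_IPA:
            raise ValueError(f"No IPA mapping for phoneme: {phoneme}")
        base, stress = match.groups()
        # Assumption: the stress mark is placed directly before the vowel.
        result.append(STRESS_TO_IPA.get(stress, "") + CMU_TO_IPA[base])
    return result


def convert_lexicon_line(line):
    """Convert one 'word PH1 PH2 ...' lexicon line to 'word IPA1 IPA2 ...'."""
    word, *phonemes = line.split()
    return " ".join([word] + cmu_to_ipa(phonemes))
```

Raising an error on unmapped symbols, rather than passing them through, makes gaps in the table visible before a model is retrained on the converted lexicon.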

I have also seen you have extracted g2p models from espeak-ng, how could I do so?

I created a small script for this. I start by creating a list of words, usually just the words from my lexicon plus a list of frequent words in the language (I have one for Catalan). Make sure to lower-case and de-duplicate the words. Then I create the espeak-ng lexicon like this:

./espeak_word.sh < words.txt > lexicon.espeak.txt
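The word-list preparation mentioned above (lower-casing and de-duplicating before running the script) could be sketched like this. The function name is hypothetical, and keeping the first-seen order is a choice made for this example, not something the thread specifies.

```python
def prepare_words(lines):
    """Lower-case, strip, and de-duplicate words, keeping first-seen order."""
    seen = set()
    words = []
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and words that have already been emitted.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

Any iterable of lines works as input, for example an open file handle for the combined lexicon and frequent-word lists; writing the result one word per line gives a file in the shape of words.txt.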

After that, converting it to a database is straightforward:

python3 -m gruut.lexicon2db --casing lower --lexicon lexicon.espeak.txt --database espeak/lexicon.db

I train separate g2p models for IPA and espeak-ng phonemes. See below for instructions on that, and let me know if you have any questions :slightly_smiling_face:

G2P

Recent versions of gruut aren't using Phonetisaurus at runtime anymore to reduce the runtime dependencies. I'm hoping to add support for reading the g2p FSTs in pure Python, but for now I'm using a different framework.

Training still needs Phonetisaurus, however, for the initial alignment of the corpus. If you're using my phonetisaurus Python package, you can get this when you train a model:

phonetisaurus train --corpus g2p.corpus --model g2p.fst lexicon.txt

The g2p.corpus file contains the alignments for all words in the lexicon. You use this to train a model in my new framework like this:

python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf
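To sanity-check the alignments before training, a small parser like the one below may help. It assumes the corpus uses what I understand to be the Phonetisaurus aligner's default notation: space-separated `graphemes}phonemes` pairs, `|` joining multiple symbols on one side, and `_` marking a skipped (empty) side. Those separators and both function names are assumptions; check a few lines of your own g2p.corpus first.

```python
def parse_alignment_line(line, pair_sep="}", seq_sep="|", skip="_"):
    """Split one aligned-corpus line into (graphemes, phonemes) pairs."""
    pairs = []
    for token in line.split():
        graphemes, phonemes = token.split(pair_sep, 1)
        pairs.append((
            # Drop the skip marker so an empty side becomes an empty list.
            [g for g in graphemes.split(seq_sep) if g != skip],
            [p for p in phonemes.split(seq_sep) if p != skip],
        ))
    return pairs


def word_and_pronunciation(pairs):
    """Rebuild the word and its flat phoneme list from aligned pairs."""
    word = "".join(g for graphemes, _ in pairs for g in graphemes)
    phonemes = [p for _, phoneme_group in pairs for p in phoneme_group]
    return word, phonemes
```

Rebuilding each word and pronunciation this way and comparing them with the original lexicon is a quick check that the separators were guessed correctly.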

ccoreilly commented 3 years ago

Thanks for the thorough response Michael! I have been a bit busy lately but will make time to contribute.

mlrober commented 3 years ago

Hi Michael,

I'm trying to add a new language and created model.fst and model.corpus with Phonetisaurus. However, when I try to run the command below to get "model.crf":

python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

I get this error:

zsh: killed python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

That's all it prints. Any ideas or troubleshooting steps to get rid of this, or is there another way to get model.crf?

synesthesiam commented 3 years ago

How big is your pronunciation dictionary? Is it eating up all of your memory?

mlrober commented 2 years ago

Thanks for the reply. The corpus file is 23M in size. Is it too big to train on? What would be the ideal size?
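If memory turns out to be the bottleneck, one option is to train on a random subset of the aligned corpus. This is only a sketch under that assumption: the thread does not say what corpus size is ideal, and the function name and the seeded sampling are choices made for this example.

```python
import random


def subsample_lines(lines, max_lines, seed=0):
    """Return at most max_lines lines, chosen at random but in original order."""
    lines = list(lines)
    if len(lines) <= max_lines:
        return lines
    # A fixed seed makes the same subset come out on every run.
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(lines)), max_lines))
    return [lines[i] for i in keep]
```

Halving max_lines until training no longer gets killed gives a rough idea of how much of the corpus fits in memory on a given machine.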

ccoreilly commented 2 years ago

@mlrober are you working on Catalan or another language? (I haven't had the time, so it'd be great if your questions were specific to the Catalan language :)

mlrober commented 2 years ago

Hi Michael,

I'm working on another language; however, I put my comment on the Catalan language issue. I reduced the file size and training finished. However, the "loss:" parameter contains a high number. I have around 178,000 words, and it is howling all the nos in loss. Is that intended?

I appreciate your response.



synesthesiam commented 2 years ago

I guess we can consider this thread as "adding a new language" more generally :slightly_smiling_face:

@mlrober, can you clarify what "howling all the nos in loss" means? Sorry, I can't quite interpret it :confused:

mlrober commented 2 years ago

Hi Michael,

Sure. I was saying that after model training completed, I got results stating that the scores variable is empty and the loss variable contains the number of all the words. I'm a bit confused: is the model trained properly or not?

Also, what steps do we need to follow to train a Glow-TTS model, and how many hours of data are required? Sorry if this is off topic. Kindly let me know.

Thanks,

