mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

Problems with French TTS Model #668

Closed lpierron closed 3 years ago

lpierron commented 3 years ago

I'm trying to improve the French Tacotron2 DDC model, because there are some noises that you don't get in the English synthesizer made with Tacotron 2. There are also some pronunciation defects on nasal vowels, most likely because of missing phonemes (ɑ̃, ɛ̃), as in œ̃n ɔ̃ɡl də ma tɑ̃t ɛt ɛ̃kaʁne (Un ongle de ma tante est incarné.)

I started training text2feat from scratch on a French corpus (M-AILABS, speaker ezwa), but after 10k to 15k steps the loss increases drastically. The result is not so bad with your vocoder, but you said you trained the model for 100k steps, ten times as many as I did. How can I get to 100k steps and beyond?

I attach the config file.

Thanks

If I understand correctly:

  • it will transform the text to ASCII -> which seems to be an issue for French, because then there is no difference between e, é and (è, ê or ë)
  • the abbreviations and symbols will be transformed into the English ones rather than the French ones...

Am I misunderstanding something?
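
For illustration, assuming the convert_to_ascii step is backed by unidecode as in the English cleaners, the accents are simply stripped and distinct French words collapse to the same string:

from unidecode import unidecode

# Accents are removed, so words the model should distinguish become identical.
print(unidecode("pêche"), unidecode("péché"))  # -> peche peche
print(unidecode("a"), unidecode("à"))          # -> a a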

I implemented french_cleaners myself:

def french_cleaners(text):
    '''Pipeline for French text. There is no need to expand numbers, phonemizer already does that'''
    text = lowercase(text)
    text = expand_abbreviations(text, lang='fr')
    text = replace_symbols(text, lang='fr')
    text = remove_aux_symbols(text)
    text = collapse_whitespace(text)
    return text

as you can see, the text doesn't go through the convert_to_ascii method. I also added support for French abbreviations here.

There is a big bug in your expand_abbreviations for French. I'm sending you the new one, along with a new symbols.py that includes the missing nasal vowels. abbreviations_symbols.zip

Originally posted by @lpierron in https://github.com/mozilla/TTS/issues/539#issuecomment-783408544
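
For context, here is a hypothetical sketch of a French abbreviation table in the style of the English cleaners (the entries are illustrative and not taken from the attached zip). Anchoring each pattern on a word boundary and a trailing period is what avoids the 'm' -> 'monsieur' bug discussed further down:

import re

# Illustrative French abbreviation table: each regex only matches the
# abbreviation as a whole word followed by a period, so a bare 'm' inside
# other words is never rewritten to 'monsieur'.
_abbreviations_fr = [
    (re.compile(r"\b%s\." % abbr, re.IGNORECASE), expansion)
    for abbr, expansion in [
        ("mme", "madame"),
        ("mlle", "mademoiselle"),
        ("m", "monsieur"),
        ("dr", "docteur"),
        ("st", "saint"),
    ]
]

def expand_abbreviations_fr(text):
    for regex, expansion in _abbreviations_fr:
        text = re.sub(regex, expansion, text)
    return text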

WeberJulian commented 3 years ago

Hi! I trained a model with those French cleaners and my model doesn't struggle to say the sentence you wrote. (Un ongle de ma tante est incarné) Here is the link to the audio sample: https://sndup.net/3rcw

lpierron commented 3 years ago

Hi! I trained a model with those French cleaners and my model doesn't struggle to say the sentence you wrote. (Un ongle de ma tante est incarné) Here is the link to the audio sample: https://sndup.net/3rcw

Yes, it's correct!!! Which release of Mozilla TTS did you use? I had to add the missing phonemes, and now I have to retrain my model because it has 2 more phonemes. Is it possible to have more information about your model?

lpierron commented 3 years ago

Could you try this sentence: "Les députés Républicains sont indépendants."?

My result : https://sndup.net/54gr

Thanks

WeberJulian commented 3 years ago

This sentence sounds good to me as well: https://sndup.net/7ysm
Here is my config file: https://drive.google.com/file/d/1MOOdw1kpQerKU2NjUYijNHSktAH75IqE/view?usp=sharing and the vocoder is WaveGrad, shared on the wiki.

lpierron commented 3 years ago

Oh yes, it's very good. The major bug in french_cleaners (replacing every 'm' in words with 'monsieur') doesn't seem to affect the training, and the two missing nasal vowels don't affect your French synthesizer.

WeberJulian commented 3 years ago

That's indeed a huge problem, I don't even know why I didn't notice it. I may have copied the English abbreviations without extensive testing afterward. At any rate, it proves that the model is resilient to perturbations ^^

lpierron commented 3 years ago

I see you have set the sample rate to 22kHz, but the M-AILABS corpus is 16kHz. Did you interpolate the data to upsample it to 22kHz? I'm training with ezwa from M-AILABS.

WeberJulian commented 3 years ago

I trained using the full M-AILABS dataset (all speakers), and yes, I resampled the dataset to 22kHz.
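
For reference, a minimal resampling sketch (one possible approach, not necessarily the one used here; the fr_FR path is an assumption):

import glob

import librosa
import soundfile as sf

# Upsample every wav in the (hypothetical) fr_FR corpus folder from 16 kHz to 22050 Hz, in place.
for path in glob.glob("fr_FR/**/*.wav", recursive=True):
    wav, sr = librosa.load(path, sr=None)                      # load at the original rate
    wav = librosa.resample(wav, orig_sr=sr, target_sr=22050)   # interpolate to 22050 Hz
    sf.write(path, wav, 22050)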

lpierron commented 3 years ago

I trained using the full M-AILABS dataset (all speakers), and yes, I resampled the dataset to 22kHz.

I'm using your config.json file and I upsampled the wave files from M-AILABS to 22050 Hz, but I cannot train beyond 8k steps; when I was training at 16kHz, the problem occurred at 15k steps.

You can see my tensorboard: http://tts.lpcomprise.eu/#scalars

[TensorBoard screenshot]

And my job stops with an error:

Traceback (most recent call last):
  File "../TTS/TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "../TTS/TTS/bin/train_tacotron.py", line 619, in main
    train_avg_loss_dict, global_step = train(train_loader, model,
  File "../TTS/TTS/bin/train_tacotron.py", line 180, in train
    loss_dict = criterion(postnet_output, decoder_output, mel_input,
  File "/home/lpierron/anaconda2/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", lin
e 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lpierron/Mozilla_TTS/TTS/TTS/tts/layers/losses.py", line 398, in forward
    raise RuntimeError(f" [!] NaN loss with {key}.")
RuntimeError:  [!] NaN loss with decoder_loss.
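
A NaN decoder loss like this is usually caused by a too-high learning rate, mixed-precision instability, or a few corrupt samples in the dataset. The following is a generic training-loop guard, not the repo's actual code, sketching the usual mitigation of clipping gradients and skipping non-finite batches:

import torch

# Generic guard (not taken from train_tacotron.py): skip the optimizer step
# when the loss turns non-finite instead of letting the run crash.
def safe_step(loss, model, optimizer, grad_clip=5.0):
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False                                   # batch skipped
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    optimizer.zero_grad()
    return True
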
lpierron commented 3 years ago

Is it possible to have your French models, or are they private? Just the text2feat part, because I verified that the vocoder part is perfect when you feed it 22kHz mels and pretty good with 16kHz mels.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

ytl3274 commented 2 years ago

I have seen the same problem with my Hakka (a Chinese dialect) dataset constantly. This is my first training run with my own datasets. See the error message below:

Traceback (most recent call last):
  File "TTS/TTS/bin/train_tacotron.py", line 721, in <module>
    main(args)
  File "TTS/TTS/bin/train_tacotron.py", line 623, in main
    scaler_st)
  File "TTS/TTS/bin/train_tacotron.py", line 184, in train
    text_lengths)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/TTS/TTS/tts/layers/losses.py", line 398, in forward
    raise RuntimeError(f" [!] NaN loss with {key}.")
RuntimeError:  [!] NaN loss with decoder_loss.
 ! Run is kept in /home/yuten/work/hakka-ddc-March-27-2022_09+31PM-0000000

Here is my config file.

I have tried several times in Colab and the training always hits the exception around 2k steps. I am new to TTS. Please let me know if my config or datasets have problems. Thanks.

  • Colab: https://colab.research.google.com/drive/1HSZKJkDwwuFv9oLq7Dvti_4QKTn-efpe#scrollTo=etkwAVQUQc2k
  • dataset (zip) + code (py) + work log: https://drive.google.com/drive/folders/1xfLEnriP_NtheaZ9N2Befn8drw4O92z2?usp=sharing
  • train + test + val (csv and text files): https://drive.google.com/drive/folders/15JCkCF5MwkU-KI613tNFoRDdg5DELikQ?usp=sharing
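
One thing worth checking before retraining: corrupt or silent clips and empty transcripts are a frequent cause of NaN losses this early in training. Below is a hypothetical sanity check, assuming an LJSpeech-style wavs/ folder and a pipe-delimited metadata.csv; adjust the paths to your dataset:

import csv
import glob

import numpy as np
import soundfile as sf

# Flag audio files that are empty, silent, or contain non-finite samples.
for path in glob.glob("wavs/*.wav"):
    wav, sr = sf.read(path)
    if len(wav) == 0 or not np.isfinite(wav).all() or np.abs(wav).max() == 0:
        print("suspicious audio:", path)

# Flag metadata rows with a missing or empty transcript.
with open("metadata.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        if len(row) < 2 or not row[1].strip():
            print("suspicious transcript row:", row)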