Issue synthesizing longer utterances

roedoejet / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

MIT License

22 stars 7 forks source link

Issue synthesizing longer utterances #7

Closed roedoejet closed 1 year ago

roedoejet commented 1 year ago

Copied from email correspondence with @jzhu709:

I'm having problems with the output the model produces for long sentences. The model seems to be able to synthesise individual words mostly fine, but in sentences with even more than just four words it deteriorates significantly or the sound just cuts off sometimes. Do you know what could be the problem? Maybe some parameter I'm missing needs to be changed?

I recommend the following:

Trim silence in the recordings
When aligning with MFA, make sure that your language's writing system doesn't use any characters that [MFA interprets as punctuation](https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/configuration/global.html#:~:text=trained%20silence%20probabilities-,punctuation,-%E3%80%81%E3%80%82%E0%A5%A4%EF%BC%8C%40%3C%3E%E2%80%9D()%2C.%3A%3B%C2%BF%3F%C2%A1!%5C%26%25%23*~%E3%80%90%E3%80%91%EF%BC%8C%E2%80%A6%E2%80%A5%E3%80%8C%E3%80%8D%E3%80%8E%E3%80%8F%E3%80%9D%E3%80%9F%E2%80%B3%E2%9F%A8%E2%9F%A9%E2%99%AA%E3%83%BB%E2%80%B9%E2%80%BA%C2%AB%C2%BB%EF%BD%9E%E2%80%B2%24%2B%3D) - for example some languages might use ' as a glottal stop. In this case, you need to tell MFA to not process ' as punctuation
When aligning with MFA, pay attention to any warnings that exist and remove any utterances that are poorly aligned. I also recommend doing some sanity checks with the resulting TextGrids, by viewing them in Praat to make sure they look accurate.
Calculate the speaking rate on your data (characters/second or words/second) and remove extreme outliers

Please post here with any updates you have and whether any of the above suggestions help fix your issue with longer utterances.

roedoejet commented 1 year ago

Also, I'm noticing that we don't explicitly mention that you have to potentially write your own raw-text preprocessor for your language in synthesize.py to replace the default preprocess function that exists there. When you synthesize, it should print out what the Raw Text Sequence and Phoneme Sequence are for the text you're synthesizing. If those don't look right, you might have to adjust that function as well which could be contributing to your issue.

jzhu709 commented 1 year ago

Also, I'm noticing that we don't explicitly mention that you have to potentially write your own raw-text preprocessor for your language in synthesize.py to replace the default preprocess function that exists there. When you synthesize, it should print out what the Raw Text Sequence and Phoneme Sequence are for the text you're synthesizing. If those don't look right, you might have to adjust that function as well which could be contributing to your issue.

The raw text and phoneme sequences seem to look right so that shouldn't be a problem.

Not sure if this is the right place to ask, but I just wanted to double check if I should be enabling transformer.spe_features to true in model.yaml and preprocessing.use_spe_features to true as well in preprocess.yaml. In our email conversation you recommended to set them to false, but I read your documentation as well as this reply https://github.com/roedoejet/FastSpeech2/issues/4#issuecomment-1326803315 which suggests that I should be setting them to true since I'm using your g2p library with IPA instead of Arpabet. Would I be correct to set them to true instead?

Also I noticed that my recordings are sampled to 16kHz compared to the config setting of 22050Hz. Could that also have a large impact? (Although MFA uses a default sample frequency of 16000Hz for training)

Thanks for all the help.

roedoejet commented 1 year ago

The raw text and phoneme sequences seem to look right so that shouldn't be a problem.

OK good.

Not sure if this is the right place to ask, but I just wanted to double check if I should be enabling transformer.spe_features to true in model.yaml and preprocessing.use_spe_features to true as well in preprocess.yaml. In our email conversation you recommended to set them to false, but I read your documentation as well as this reply #4 (comment) which suggests that I should be setting them to true since I'm using your g2p library with IPA instead of Arpabet. Would I be correct to set them to true instead?

The phonological features setting changes the inputs from one-hot embeddings of the IPA (or orthographic) characters to multi-hot features. Imagine turning each character into a vector based on something like this (sourced from here:

If you were fine-tuning with another language with a different symbol set this would normalize the input space and make it easier to pre-train/fine-tune. I would just turn it to False for you though since you confirmed the Phoneme sequence is good and you don't plan on pre-training/fine-tuning. Any improvements you would get would likely be minimal.

Also I noticed that my recordings are sampled to 16kHz compared to the config setting of 22050Hz. Could that also have a large impact? (Although MFA uses a default sample frequency of 16000Hz for training)

Yes, you'll need to use 22050 Hz inputs to FastSpeech2 since HiFiGAN expects 22050 Hz inputs as well.

Thanks for all the help.

You're welcome - good luck!

jzhu709 commented 1 year ago

I think I've managed to create a decent quality voice using the system! Unfortunately my output still has some static noise, but I think that might just be due to my recording data not being the highest quality since many recordings themselves had noise/static. If there is a way to fix the noise other than getting better data let me know! (but probably not 😢 )

I'm not really sure how it fixed this issue with long sentences, but here is what I changed (and didn't change):

I didn't trim the silences since my audio data had a large spread of volumes and there wasn't an easy automatic command to remove silences to an appropriate level
Fixed the MFA TextGrid alignments by using the --no_textgrid_cleanup option when training the aligner which stops characters being mistaken as punctuation
Resampled the audio to 22050 Hz
Kept use_energy_predictor set to true (for some reason it produces rubbish for me when using my own language but it does work for LJSpeech/english to a good quality)

Thank you so much for the effort and work you have put into helping me and maintaining this project! (can close the issue now unless you have an idea on fixing the noise?)

roedoejet commented 1 year ago

Good to know. Here are some suggestions:

I didn't trim the silences since my audio data had a large spread of volumes and there wasn't an easy automatic command to remove silences to an appropriate level

Normalize your audio first. Something like sox −−norm=−3 in.wav out.wav. You can also use percentages. Maybe just try removing silence at the beginning and end (sox in.wav out.wav silence 1 0.1 0.1% reverse silence 1 0.1 0.1% reverse) instead before running through MFA.

Kept use_energy_predictor set to true (for some reason it produces rubbish for me when using my own language but it does work for LJSpeech/english to a good quality)

Interesting!

For de-noising, I can recommend RNNoise. There are other models with more sophisticated neural architectures like https://github.com/haoheliu/voicefixer, but I actually still find that RNNoise just works better and more reliably for TTS. Your silence removal and even your alignments might run better after you de-noise this way.

Second, as a long-term plan to reduce some of the artifacts you could fine-tune the universal hifi-gan checkpoint you are using with the outputs from your FastSpeech2 model. This should remove some of the metallic artifacts from the synthesis.

I'll close this issue though - good luck @jzhu709 and congratulations for persevering :)