Closed: roedoejet closed this issue 1 year ago
Also, I'm noticing that we don't explicitly mention that you may have to write your own raw-text preprocessor for your language in synthesize.py, replacing the default preprocess function that exists there. When you synthesize, it should print out what the Raw Text Sequence and Phoneme Sequence are for the text you're synthesizing. If those don't look right, you might have to adjust that function as well, which could be contributing to your issue.
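For reference, such a replacement might look something like the minimal sketch below. Note this is illustrative only: the function name preprocess and its string-in/string-out shape are assumptions based on this thread, not the actual API in synthesize.py, and the character set would need to match your language's orthography.

```python
import re

def preprocess(text: str) -> str:
    """Hypothetical raw-text preprocessor: lowercase the input, drop
    punctuation the g2p mapping does not handle, and collapse
    whitespace. Widen the character class for a non-Latin orthography."""
    text = text.lower()
    # Keep letters, apostrophes, and spaces; replace everything else
    # with a space so word boundaries survive.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("Hello, World!!"))  # → hello world
```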
The raw text and phoneme sequences seem to look right so that shouldn't be a problem.
Not sure if this is the right place to ask, but I just wanted to double-check whether I should be setting transformer.spe_features to true in model.yaml and preprocessing.use_spe_features to true in preprocess.yaml. In our email conversation you recommended setting them to false, but I read your documentation as well as this reply https://github.com/roedoejet/FastSpeech2/issues/4#issuecomment-1326803315, which suggests that I should set them to true since I'm using your g2p library with IPA instead of ARPABET. Would I be correct to set them to true instead?
Also I noticed that my recordings are sampled to 16kHz compared to the config setting of 22050Hz. Could that also have a large impact? (Although MFA uses a default sample frequency of 16000Hz for training)
Thanks for all the help.
> The raw text and phoneme sequences seem to look right so that shouldn't be a problem.
OK good.
> Not sure if this is the right place to ask, but I just wanted to double check if I should be enabling transformer.spe_features to true in model.yaml and preprocessing.use_spe_features to true as well in preprocess.yaml. In our email conversation you recommended to set them to false, but I read your documentation as well as this reply #4 (comment) which suggests that I should be setting them to true since I'm using your g2p library with IPA instead of Arpabet. Would I be correct to set them to true instead?
The phonological features setting changes the inputs from one-hot embeddings of the IPA (or orthographic) characters to multi-hot feature vectors. Imagine turning each character into a vector based on something like this (sourced from here).
If you were fine-tuning with another language that has a different symbol set, this would normalize the input space and make it easier to pre-train/fine-tune. I would just set it to false in your case, though, since you confirmed the Phoneme Sequence is good and you don't plan on pre-training/fine-tuning. Any improvement you would get would likely be minimal.
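To illustrate the difference between the two input representations (the feature names and values below are simplified inventions for the example, not the actual feature set the repo uses):

```python
# One-hot: each symbol is just an index into an embedding table, so
# "p" and "b" are as unrelated as "p" and "a".
symbols = ["p", "b", "m", "a"]
one_hot = {s: [1 if i == j else 0 for j in range(len(symbols))]
           for i, s in enumerate(symbols)}

# Multi-hot phonological features: each symbol becomes a vector of
# feature values, so phonetically similar sounds get similar vectors
# even across different symbol inventories.
#            voiced  nasal  labial  syllabic   (made-up features)
features = {
    "p": [0, 0, 1, 0],
    "b": [1, 0, 1, 0],
    "m": [1, 1, 1, 0],
    "a": [1, 0, 0, 1],
}

print(one_hot["b"])   # → [0, 1, 0, 0]
print(features["b"])  # → [1, 0, 1, 0]  (differs from "p" only in voicing)
```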
> Also I noticed that my recordings are sampled to 16kHz compared to the config setting of 22050Hz. Could that also have a large impact? (Although MFA uses a default sample frequency of 16000Hz for training)
Yes, you'll need to use 22050 Hz inputs to FastSpeech2 since HiFiGAN expects 22050 Hz inputs as well.
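As a rough illustration of what resampling from 16 kHz to 22050 Hz does to the sample count, here is a numpy-only linear-interpolation sketch. This is not how you should resample in practice; a proper band-limited resampler (sox, librosa, or torchaudio) avoids the aliasing that linear interpolation introduces.

```python
import numpy as np

def resample_linear(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampler (illustrative only;
    use sox/librosa/torchaudio for real audio)."""
    duration = len(y) / orig_sr
    n_out = int(round(duration * target_sr))
    t_in = np.arange(len(y)) / orig_sr        # input sample times (s)
    t_out = np.arange(n_out) / target_sr      # output sample times (s)
    return np.interp(t_out, t_in, y)

# One second of 16 kHz audio becomes 22050 samples at 22050 Hz.
y16 = np.random.randn(16000).astype(np.float32)
y22 = resample_linear(y16, 16000, 22050)
print(len(y22))  # → 22050
```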
> Thanks for all the help.
You're welcome - good luck!
I think I've managed to create a decent quality voice using the system! Unfortunately my output still has some static noise, but I think that might just be due to my recording data not being the highest quality since many recordings themselves had noise/static. If there is a way to fix the noise other than getting better data let me know! (but probably not 😢 )
I'm not really sure how it fixed this issue with long sentences, but here is what I changed (and didn't change):
Thank you so much for the effort and work you have put into helping me and maintaining this project! (can close the issue now unless you have an idea on fixing the noise?)
Good to know. Here are some suggestions:
> I didn't trim the silences since my audio data had a large spread of volumes and there wasn't an easy automatic command to remove silences to an appropriate level
Normalize your audio first, with something like sox --norm=-3 in.wav out.wav (you can also use percentages). Maybe just try removing silence at the beginning and end (sox in.wav out.wav silence 1 0.1 0.1% reverse silence 1 0.1 0.1% reverse) instead, before running through MFA.
> Kept use_energy_predictor set to true (for some reason it produces rubbish for me when using my own language, but it does work for LJSpeech/English to good quality)
Interesting!
For de-noising, I can recommend RNNoise. There are other models with more sophisticated neural architectures like https://github.com/haoheliu/voicefixer, but I actually still find that RNNoise just works better and more reliably for TTS. Your silence removal and even your alignments might run better after you de-noise this way.
Second, as a long-term plan to reduce some of the artifacts you could fine-tune the universal hifi-gan checkpoint you are using with the outputs from your FastSpeech2 model. This should remove some of the metallic artifacts from the synthesis.
I'll close this issue though - good luck @jzhu709 and congratulations for persevering :)
Copied from email correspondence with @jzhu709:
I recommend the following:
Please post here with any updates you have and whether any of the above suggestions help fix your issue with longer utterances.