Open tweeter0830 opened 7 months ago
:+1: same question here!
I was able to produce some sounds but the quality is... mediocre? How can we improve it?
edit: changing the seed parameters and keeping the target transcript to only 1-2 sentences helped a bit. (longer sentences cause the pitch to change for some reason)
How long is your target transcript? The model is trained on short sentences (average length 5 sec, although the longest training utterance goes to 20 sec), so you might want to finetune it on long utterances if that's your testing scenario
without finetuning, you could try increasing sample_batch_size
and decreasing stop_repetition
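To make the two knobs above concrete, here is a hypothetical decoding config in the style of VoiceCraft's inference scripts. The key names match what those scripts typically accept, but the values are illustrative only, and your version of the repo may use different names, so check the inference script you are running:

```python
# Illustrative decoding config (values are examples, not recommendations).
decode_config = {
    "sample_batch_size": 4,  # generate several candidates in parallel and keep
                             # the shortest one, which tends to have fewer
                             # long silences
    "stop_repetition": 2,    # lower values stop generation sooner when the
                             # same token repeats, cutting dead air
    "top_p": 0.8,            # nucleus sampling threshold
    "temperature": 1.0,
}

print(decode_config["sample_batch_size"], decode_config["stop_repetition"])
```

Increasing `sample_batch_size` trades extra compute for a better chance of a clean sample; decreasing `stop_repetition` makes the stopping criterion stricter.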
In general, the current model is not trained to do TTS - it's trained to do speech editing, but it happens to generalize to TTS. I'm finetuning the model on a TTS objective, and will release that model soon
Thank you! I was using reference audio up to 12 seconds long + target transcript which is about 4 seconds long.
I’ll try using a reference which is about 4 seconds + target of 4 seconds? Does that sound ok?
Also, when doing text to speech, I just concatenate the reference transcript and target transcript together and set prompt_end_frame to -1. Is that the correct thing to do?
all sound good
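The concatenation setup described above can be sketched as follows. The variable names are illustrative, not from the VoiceCraft API; the only assumptions taken from the thread are that the reference and target transcripts are joined into one string and that `prompt_end_frame = -1` uses the whole reference audio as the prompt:

```python
# Words actually spoken in the reference audio (example text).
ref_transcript = "I was able to produce some sounds"
# New words we want the model to synthesize in the same voice.
gen_text = "and now the quality is much better."

# Concatenate so the model continues the reference speaker's voice.
target_transcript = ref_transcript + " " + gen_text

# -1 means the entire reference audio is used as the acoustic prompt.
prompt_end_frame = -1

print(target_transcript)
```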
Sometimes the speaker similarity can be a bit off, as if the model were using a different voice than the prompt.
One thing I found that can improve speaker similarity in those situations is to make sure the prompt is not an entire sentence; if it is instead an unfinished sentence, the model will follow the prompt voice more closely
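A minimal sketch of that "unfinished sentence" trick, assuming you are preparing the prompt transcript yourself (the helper name is hypothetical, and you would need to trim the prompt audio at the corresponding word boundary as well so transcript and audio stay aligned):

```python
# Cut the reference transcript a couple of words before its final
# punctuation, so the prompt does not end on a sentence boundary.
def make_unfinished_prompt(transcript: str, drop_words: int = 2) -> str:
    words = transcript.rstrip(".!? ").split()
    return " ".join(words[: max(1, len(words) - drop_words)])

print(make_unfinished_prompt("The quick brown fox jumps over the lazy dog."))
# -> "The quick brown fox jumps over the"
```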
due to the noisy nature of GigaSpeech, some of the training utterances contain a speaker switch, i.e. two speakers take turns speaking within the same training utterance.
The TTS finetuned 330M model is up, should be better than the 830M one
Thank you for the release of the fine-tuned 330M TTS model. Its performance and efficiency are impressive. Your work is greatly appreciated, and I'm keen to see how it evolves to further support real-time use cases. Are there plans to develop future models with an emphasis on optimizing for real-time TTS applications?
Thanks for the great model!
Do you have any tips when using the model to clone voices for text to speech?
I'm converting the reference wav files to a 16000 Hz sample rate and the same format as the example wav file in the repo.
However, the performance of the model doesn't seem that great. It often can only mimic the general tone and gender of the reference and often has pauses or slurring.
I'm calling it like this:
Am I missing something? Thank you!
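For reference, the 16 kHz preprocessing step mentioned above can be sketched with a simple linear-interpolation resampler. This is just to illustrate the idea; in practice you would use ffmpeg, librosa, or torchaudio, which apply proper anti-aliasing filters:

```python
import numpy as np

# Toy resampler: linearly interpolate the waveform onto a 16 kHz grid.
# Fine for illustration, but lacks the low-pass filtering real tools apply.
def resample_to_16k(audio: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    n_out = int(round(len(audio) * sr_out / sr_in))
    t_out = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(t_out, np.arange(len(audio)), audio)

x = np.zeros(44100)          # one second of silence at 44.1 kHz
y = resample_to_16k(x, 44100)
print(len(y))                # -> 16000
```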