Hello, I'm new to this field and still a little confused, so I have a few questions about acoustic feature transfer.
As I understand it, this zero-shot TTS transfers the speaker's voice from the reference audio to the synthesized speech. Does it also transfer the speech style, such as pitch, intonation, speaking rate, rhythm, volume, or emotional expression?
If this zero-shot TTS does not transfer speech style, do you have any suggestions for style control, or for adding any kind of controllability on top of the VITS model?
I understand these questions are not that relevant to the repo issue, but any kind of help is appreciated, thank you!!
Right. All of the style information is transferred to the generated speech.
There have been many studies on style transfer, including explicit style control. For pitch controllability, I recommend reading PeriodVITS or PITS (https://arxiv.org/abs/2302.12391).