Acoustic feature transfer question

WorkingJack commented 1 year ago

Hello, I'm new to this field and still a little confused. I have a few questions about acoustic feature transfer.

As I understand it, this zero-shot TTS transfers the speaker's voice from the reference audio to the synthesized speech. Does it also transfer speech style, such as pitch, intonation, speaking rate, rhythm, volume, or emotional expression?
If this zero-shot TTS does not transfer speech style, do you have any suggestions for style control, or for adding any kind of controllability based on the VITS model?

I understand these questions are not directly relevant to the repo, but any help is appreciated. Thank you!!

hcy71o commented 1 year ago

Right. All style information is transferred to the generated speech.
There have been many studies on style transfer, including explicit style control. I recommend reading PeriodVITS or PITS (https://arxiv.org/abs/2302.12391) for pitch controllability.

WorkingJack commented 1 year ago

Thanks!!
