hcy71o / SC-CNN

SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
MIT License

Acoustic feature transferring question #2

Closed WorkingJack closed 1 year ago

WorkingJack commented 1 year ago

Hello, I'm new to this field and still a little confused. I have a few questions about acoustic feature transfer.

  1. As I understand it, this zero-shot TTS transfers the speaker's voice from the reference audio to the synthesized speech. Does it also transfer speech style, such as pitch, intonation, speaking rate, rhythm, volume, or emotional expression?
  2. If this zero-shot TTS does not transfer speech style, do you have any suggestions for style control, or for adding any kind of controllability on top of the VITS model?

I understand these questions are not that relevant to the repo issue, but any kind of help is appreciated, thank you!!

hcy71o commented 1 year ago

  1. Right. All style information is transferred to the generated speech.
  2. There have been many studies on style transfer, including explicit style control. I recommend reading PeriodVITS or PITS (https://arxiv.org/abs/2302.12391) for pitch controllability.

WorkingJack commented 1 year ago

Thanks!!