litagin02 / Style-Bert-VITS2

Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles.
GNU Affero General Public License v3.0
717 stars, 86 forks

What are the benefits of this model? #57

Closed BankNatchapol closed 8 months ago

BankNatchapol commented 8 months ago

Hello, thank you for your incredible work! I'm curious about how this model compares to other state-of-the-art models such as StyleTTS2 or Tortoise TTS. Unfortunately, I couldn't find the original paper, and most resources are in Japanese/Chinese TT, making it challenging for me to gather information.

litagin02 commented 8 months ago

I haven't thoroughly tested other TTS systems (apart from Bert-VITS2, GPT-SoVITS, and ESPnet VITS), but at least for Japanese, (Style-)Bert-VITS2 produces the most natural results and trains quickly.

Also, because BERT is combined into the model, the output strongly reflects the emotion and meaning of the text: happy sentences are read happily and sad sentences are read sadly. I think this is what is unique about the model architecture.

BankNatchapol commented 8 months ago

Thanks!! I believe many recent models incorporate a similar style + pitch embedding. My guess is that this model might excel in speed and resource efficiency, but I'm not sure 😅. I'll run some tests to verify.