jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

How many training epochs are required to hear the content of synthetic speech clearly? #128

Closed GoArsenal closed 1 year ago

GoArsenal commented 1 year ago

Number of utterances = 10,000; batch size = 16.
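
For reference, a rough steps-to-epochs conversion for this setup (using the 10,000 utterances and batch size 16 above; this is just illustrative arithmetic, not from the repo):

```python
import math

# Setup from the question: ~10,000 utterances, batch size 16.
num_utterances = 10_000
batch_size = 16

# One epoch = one pass over all utterances.
steps_per_epoch = math.ceil(num_utterances / batch_size)  # 625 steps

# Common global-step milestones expressed in epochs for this dataset size:
for global_step in (100_000, 330_000):
    print(f"{global_step} steps ~= {global_step / steps_per_epoch:.0f} epochs")
# 100000 steps ~= 160 epochs
# 330000 steps ~= 528 epochs
```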

SODAsoo07 commented 1 year ago

With a batch size of 12 and about 42 minutes of utterance samples (silence excluded), I got clear results at around epoch 4, ~330k steps.

LanglyAdrian commented 1 year ago

@SODAsoo07, hi! Did you train on the VCTK dataset? Could you try generating a wav for the word "capital" (any voice)? Was the result good? Does it sound out the whole word?
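
For anyone who wants to run this check on their own checkpoint, here is a minimal sketch along the lines of the repo's inference.ipynb for a VCTK-style (multi-speaker) model. The config path, checkpoint path, and speaker id below are placeholders, not values from this thread:

```python
import torch
from scipy.io.wavfile import write

import commons
import utils
from models import SynthesizerTrn
from text.symbols import symbols
from text import text_to_sequence


def get_text(text, hps):
    # Convert raw text to a tensor of symbol ids, interspersing blanks if configured.
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)


# Placeholder paths: point these at your own config and generator checkpoint.
hps = utils.get_hparams_from_file("./configs/vctk_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).cuda()
_ = net_g.eval()
_ = utils.load_checkpoint("./logs/vctk_base/G_330000.pth", net_g, None)

stn_tst = get_text("capital", hps)
with torch.no_grad():
    x_tst = stn_tst.cuda().unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
    sid = torch.LongTensor([4]).cuda()  # arbitrary speaker id for the test
    audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667,
                        noise_scale_w=0.8, length_scale=1)[0][0, 0].data.cpu().float().numpy()

# Write the result so you can listen for whether the whole word is sounded out.
write("capital.wav", hps.data.sampling_rate, audio)
```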