jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License
6.91k stars, 1.27k forks

Why use z rather than flow(z_p, spk_emb) for the decoder? #167

Open thanhkm opened 1 year ago

thanhkm commented 1 year ago

Hello, thank you for your great project!

Is there an underlying reason the decoder takes z as input instead of z_p + spk_emb? The alternative scheme would be: post_encoder -> z - spk_emb -> z_p -> z_p + spk_emb -> z' -> wav. Would this make the flow more robust at inference time?
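To make the two schemes concrete, here is a minimal sketch of what I mean. It assumes a single speaker-conditioned affine coupling layer as a stand-in for the flow; `toy_flow`, the scalar `spk_emb`, and the fixed shift/scale functions are all hypothetical and not the repository's API. "z - spk_emb" corresponds to the forward pass and "z_p + spk_emb" to the reverse pass.

```python
import math
import random


def toy_flow(x, g, reverse=False):
    """Toy speaker-conditioned affine coupling layer (hypothetical stand-in).

    x: list of floats with even length.
    g: scalar standing in for the speaker embedding.
    The first half of x passes through unchanged; the second half is
    scaled and shifted by fixed functions of the first half and g.
    """
    half = len(x) // 2
    x0, x1 = x[:half], x[half:]
    # Fixed toy "network": shift and log-scale depend only on x0 and g.
    shift = [math.tanh(a + g) for a in x0]
    log_s = [0.5 * math.sin(a - g) for a in x0]
    if not reverse:
        y1 = [b * math.exp(s) + m for b, s, m in zip(x1, log_s, shift)]
    else:
        y1 = [(b - m) * math.exp(-s) for b, s, m in zip(x1, log_s, shift)]
    return x0 + y1


random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for the posterior encoder output
spk_emb = 0.3                                   # stand-in for the speaker embedding

# Current scheme: the decoder input is z itself.
decoder_in_current = z

# Alternative scheme: z -> z_p -> z', then decode z'.
z_p = toy_flow(z, spk_emb)                       # "z - spk_emb"
z_prime = toy_flow(z_p, spk_emb, reverse=True)   # "z_p + spk_emb"

# Largest element-wise gap between the two decoder inputs.
max_diff = max(abs(a - b) for a, b in zip(decoder_in_current, z_prime))
print(max_diff)
```

In this toy, z' matches z up to floating-point error as long as the same spk_emb is used in both directions, so the two schemes would only feed the decoder something different if the embedding changes between the forward and reverse pass. Whether that carries over to the real flow is part of what I am asking.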

Best regards