jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License
6.48k stars 1.21k forks

Inference result is not as good as the demo #121

Open ali-elkahky opened 1 year ago

ali-elkahky commented 1 year ago

Hi, I have a n00b question. I am using the inference script provided with the pretrained model "pretrained_ljs.pth", and the result has noticeable noise and is not close to the quality of the demo. Is that expected?

NikitaKononov commented 1 year ago

> Hi, I have a n00b question. I am using the inference script provided with the pretrained model "pretrained_ljs.pth", and the result has noticeable noise and is not close to the quality of the demo. Is that expected?

Hi. Try playing with the inference parameters. And of course, the result depends on your text's length and construction.
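For context, the upstream repo's inference notebook samples the model through `net_g.infer`, whose main knobs are `noise_scale`, `noise_scale_w`, and `length_scale` (the notebook defaults are roughly 0.667, 0.8, and 1). A minimal sketch of a sweep over those values; the `synthesize` function below is a hypothetical stand-in for the actual model call, so the snippet runs without the repo installed:

```python
from itertools import product

# Candidate values around the repo's inference.ipynb defaults
# (noise_scale=.667, noise_scale_w=0.8, length_scale=1).
noise_scales = [0.333, 0.667, 1.0]
noise_scale_ws = [0.6, 0.8, 1.0]
length_scales = [0.9, 1.0, 1.1]

def synthesize(text, noise_scale, noise_scale_w, length_scale):
    # Hypothetical stand-in: in the real script this would be
    # net_g.infer(x, x_lengths, noise_scale=noise_scale,
    #             noise_scale_w=noise_scale_w, length_scale=length_scale)
    return f"audio({text!r}, ns={noise_scale}, nsw={noise_scale_w}, ls={length_scale})"

text = "Printing, in the only sense with which we are at present concerned."
grid = list(product(noise_scales, noise_scale_ws, length_scales))
outputs = [synthesize(text, *params) for params in grid]
```

Lower `noise_scale` values tend to give flatter but cleaner prosody, which is one way to chase down audible noise; listening through a small grid like this is usually faster than tuning one parameter at a time.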

ali-elkahky commented 1 year ago

Thanks a lot for the response. Is there an ideal example of text or inference parameters that should produce results similar to the demo?

NikitaKononov commented 1 year ago

> Thanks a lot for the response. Is there an ideal example of text or inference parameters that should produce results similar to the demo?

I can't suggest parameter values, since I'm still experimenting with them myself. Ideal example texts (for LJSpeech) are in the LJSpeech filelists. You can try texts from the test and val files; they are not used in the training process.
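The filelists shipped with the repo are pipe-separated, one `wav_path|text` pair per line, so pulling out candidate sentences is a one-liner per row. A small sketch, with the sample lines below written in that format for illustration (not quoted from the actual files):

```python
def load_filelist_texts(lines):
    """Extract only the text column from 'wav_path|text' filelist lines."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _, text = line.split("|", 1)  # split only on the first pipe
        texts.append(text)
    return texts

# Illustrative lines in the filelist format.
sample = [
    "DUMMY1/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned.",
    "DUMMY1/LJ001-0002.wav|in being comparatively modern.",
]
texts = load_filelist_texts(sample)
```

In practice you would pass `open("filelists/ljs_audio_text_val_filelist.txt")` (path assumed from the repo layout) instead of `sample`, then feed each returned string to the inference script.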

NikitaKononov commented 1 year ago

But LJSpeech is a veeeery boring voice. It's very sad that current SOTA models are tested with LJSpeech... it has no emotion. Even Tacotron sounds good with it.

Damarcreative commented 7 months ago

This is my Jupyter Notebook code; several models have been provided: https://github.com/Damarcreative/anime-tts/blob/main/inference.ipynb