jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Can anyone explain to me what the boundaries are for? #82

Closed tuannvhust closed 1 year ago

tuannvhust commented 1 year ago

I used the model to train on my custom data, which contains wav files with a sampling rate of 44100 Hz. When I started training, it raised an error. I realized that the boundaries [32,300,400,500,600,700,800,900,1000] in train.py contributed to the error, so I changed the boundaries to [200,300,400,500,600,700,800,900,1000,1100]. The training process then worked fine, but I have some issues:

  1. The synthesized speech loses some information. For example, if I input "I went to school to play basketball with my friend. It was a wonderful afternoon", the output is an audio clip that only says "I went to school to play basketball with my friend".
  2. The quality of the synthesized speech is very bad; there is noise and it is hard to hear clearly.

Can anyone help me with these problems? Thank you in advance.
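
One quick way to see how your data relates to the boundaries is to estimate each wav's spectrogram length in frames (roughly samples divided by the hop length) and check whether it falls inside the boundary range. This is only a diagnostic sketch, not code from the repo; it assumes the `soundfile` package is installed, and the file paths and `hop_length` below are placeholders you would replace with your own filelist and config values:

```python
import soundfile as sf

# Placeholder values: point these at your own filelist and the hop_length from your config.
wav_paths = ["data/wavs/sample_0001.wav", "data/wavs/sample_0002.wav"]
hop_length = 256
boundaries = [200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]

for path in wav_paths:
    info = sf.info(path)
    n_frames = info.frames // hop_length  # approximate spectrogram length in frames
    kept = boundaries[0] < n_frames <= boundaries[-1]
    print(f"{path}: ~{n_frames} frames, kept={kept}")
```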

OnceJune commented 1 year ago

For the use of boundaries, please see this comment: https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/data_utils.py#L300 That makes issue 2 clear: the training might drop too much data because of the lower limit of 200.
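
For reference, here is a minimal sketch of the bucketing idea behind those boundaries (the function and variable names are illustrative, not the exact data_utils.py implementation): samples are grouped into length buckets, and anything whose spectrogram length falls outside the range (boundaries[0], boundaries[-1]] is dropped from training.

```python
import bisect

def assign_buckets(spec_lengths, boundaries):
    """Group sample indices into length buckets defined by `boundaries`.

    A sample of spectrogram length L goes into bucket i when
    boundaries[i] < L <= boundaries[i + 1]; lengths outside
    (boundaries[0], boundaries[-1]] are dropped.
    """
    buckets = [[] for _ in range(len(boundaries) - 1)]
    dropped = []
    for idx, length in enumerate(spec_lengths):
        # bisect_left locates the bucket whose range contains `length`
        i = bisect.bisect_left(boundaries, length) - 1
        if 0 <= i < len(buckets):
            buckets[i].append(idx)
        else:
            dropped.append(idx)
    return buckets, dropped

# Example: with a lower boundary of 200 frames, any utterance whose
# spectrogram is shorter than 200 frames is silently discarded.
buckets, dropped = assign_buckets([150, 250, 950, 1200], [200, 300, 1000, 1100])
print(dropped)  # -> [0, 3]
```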

For issue 1, it might be caused by an EOS predicted by the encoder at the end of your first sentence.

tuannvhust commented 1 year ago

@OnceJune Do you have any idea how to solve issue 1? I think you might be right, but can you clarify a bit more? I don't know what EOS is.