Open xiao1ongbao opened 1 month ago
If you take G_0.pth (the first checkpoint saved during training) and run inference with it, it speaks English in a young female voice that doesn't match the audio clips being trained on. So it seems the training script fine-tunes from that starting point rather than training from scratch.
As for the amount of audio, I have gotten reasonable results with only 5 minutes of audio and 1,000 epochs, using 48 kHz WAV files. Most people use an hour or more, however.
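If you want to check how much audio you actually have before training, here is a rough sketch using only Python's standard library. The function name `total_wav_seconds` and the idea of keeping all clips under one folder are my own assumptions, not something from this repo. Note that the `wave` module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def total_wav_seconds(directory):
    """Sum the duration in seconds of every .wav file under `directory`.

    Hypothetical helper for illustration; not part of the training scripts.
    """
    total = 0.0
    for path in sorted(Path(directory).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            # Duration = number of frames / frames per second
            total += wav_file.getnframes() / wav_file.getframerate()
    return total
```

Calling `total_wav_seconds("path/to/clips") / 60` gives the total in minutes, which you can compare against the 5 minute and 1 hour figures above.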
To train a new voice for English, how many hours of audio do you recommend? Does the training script train from scratch, or does it fine-tune the existing model? Thanks!