Closed: bharat-patidar closed this issue 4 years ago
The model is trained on a speaker verification task that happens to transfer well to TTS. Training therefore involves many speakers, and the more speakers there are, the higher the quality of the embeddings, so fine-tuning on a single speaker makes little sense.
Fine-tuning Tacotron on your single-speaker embedding does make sense, but it will be hard to do with my other repo as it currently stands.
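For context, speaker verification with such encoders typically reduces to comparing two utterance embeddings by cosine similarity against a threshold. A minimal sketch of that decision, assuming embeddings are NumPy vectors (the function names and threshold here are illustrative, not the repo's API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(embed_a: np.ndarray, embed_b: np.ndarray,
                 threshold: float = 0.75) -> bool:
    # The threshold is a placeholder; tune it on held-out same/different
    # speaker pairs from your own recordings and devices.
    return cosine_similarity(embed_a, embed_b) >= threshold
```

Testing across recording devices, as you did, is exactly the kind of held-out comparison you would use to pick the threshold.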
Hi Corentin, thanks for providing this amazing repo. I was just going through the speaker diarization script and tried to create an embedding of my voice. I tested it on multiple audio recordings made with different devices and got decent results. Would you suggest retraining or fine-tuning this model for better accuracy? Or should I experiment with other available pretrained models for better results?