Hi! I have fine-tuned SpeechT5 on one of our Hindi datasets transliterated to English. The pronunciation of words is quite good; however, the synthesized voice sounds a bit mechanical and doesn't match the voices in the training data (a studio-recorded male and female voice dataset). From what I understand, the synthesized speech depends on the speaker embedding passed as an argument to model.generate_speech, and according to the official fine-tuning Colab tutorial, we can pass any speaker embedding.
I would like to match the voice quality of the training dataset. I have trained the model for around 4000 steps with the same training hyperparameters as defined in the official Colab fine-tuning tutorial for Dutch.
Can you suggest ways to get close to the training data voice?
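For context, here is what I'm considering trying: instead of passing an arbitrary speaker embedding, average the x-vectors computed over several utterances of one target speaker from the training set (the tutorial computes per-utterance embeddings with speechbrain's spkrec-xvect-voxceleb and L2-normalizes them). This is a minimal sketch, not a tested solution; the function name and the assumption that per-utterance embeddings are already available are mine:

```python
import torch
import torch.nn.functional as F

def average_speaker_embedding(embeddings: list) -> torch.Tensor:
    """Average per-utterance x-vectors (e.g. 512-dim, as produced by
    speechbrain's spkrec-xvect-voxceleb in the fine-tuning tutorial)
    into a single, more stable speaker embedding for one speaker.

    embeddings: list of 1-D tensors of equal dimension, one per utterance.
    Returns a (1, dim) tensor, L2-normalized like the tutorial's embeddings.
    """
    stacked = torch.stack(embeddings)        # (num_utterances, dim)
    mean = stacked.mean(dim=0)               # (dim,)
    return F.normalize(mean, dim=0).unsqueeze(0)  # (1, dim)

# Hypothetical usage with embeddings from one studio speaker's utterances:
# speaker_embedding = average_speaker_embedding(per_utterance_embeddings)
# speech = model.generate_speech(input_ids, speaker_embedding, vocoder=vocoder)
```

Would averaging over the target speaker like this be a reasonable way to get closer to the training-data voice, or is there a better approach?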