microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

Getting TTS output voice close to the training data - Finetuning on different language #57

Open Srija616 opened 1 year ago

Srija616 commented 1 year ago

Hi! I have fine-tuned SpeechT5 on one of our Hindi datasets, transliterated to English. The pronunciation of words is quite good; however, the synthesized voice sounds a bit mechanical and doesn't match the training data (a studio-recorded dataset of male and female voices). From what I understand, the synthesized speech depends on the speaker embeddings passed as an argument to model.generate_speech, and according to the fine-tuning Colab tutorial, we can pass any speaker embeddings.

I would like to match the voice quality of the training dataset. I have trained the model for around 4000 steps with the same training hyperparameters as the official Colab fine-tuning tutorial for Dutch.

Can you suggest ways to get close to the training data voice?
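
One option that may be worth trying is to build the speaker embedding from the training recordings themselves, rather than from an unrelated speaker. The sketch below is only an illustration under assumptions: it presumes one embedding vector per training utterance has already been extracted for a given speaker (for example with the same speaker encoder used during fine-tuning), and the helper names `l2_normalize` and `mean_speaker_embedding` are hypothetical, not part of SpeechT5 or any library. It averages the per-utterance vectors and re-normalizes the result.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_speaker_embedding(
    utterance_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Average per-utterance embeddings of ONE speaker into a single vector.

    Assumption: each inner sequence is an embedding extracted from one
    training utterance of the target speaker, all with the same dimension.
    """
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    dim = len(utterance_embeddings[0])
    if any(len(e) != dim for e in utterance_embeddings):
        raise ValueError("all embeddings must have the same dimension")

    # Normalize each utterance first so loud or long clips do not dominate.
    normalized = [l2_normalize(e) for e in utterance_embeddings]
    # Component-wise mean across utterances.
    mean = [sum(column) / len(normalized) for column in zip(*normalized)]
    # Re-normalize so the result has unit norm like the inputs.
    return l2_normalize(mean)


# Toy 2-dimensional example; real speaker embeddings are much larger.
embedding = mean_speaker_embedding([[3.0, 4.0], [4.0, 3.0]])
print(embedding)
```

The resulting vector would then be converted to a tensor with a leading batch dimension and passed as the speaker embeddings argument to model.generate_speech. For a dataset with both a male and a female voice, computing one averaged embedding per speaker, rather than one for the whole dataset, is probably the safer choice. Whether this alone removes the mechanical quality is not guaranteed.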

kdcyberdude commented 1 year ago

Any update on this?

Naman3007 commented 8 months ago

Hey, can you please share your code file with me? I am working on the same project and would like it for reference.