open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
4.19k stars · 358 forks

[Help]: Symbols.dict in finetuning #226

Open AndreasPatakis opened 1 week ago

AndreasPatakis commented 1 week ago

Hello. I am using the model pretrained on 6k of LibriTTS data, and I am trying to fine-tune it to speak Greek. I use approximately 3.5 hours of single-speaker data, where each audio file is ~10 seconds long.

I fine-tune using the default config (batch size 4, 20 epochs, etc.), and I have changed the espeak language tag to the Greek one ('el').

My main problem is that the model can't come close to producing even one word of Greek; nothing makes sense. The only thing that looks correct is the length of the generated audio, which seems appropriate for a given input prompt.

As I have seen from the code, the collater is used to assign IDs to phones. The original English symbols dict has more phones than the one I produce from my Greek dataset, which means that even common phones are assigned different IDs.

Will this be an issue? Should the phone IDs in my symbols dict match the IDs in the English one the model was trained on?
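To illustrate what I mean, here is a minimal sketch (the dicts below are made up for illustration, not Amphion's actual symbol files): when the Greek dict is built independently, phones shared with the English dict can end up with different IDs, so the pretrained phone embeddings no longer line up with the fine-tuning inputs.

```python
# Hypothetical symbol dicts (illustration only, not Amphion's real files).
english_symbols = {"a": 0, "b": 1, "d": 2, "e": 3, "f": 4}  # pretrained model's mapping
greek_symbols = {"a": 3, "d": 0, "e": 1, "k": 2}            # mapping built from my Greek data

# Phones present in both dicts but assigned different IDs:
shared = set(english_symbols) & set(greek_symbols)
mismatched = {p for p in shared if english_symbols[p] != greek_symbols[p]}

# Here every shared phone is mismatched, so the model would look up
# the wrong pretrained embedding for each one.
print(sorted(mismatched))  # ['a', 'd', 'e']
```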

Thanks!

jiaqili3 commented 1 week ago

Hi @AndreasPatakis, thanks for trying to fine-tune the pretrained model to speak Greek.

Will this be an issue? Should the phone IDs in my symbols dict match the IDs in the English one the model was trained on?

Yes, this could be an issue. Actually, fine-tuning a TTS model to speak another language is quite challenging. To make this work, I think one prerequisite is that both languages share the model's vocabulary (for example, the International Phonetic Alphabet). If the phone mapping in your g2p tool differs from the one used in the pretrained model, the wrong mapping will cause mispronunciation.
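One possible way to keep the mappings consistent (a sketch with hypothetical data, assuming both g2p tools emit the same IPA symbols) is to reuse the pretrained IDs for shared phones and append only the unseen phones at the end:

```python
def align_symbols(pretrained: dict, new_phones) -> dict:
    """Reuse pretrained IDs for shared phones; append unseen phones after them."""
    aligned = dict(pretrained)
    next_id = max(pretrained.values()) + 1
    for phone in new_phones:
        if phone not in aligned:
            aligned[phone] = next_id
            next_id += 1
    return aligned

# Hypothetical example: four pretrained English phones, five Greek phones.
pretrained = {"a": 0, "b": 1, "d": 2, "e": 3}
greek_phones = ["a", "d", "e", "k", "x"]
aligned = align_symbols(pretrained, greek_phones)
print(aligned)  # {'a': 0, 'b': 1, 'd': 2, 'e': 3, 'k': 4, 'x': 5}
```

Note that the embeddings for the appended phones ('k', 'x' above) would still be randomly initialized, so they need enough fine-tuning data to train; only the shared phones benefit from pretraining.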

Even with the problem above solved, I would doubt whether fine-tuning alone is enough to make the model speak another language. I would personally suggest adding Greek data in the pretraining phase, i.e., training another model from scratch. There are models like VITS that don't require such a large amount of data to reach good performance.

Hope this helps!