linlinsongyun opened this issue 1 year ago
What dataset do you use in the pretraining stage?
And is the language the same for pretraining and finetuning?
A Mandarin multi-speaker dataset was used for pretraining. A different Chinese speaker was used for finetuning.
I noticed that only the decoder and the speaker embeddings have gradients during finetuning. Shouldn't the decoder weights have no grad except the conditional layer norm?
Do you set num_speaker in the model config equal to the number of speakers in the Mandarin dataset during the pretraining stage?
> I noticed that only the decoder and the speaker embeddings have gradients during finetuning. Shouldn't the decoder weights have no grad except the conditional layer norm?
Only the speaker embedding and the conditional layer norm. I follow the paper.
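For anyone else following along, here is a minimal PyTorch sketch of that freezing scheme. It assumes a FastSpeech2/AdaSpeech-style model object; the attribute names (`speaker_emb`, `decoder`, a module class named `ConditionalLayerNorm`) are hypothetical and should be matched to the actual code in this repo.

```python
import torch
import torch.nn as nn

def freeze_except_speaker_and_cln(model: nn.Module) -> None:
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze the speaker embedding table (attribute name is hypothetical).
    for p in model.speaker_emb.parameters():
        p.requires_grad = True
    # Unfreeze every conditional layer norm inside the decoder
    # (class name "ConditionalLayerNorm" is an assumption).
    for module in model.decoder.modules():
        if module.__class__.__name__ == "ConditionalLayerNorm":
            for p in module.parameters():
                p.requires_grad = True

# During finetuning, pass only the trainable parameters to the optimizer:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```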
> Do you set num_speaker in the model config equal to the number of speakers in the Mandarin dataset during the pretraining stage?
Yes, I use the default config "num_speaker: 955". There are 30 speakers in the pretraining stage, with speaker ids ranging from 1 to 31, and I use speaker_id=50 in the finetuning stage.
You have to change the default config so that "num_speaker" equals 30 (in your case) in the pretraining stage. When finetuning, just set your speaker_id = 0.
> You have to change the default config so that "num_speaker" equals 30 (in your case) in the pretraining stage. When finetuning, just set your speaker_id = 0.
OK, I will give it a try. Thanks a lot.
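As a side note, here is a small, hypothetical PyTorch illustration of why `num_speaker` and `speaker_id` have to stay consistent. It is not the repo's code, just a stand-in `nn.Embedding` for the speaker table; the embedding size is an assumption.

```python
import torch
import torch.nn as nn

num_speaker = 30                     # match the real speaker count used for pretraining
emb_dim = 256                        # hypothetical embedding dimension
speaker_emb = nn.Embedding(num_speaker, emb_dim)

# Valid ids are 0 .. num_speaker - 1, so a 30-speaker corpus should be
# mapped to ids 0..29 before pretraining.
assert speaker_emb.num_embeddings == num_speaker

# At finetuning time the speaker table is trainable, so reusing an
# in-range id (e.g. 0, as suggested above) keeps the lookup valid.
# An id like 50 raises an IndexError here; with the default
# num_speaker=955 it would be in range but would select a row that was
# never updated during pretraining, since only ids 1-31 were seen.
finetune_id = torch.tensor([0])
print(speaker_emb(finetune_id).shape)  # torch.Size([1, 256])
```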
@linlinsongyun did the finetuning improve after you changed the number of speakers?
Thanks for your nice work. The code works well in the pretraining stage. However, when I finetune towards an unseen voice with 10 sentences, the results are bad: the speech quality is poor, and the voice is significantly different from the target. What went wrong?