r9y9 / ttslearn

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)
https://r9y9.github.io/ttslearn/
MIT License

Transfer learning #31

Closed: dtfasteas closed this issue 2 years ago

dtfasteas commented 2 years ago

Hello,

I went through all the recipes, looking for a way to do transfer learning similar to what is described here:

transfer learning

without having to rely on similar (parallel) wav/text pairs as input, as in the multi-speaker example.

Thank you very much in advance for your answer.

Best regards

r9y9 commented 2 years ago

Do you mean transfer learning by fine-tuning a model on a small dataset? You can find a fine-tuning example that may be helpful: https://github.com/r9y9/ttslearn/blob/59c6f491ce205cab611e171054af43afcc6ca603/extra_recipes/commonvoice/multispk_tacotron2_pwg_20spks/run.sh#L126-L133.
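The linked recipe does this with ttslearn's own training scripts. The toy below does not use ttslearn at all; it is only a minimal sketch of the general idea, assuming a tiny 1-D linear model in place of a TTS network: fine-tuning means initializing from pretrained weights and then continuing training on the small dataset, typically with a smaller learning rate and a short training budget. The helpers `train` and `mse` and both datasets are hypothetical stand-ins.

```python
def train(w, b, data, lr, steps):
    """Full-batch gradient descent on mean squared error for the toy model y = w * x + b."""
    n = len(data)
    for _ in range(steps):
        gw = sum(2 * ((w * x + b) - y) * x for x, y in data) / n
        gb = sum(2 * ((w * x + b) - y) for x, y in data) / n
        w, b = w - lr * gw, b - lr * gb
    return w, b


def mse(w, b, data):
    """Mean squared error of the toy model on a list of (x, y) pairs."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


# Hypothetical "base corpus": plenty of data (stand-in for a large single-speaker corpus).
base = [(i / 10, 2 * (i / 10) + 1) for i in range(-20, 21)]
# Hypothetical "target data": only a few examples from a slightly different distribution.
target = [(x, 2 * x + 1.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# 1) Pretrain from scratch on the base data.
w_pre, b_pre = train(0.0, 0.0, base, lr=0.1, steps=500)

# 2) Fine-tune: start from the pretrained weights, smaller learning rate, short budget.
w_ft, b_ft = train(w_pre, b_pre, target, lr=0.05, steps=20)

# Baseline: same budget on the target data, but starting from zero-initialized weights.
w_sc, b_sc = train(0.0, 0.0, target, lr=0.05, steps=20)

print(f"target loss, fine-tuned:   {mse(w_ft, b_ft, target):.4f}")
print(f"target loss, from scratch: {mse(w_sc, b_sc, target):.4f}")
print(f"base loss before / after fine-tuning: "
      f"{mse(w_pre, b_pre, base):.4f} / {mse(w_ft, b_ft, base):.4f}")
```

With the same small budget, starting from the pretrained weights fits the target data far better than starting from scratch. The last line also shows the flip side: the fine-tuned weights move toward the target data and away from the base data, so whatever is in the small dataset (including its flaws) shapes the final model.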

dtfasteas commented 2 years ago

I already tried the fine-tuning parameter, but I haven't gotten good results so far.

I used JSUT as the base corpus for testing, then added another 1000 files on top of that.

Before fine-tuning, the TTS model spoke very well and clearly, but after fine-tuning it seemed to have forgotten what it had learned from the base corpus.

For example, 今日はいい天気ですか。 ("Is the weather nice today?") was pronounced very well with just the JSUT base model. But after fine-tuning, it seemed to have forgotten how to pronounce half of the words.

So I was wondering whether you have any other advice on this besides just turning on the fine-tuning parameter.

The additional 1000 files are not of very good quality and contain different emotions. But I expected that, with fine-tuning turned on, it would at least still be able to speak the sentences roughly as well as it did before fine-tuning.

r9y9 commented 2 years ago

I see. I would recommend trying good-quality data first; machine learning models depend heavily on their training data. I would also recommend cleaning your data as much as you can, e.g., removing leading and trailing noise.
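A minimal sketch of one way to remove leading and trailing noise, assuming the noise at the edges is clearly quieter than the speech: split the waveform into frames, compute each frame's RMS energy, and keep only the span between the first and last frame whose energy is within `top_db` decibels of the loudest frame. This is similar in spirit to `librosa.effects.trim`, but the function `trim_edges`, its defaults, and the synthetic test signal are hypothetical, and the threshold would need tuning for a real corpus.

```python
import math


def trim_edges(samples, frame_length=256, top_db=40.0):
    """Drop leading/trailing frames whose RMS is more than top_db below the loudest frame."""
    if not samples:
        return samples
    # RMS energy per frame (the last frame may be shorter than frame_length).
    rms = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    peak = max(rms)
    if peak == 0.0:
        return samples[:0]  # all-zero input: nothing to keep
    threshold = peak * 10 ** (-top_db / 20)
    # The loudest frame always passes, so this list is never empty.
    kept = [i for i, r in enumerate(rms) if r >= threshold]
    first, last = kept[0], kept[-1]
    return samples[first * frame_length:(last + 1) * frame_length]


# Synthetic example: quiet noise, a 440 Hz tone at 16 kHz, then quiet noise again.
noise = [0.001 if n % 2 == 0 else -0.001 for n in range(1000)]
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(2000)]
signal = noise + tone + noise
trimmed = trim_edges(signal, frame_length=250, top_db=40.0)
print(len(signal), len(trimmed))
```

Energy-based trimming only helps when the edges are quieter than the speech; background noise that overlaps the speech itself needs a different cleaning step.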

dtfasteas commented 2 years ago

Thank you very much, I'll try my best.