Open JiachuanDENG opened 1 year ago
May I ask why we need to fine-tune on Tacotron output? Given that we can get the ground-truth mel-spectrogram from the original waveform audio, why bother trying to learn to act like Tacotron? Can anyone give me an intuitive explanation?

@JiachuanDENG See section 4.4 of their paper. When they ran the original HiFi-GAN model (not fine-tuned) on the output of Tacotron2, the quality was good but not good enough. When they looked at the errors, they concluded that most of them came from Tacotron2, not the vocoder. So the idea of fine-tuning on the output of the front-end is that the vocoder learns to correct the front-end's errors. If you train only on the ground truth, it may not be able to correct them. Of course, if you intend to synthesize with a different front-end, you should fine-tune on the output of that front-end, not Tacotron.

Personally, I would have liked to see an experiment where they fine-tuned on the ground truth of the target speaker, as you suggested, and compared the result against the experiment they ran. But I trust that their conclusion is correct. I'm going to run my own experiments this week and see what happens (using FS2, not Tacotron).
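For what it's worth, the data-prep step boils down to something like the sketch below: pair teacher-forced Tacotron2 mel predictions with the real waveforms and fine-tune the vocoder on those pairs instead of on ground-truth mels. The names `teacher_forced_mel`, `mel_from_wav`, and the `(text_ids, wav)` dataset format are placeholders of mine, not the actual HiFi-GAN repo API.

```python
import torch

@torch.no_grad()
def make_finetune_pairs(dataset, tacotron2, mel_from_wav):
    """Build (predicted mel, real waveform) pairs for vocoder fine-tuning."""
    pairs = []
    for text_ids, wav in dataset:
        # Ground-truth mel from the real audio (what you would NOT fine-tune on here).
        gt_mel = mel_from_wav(wav)
        # Teacher-forced prediction: condition the decoder on the ground-truth
        # frames so the predicted mel stays time-aligned with the real audio.
        pred_mel = tacotron2.teacher_forced_mel(text_ids, gt_mel)
        # Vocoder input = front-end output, target = real waveform, so the
        # vocoder learns to compensate for the front-end's errors.
        pairs.append((pred_mel, wav))
    return pairs
```

The teacher forcing is the important part: it keeps the predicted mels frame-aligned with the ground-truth audio, so the vocoder sees exactly the kind of (slightly wrong) spectrograms it will get at synthesis time while still being trained against clean targets.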