Open nguyenhungquang opened 2 years ago
Hi @nguyenhungquang , thanks for sharing your insight. I found the same result when I built this repo and compared DiffSinger with DiffGAN-TTS. My conclusion was also that the task on LJSpeech is too easy. In my opinion, adversarial training would help the model generalize with a small number of denoising steps when the dataset contains more expressive and noisy speech.
@keonlee9420 Thank you. I've also trained on my own dataset, which is a bit noisy, and it performs well. Although the mel-spectrogram looks cleaner when I visualise it, the difference is hard to notice when listening. I think the difference might be more visible on a multi-speaker dataset.
Good catch. I think it does make sense.
I realise that when I remove the adversarial loss and the feature-matching loss, the model still works well with no degradation in performance. This makes me question the role of adversarial training in reducing the number of inference steps; perhaps this task is simple enough to learn directly with the denoising model. Here are samples from the two models: https://drive.google.com/drive/folders/1uvURiQkOrP9n1jJsKyNe9NcSO4AfdFID?usp=sharing
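For clarity, the ablation I ran amounts to zeroing the weights on the GAN terms of the generator objective. A minimal sketch of the idea (the function and weight names are illustrative, not the repo's actual code):

```python
# Hypothetical sketch of the ablation: the generator loss combines the
# denoising (diffusion) loss with adversarial and feature-matching terms.
# Setting the lambdas to zero reproduces the "no adversarial training"
# variant, leaving only the pure denoising objective.

def generator_loss(diff_loss, adv_loss, fm_loss,
                   lambda_adv=1.0, lambda_fm=1.0):
    """Total generator loss; zero the lambdas to ablate the GAN terms."""
    return diff_loss + lambda_adv * adv_loss + lambda_fm * fm_loss

# With adversarial training (illustrative loss values):
full = generator_loss(0.5, 0.2, 0.1)
# Ablated, pure denoising objective:
ablated = generator_loss(0.5, 0.2, 0.1, lambda_adv=0.0, lambda_fm=0.0)
```

In my runs the two variants sounded essentially the same, which is why I suspect the GAN terms matter less on this data than the paper suggests.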