keonlee9420 / DiffGAN-TTS

PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
MIT License

Is adversarial training actually necessary? #13

Open nguyenhungquang opened 2 years ago

nguyenhungquang commented 2 years ago

I noticed that when I remove the adversarial loss and the feature-matching loss, the model still works well with no degradation in performance. This makes me question the role of adversarial training in reducing the number of inference steps; perhaps this task is simple enough to learn directly with the denoising model. Here are samples from the two models: https://drive.google.com/drive/folders/1uvURiQkOrP9n1jJsKyNe9NcSO4AfdFID?usp=sharing
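For reference, the ablation described here amounts to dropping the GAN-related terms from the generator objective and training on the diffusion loss alone. A minimal sketch of that idea (the function name and loss weights below are illustrative assumptions, not the repo's actual code):

```python
# Hypothetical sketch of a DiffGAN-style generator objective.
# The weights (lambda_adv, lambda_fm) and the function name are
# assumptions for illustration, not values from this repository.

def generator_loss(diff_loss, adv_loss, fm_loss,
                   use_gan=True, lambda_adv=1.0, lambda_fm=2.0):
    """Combine the diffusion (denoising) loss with optional GAN terms.

    Setting use_gan=False reproduces the ablation discussed in this
    issue: training with the denoising loss only.
    """
    loss = diff_loss
    if use_gan:
        loss += lambda_adv * adv_loss  # adversarial term (fool the discriminator)
        loss += lambda_fm * fm_loss    # feature-matching term
    return loss
```

With `use_gan=False` the adversarial and feature-matching terms drop out entirely, which corresponds to the configuration reported above to work equally well on LJSpeech.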

keonlee9420 commented 2 years ago

Hi @nguyenhungquang, thanks for sharing your insight. I found the same result when I built this repo and compared DiffSinger with DiffGAN-TTS. My conclusion was also that the LJSpeech task is too easy. In my opinion, adversarial training should help the model generalize with a small number of steps when the dataset contains more expressive and noisy speech.

nguyenhungquang commented 2 years ago

@keonlee9420 Thank you. I've also trained on my own dataset, which is a bit noisy, and it performs well. Although the mel-spectrogram looks cleaner when I visualise it, the difference is unlikely to be noticed when listening. I think the difference might be more visible on a multi-speaker dataset.

keonlee9420 commented 2 years ago

Good catch. I think that makes sense.