kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with PyTorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License
1.56k stars, 341 forks

TTS & Vocoder Fine Tuning #318

Closed (gevmin94 closed 2 years ago)

gevmin94 commented 2 years ago

When we use a pre-trained HiFi-GAN vocoder on mel features synthesized by FastSpeech2 or FastPitch, the vocoded audio contains artifacts (noise). After fine-tuning, the audio quality improves significantly. Is this expected behaviour, or should we get clean audio without strange artifacts from HiFi-GAN or Parallel WaveGAN even without fine-tuning? (We experimented on different datasets.)

kan-bayashi commented 2 years ago

> Is this expected behaviour, or should we get clean audio without strange artifacts from HiFi-GAN or Parallel WaveGAN even without fine-tuning?

GAN-based vocoders without noise inputs (e.g., MelGAN, HiFi-GAN) have this tendency: they produce metallic-sounding noise when given TTS outputs. On the other hand, vocoders with noise inputs (e.g., PWG, StyleMelGAN) can reduce such noise without fine-tuning.

Maybe the results in our paper will help you: https://espnet.github.io/icassp2022-tts/ https://arxiv.org/abs/2110.07840
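
For readers who want to try the fine-tuning discussed in this thread, the data preparation step might be sketched as follows. This is only an illustration, not code from this repository: `align_pair` and `build_finetune_pairs` are hypothetical helpers, the default `hop_size` of 256 is an assumption, and the sketch assumes the TTS-predicted mel features were generated so that they are time-aligned with the recorded waveforms (for example, by using ground-truth durations). It pairs each predicted mel with its recording and trims both so the waveform has exactly `hop_size` samples per mel frame.

```python
# Hypothetical helpers, not part of ParallelWaveGAN.


def align_pair(mel, wav, hop_size=256):
    """Trim a (mel, wav) pair so that len(wav) == len(mel) * hop_size.

    mel: list of frames (each frame is a list of mel-bin values)
    wav: list of audio samples
    hop_size: samples per mel frame (256 is an assumed value)
    """
    if hop_size <= 0:
        raise ValueError("hop_size must be positive")
    # Keep the largest number of whole frames covered by both sequences.
    n_frames = min(len(mel), len(wav) // hop_size)
    return mel[:n_frames], wav[: n_frames * hop_size]


def build_finetune_pairs(predicted_mels, waveforms, hop_size=256):
    """Pair TTS-predicted mels with recorded waveforms by utterance ID."""
    pairs = {}
    for utt_id, mel in predicted_mels.items():
        if utt_id not in waveforms:
            continue  # skip utterances without a reference recording
        mel_t, wav_t = align_pair(mel, waveforms[utt_id], hop_size)
        if mel_t:  # drop pairs that became empty after trimming
            pairs[utt_id] = (mel_t, wav_t)
    return pairs
```

The resulting pairs could then be fed to whatever vocoder training loop is in use, with the pre-trained checkpoint as the starting point; how to do that depends on the toolkit and is not shown here.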