Closed gevmin94 closed 2 years ago
Is this expected behaviour, or should we get clean audio without strange artifacts from HiFiGAN or ParallelWaveGAN even without fine-tuning?
GAN-based vocoders without noise inputs (e.g., MelGAN, HiFiGAN) tend to produce metallic-sounding noise when they are run on TTS outputs. On the other hand, vocoders with noise inputs (e.g., PWG, StyleMelGAN) can reduce this noise without fine-tuning.
The results in our paper may help you: https://espnet.github.io/icassp2022-tts/ https://arxiv.org/abs/2110.07840
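To make the distinction between vocoders with and without noise inputs concrete, here is a toy sketch of the two input interfaces only. It is not a real vocoder and not ESPnet code: the function names, the hop size of 256, the 80 mel bins, and the arithmetic inside each function are all placeholder assumptions chosen for illustration.

```python
import random

HOP_SIZE = 256  # assumed samples per mel frame (hypothetical value)
N_MELS = 80     # assumed number of mel bins (hypothetical value)


def upsample(mel, hop_size):
    """Repeat each mel frame hop_size times (nearest-neighbour upsampling)."""
    return [frame for frame in mel for _ in range(hop_size)]


def toy_noise_free_vocoder(mel, hop_size=HOP_SIZE):
    """MelGAN/HiFiGAN-style interface: the output depends on the mel input only."""
    cond = upsample(mel, hop_size)
    # Placeholder arithmetic standing in for the real generator network.
    return [sum(frame) / len(frame) for frame in cond]


def toy_noise_input_vocoder(mel, hop_size=HOP_SIZE, rng=None):
    """PWG-style interface: sample-rate Gaussian noise is the input and the
    upsampled mel features act as conditioning."""
    rng = rng or random.Random()
    cond = upsample(mel, hop_size)
    noise = [rng.gauss(0.0, 1.0) for _ in cond]
    # Placeholder arithmetic standing in for the real generator network.
    return [0.1 * z + sum(frame) / len(frame) for z, frame in zip(noise, cond)]


# Four dummy mel frames standing in for TTS-predicted features.
mel = [[random.random() for _ in range(N_MELS)] for _ in range(4)]

a = toy_noise_free_vocoder(mel)
b = toy_noise_free_vocoder(mel)
c = toy_noise_input_vocoder(mel, rng=random.Random(0))
d = toy_noise_input_vocoder(mel, rng=random.Random(1))

print(len(a), len(c))  # both have 4 * HOP_SIZE samples
print(a == b)          # True: same mel in, same waveform out
print(c == d)          # False: different noise draws give different outputs
```

The sketch only shows that a noise-free generator is a deterministic function of the mel input, while a noise-input generator also receives a random signal at the output sample rate. It says nothing about audio quality.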
When we use a pre-trained HiFiGAN vocoder on mel features synthesized by FastSpeech2 or FastPitch, the vocoded audio contains artifacts (noise). After fine-tuning, the audio quality improves significantly. Is this expected behaviour, or should we get clean audio without strange artifacts from HiFiGAN or ParallelWaveGAN even without fine-tuning? (We experimented on different datasets.)