kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Loudness norm #291

Closed sciai-ai closed 3 years ago

sciai-ai commented 3 years ago

Hi, I have noticed that the loudness of the synthesized waveform varies for PWG. Is it possible to make sure that the synthesized waveform has the same loudness?

kan-bayashi commented 3 years ago

Maybe this is because random Gaussian noise is used as the input for PWG. If you want to fix the results, please set the random seed.
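To illustrate the point (a minimal sketch, assuming the vocoder's noise input is drawn from the global torch RNG with `torch.randn`, as in PWG's inference): fixing the seed makes the noise, and hence the synthesized waveform, reproducible across runs:

```python
import torch

# With a fixed seed, the Gaussian noise drawn as the vocoder input is
# identical on every run, so repeated syntheses from the same mel
# spectrogram produce the same waveform.
torch.manual_seed(0)
z1 = torch.randn(1, 1, 16000)  # noise for one synthesis

torch.manual_seed(0)
z2 = torch.randn(1, 1, 16000)  # same seed -> identical noise

print(torch.equal(z1, z2))  # True
```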

sciai-ai commented 3 years ago

I saw this line in the synthesis code; does it have any effect?

vocoder.remove_weight_norm()

kan-bayashi commented 3 years ago

I think it is not related.

sciai-ai commented 3 years ago

> Maybe this is because random Gaussian noise is used as the input for PWG. If you want to fix the results, please set the random seed.

Thanks for your quick reply. Will the random seed also fix the dropout in the Tacotron 2 model? Is it possible to fix the state of PWG but not Taco2?

kan-bayashi commented 3 years ago

Set a random seed -> Taco2 -> set a fixed seed -> PWG?

sciai-ai commented 3 years ago

import time
import torch
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import download_pretrained_model
from parallel_wavegan.utils import load_model

# NOTE: `tag` / `vocoder_tag` (model tags), `x` (the input text), and
# `fs` (the sampling rate) are defined elsewhere and omitted here.
d = ModelDownloader()
text2speech = Text2Speech(
    **d.download_and_unpack(tag),
    device="cuda",
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2
    speed_control_alpha=1.0,
)
text2speech.spc2wav = None  # Disable griffin-lim

vocoder = load_model(download_pretrained_model(vocoder_tag)).to("cuda").eval()
vocoder.remove_weight_norm()

with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(x)
    wav = vocoder.inference(c)
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")

Looking at this code, I am not sure in which two places I need to add the random and fixed seeds.

kan-bayashi commented 3 years ago
with torch.no_grad():
    start = time.time()
    # here for taco2
    wav, c, *_ = text2speech(x)
    # here for pwg
    wav = vocoder.inference(c)
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")
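Concretely, those two seed calls might look like the sketch below. `run_taco2` / `run_pwg` are hypothetical stand-ins for `text2speech` / `vocoder.inference` (not the real models); the point is only where the seeding goes, and that the vocoder stage then becomes reproducible:

```python
import random
import torch

def run_taco2(text):
    # Stand-in for text2speech: dropout-like randomness via torch.rand.
    return torch.rand(80, 100)  # pretend mel spectrogram

def run_pwg(mel):
    # Stand-in for vocoder.inference: Gaussian noise input via torch.randn.
    return torch.randn(mel.shape[1] * 256)  # pretend waveform

with torch.no_grad():
    torch.manual_seed(random.randint(0, 2**31 - 1))  # here for taco2: varies each run
    c = run_taco2("hello")
    torch.manual_seed(0)                             # here for pwg: fixed
    wav1 = run_pwg(c)

    torch.manual_seed(0)                             # same fixed seed again
    wav2 = run_pwg(c)

print(torch.equal(wav1, wav2))  # True: vocoder output is reproducible
```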

sciai-ai commented 3 years ago

Thank you @kan-bayashi. I am looking forward to trying some of the new vocoders you introduced as well :) Great job!