r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Providing new spectrogram to synthesis.wavegen method #184

Closed. m-k-S closed this issue 4 years ago.

m-k-S commented 4 years ago

Hi, I have my own set of spectrograms from which I am trying to generate .wav files using your WaveNet implementation and the pretrained LJSpeech weights. The spectrograms were generated with Librosa; I made sure to match the sampling rate, number of mel bands, FFT window length, and hop length to those specified in the pretrained LJSpeech hyperparameter file (22050, 80, 1024, and 256 respectively, as far as I can tell).

The code used to generate those spectrograms is:

S = librosa.feature.melspectrogram(y=audio, sr=22050, n_mels=80, n_fft=1024, hop_length=256)

I then perform:

# WaveNet
wn_preset = "./wavenet_vocoder/pretrained/20180510_mixture_lj_checkpoint_step000320000_ema.json"
wn_checkpoint_path = "./wavenet_vocoder/pretrained/20180510_mixture_lj_checkpoint_step000320000_ema.pth"

# Setup WaveNet vocoder hparams
from wavenet_vocoder.hparams import hparams
with open(wn_preset) as f:
    hparams.parse_json(f.read())

# Setup WaveNet vocoder
from wavenet_vocoder.train import build_model
from wavenet_vocoder.synthesis import wavegen
import numpy as np
import torch
from tqdm import tqdm

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = build_model().to(device)

print("Load checkpoint from {}".format(wn_checkpoint_path))
checkpoint = torch.load(wn_checkpoint_path, map_location=torch.device('cpu'))
model.load_state_dict(checkpoint["state_dict"])

c = np.swapaxes(S, 0, 1)          # reshape to time-major form: (T, n_mels)
c = np.interp(c, (0, 4), (0, 1))  # ad-hoc rescale to [0, 1]
waveform = wavegen(model, c=c, fast=True, tqdm=tqdm)
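
For completeness, the generated samples can then be written to a .wav file, for example with soundfile (a sketch; the output filename is arbitrary and 22050 Hz matches the preset above):

import soundfile as sf

# Write the generated samples at the preset's 22050 Hz sample rate.
sf.write("generated.wav", waveform, 22050)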

However, the resulting audio quality is quite poor. Note: I have tried this both with and without the np.interp rescaling, and the audio quality is the same in either case.

In particular, I used the Tacotron+WaveNet Colab notebook provided in the README to generate an audio file (which is of high quality), converted it into a spectrogram with the Librosa call described above, and passed it back through the wavegen process; the output was unintelligible.

Is there some documentation of what specific format the wavegen method wants the spectrogram matrix in? Is there additional preprocessing I should be doing?

r9y9 commented 4 years ago

Hi, sorry for the inconvenience. When I trained the LJSpeech model, I extracted the mel-spectrograms with the code at https://github.com/r9y9/wavenet_vocoder/blob/6be7c72298fccc7f2331ac54af8b6191958e3013/audio.py#L67-L72

So, can you try using lws for STFT instead?
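
Roughly, that extraction amounts to the sketch below (not a verbatim copy of audio.py; the lws_melspectrogram name is made up, min_level_db = -100 and ref_level_db = 20 are assumed from the repository defaults, and any pre-emphasis applied at training time is omitted, so check the linked lines):

import librosa
import lws
import numpy as np

# Assumed values taken from the LJSpeech preset / repository defaults.
fft_size, hop_size = 1024, 256
sr, n_mels = 22050, 80
min_level_db, ref_level_db = -100, 20

def lws_melspectrogram(y):
    # STFT with lws (as in the linked audio.py) rather than librosa's STFT.
    D = lws.lws(fft_size, hop_size, mode="speech").stft(y)  # complex, shape (T, fft_size // 2 + 1)
    # Project onto a mel filterbank and convert amplitude to dB relative to ref_level_db.
    mel_basis = librosa.filters.mel(sr=sr, n_fft=fft_size, n_mels=n_mels)
    S = 20 * np.log10(np.maximum(1e-5, np.dot(mel_basis, np.abs(D).T))) - ref_level_db
    # Normalize to [0, 1]; the pretrained model was conditioned on features in this range.
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)  # shape (n_mels, T)

The resulting (n_mels, T) matrix can then be transposed to time-major form and passed to wavegen, as in the snippet above.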

Well, the issue was closed while I was writing, so maybe you found the solution already.

m-k-S commented 4 years ago

Yes, I was able to follow those methods and reconstruct a spectrogram that seems to work. Sorry to have opened an issue.