m-k-S closed this issue 4 years ago
Hi, sorry for the inconvenience. When I trained the model on LJSpeech, I extracted the mel-spectrograms with this code: https://github.com/r9y9/wavenet_vocoder/blob/6be7c72298fccc7f2331ac54af8b6191958e3013/audio.py#L67-L72
So, can you try using lws for STFT instead?
Well, the issue was closed while I was writing, so maybe you found the solution already.
Yes, I was able to follow those methods and reconstruct a spectrogram that seems to work. Sorry to have opened an issue.
Hi, I have my own set of spectrograms that I am trying to generate .wav files from, using your WaveNet implementation and the pretrained LJSpeech weights. The spectrograms were generated using Librosa; I made sure to match the sampling rate, number of mel bands, FFT window length, and hop length to those specified in the pretrained LJSpeech hyperparameter file (22050, 80, 1024, and 256 respectively, as far as I can tell).
The code used to generate those spectrograms is this:
S = librosa.feature.melspectrogram(S=audio, sr=22050, n_mels=80, n_fft=1024, hop_length=256)
I then perform:
However, the audio quality from this is quite poor. Note: I have tried this both with and without the np.interp call in the second-to-last line, and the audio quality is the same in either case.
In particular, I used the Tacotron+WaveNet Colab Notebook provided in the README to generate an audio file (which is of high quality), then used the Librosa method described above to convert it into a spectrogram, and then passed it back through the wavegen process. The output is unintelligible. Is there some documentation of the specific format the wavegen method wants the spectrogram matrix in? Is there additional preprocessing I should be doing?
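For anyone hitting the same mismatch: librosa.feature.melspectrogram returns a power-scaled mel spectrogram by default, while the audio.py function linked in this thread appears to go further, converting a linear-magnitude mel spectrogram to decibels, subtracting a reference level, and normalizing to the range [0, 1]. Below is a minimal NumPy sketch of that post-processing only, not the repository's actual code. The constants MIN_LEVEL_DB and REF_LEVEL_DB, the helper names, and the (frames, n_mels) output layout are all assumptions to check against the hparams file and audio.py for the checkpoint you are using.

```python
import numpy as np

# Assumed values; verify against the hparams shipped with the pretrained model.
MIN_LEVEL_DB = -100.0
REF_LEVEL_DB = 20.0


def amp_to_db(x):
    # Floor the magnitude so log10 never sees zero.
    min_level = 10.0 ** (MIN_LEVEL_DB / 20.0)
    return 20.0 * np.log10(np.maximum(min_level, x))


def normalize(db):
    # Map [MIN_LEVEL_DB, 0] dB onto [0, 1], clipping anything outside.
    return np.clip((db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0, 1.0)


def to_model_features(mel_magnitude):
    """Hypothetical helper.

    mel_magnitude: linear-amplitude (not power) mel spectrogram with
    shape (n_mels, frames), e.g. a mel filterbank applied to the
    magnitude of an STFT computed with lws or librosa.
    """
    db = amp_to_db(mel_magnitude) - REF_LEVEL_DB
    # Assumption: the model consumes features laid out as (frames, n_mels).
    return normalize(db).T.astype(np.float32)
```

If the values you feed to wavegen are not already in [0, 1], or are power-scaled rather than amplitude-scaled, that alone could plausibly explain unintelligible output even when the sample rate, FFT size, and hop length all match.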