lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

Trying to use pretrained model but failed #31

schinavro closed this issue 2 years ago

schinavro commented 2 years ago

Hi, I'm having trouble using the pretrained model and would really appreciate your help.

I wanted to check the performance of DiffWave with the pretrained parameters.

Since there was no demo for this, I wrote my own script that imports the pretrained model.

The purpose of the script is to compare the original audio with audio generated from the pretrained vocoder.

First, I generated a mel spectrogram from one of the audio samples provided at https://github.com/lmnt-com/diffwave#audio-samples.


import torch
import torchaudio.transforms as T

# Audio downloaded from the audio samples at
# https://github.com/lmnt-com/diffwave#audio-samples
# (get_speech_sample, print_stats, plot_spectrogram, and plot_waveform are
# small helper functions from the torchaudio tutorials.)
waveform, sample_rate = get_speech_sample()

# Define the transformation. hop_length is bound to a variable first so it
# can be reused for win_length.
hop_length = 256
spectrogram = T.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=hop_length,
    win_length=hop_length * 4,
    f_min=20.0,
    f_max=sample_rate / 2.0,
    n_mels=80,
)

# Perform the transformation, then convert to dB and rescale into [0, 1]
# the same way diffwave.preprocess does.
spec = spectrogram(waveform)
spec = 20 * torch.log10(torch.clamp(spec, min=1e-5)) - 20
spec = torch.clamp((spec + 100) / 100, 0.0, 1.0)

print_stats(spec)
plot_spectrogram(spec[0], title="torchaudio")
plot_waveform(waveform, sample_rate)
Shape: (1, 80, 833)
Dtype: torch.float32
 - Max:      1.000
 - Min:      0.280
 - Mean:     0.698
 - Std Dev:  0.171

tensor([[[0.5525, 0.5410, 0.5013,  ..., 0.4834, 0.5863, 0.6569],
         [0.5485, 0.5346, 0.4632,  ..., 0.4242, 0.6327, 0.6866],
         [0.4129, 0.5611, 0.5924,  ..., 0.4228, 0.6652, 0.7208],
         ...,
         [0.5441, 0.6529, 0.7050,  ..., 0.5078, 0.5972, 0.6283],
         [0.5814, 0.6205, 0.6569,  ..., 0.5178, 0.6150, 0.6492],
         [0.5728, 0.6037, 0.6395,  ..., 0.4996, 0.6498, 0.6952]]])
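As a quick sanity check (my own addition, not from the repo), the input should be an 80-band mel spectrogram scaled into [0, 1], the same range diffwave.preprocess produces for training data, which the stats above appear to satisfy:

# Hypothetical sanity check: shape [N, n_mels, frames] with values in [0, 1].
assert spec.dim() == 3 and spec.shape[1] == 80
assert spec.min() >= 0.0 and spec.max() <= 1.0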

(Screenshot: mel spectrogram and waveform plots of the original audio)

Using the created spectrogram, spec, I generated an audio file, which should sound similar to the original.

from diffwave.inference import predict as diffwave_predict

# Pretrained model checkpoint, downloaded from
# https://github.com/lmnt-com/diffwave#pretrained-models
model_dir = './diffwave/'
spectrogram = spec # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True, 
                                      device='cpu')
plot_waveform(audio, sample_rate)
play_audio(audio, sample_rate)
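To keep the generated audio around for comparison, it can also be written to disk (a small addition of mine, assuming audio comes back as a [1, num_samples] tensor):

import torchaudio
# Write the generated waveform to disk for side-by-side listening
# against the original sample.
torchaudio.save('generated.wav', audio.cpu(), sample_rate)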

(Screenshot: waveform plot of the generated audio)

However, the result was far from the original: the audio was unstable and sounded nothing like the demo samples.

Is there a problem in my code, or is there a proper way to use the pretrained parameters?

I would really appreciate any example code showing how to use the pretrained model correctly.

Thanks.

jfsantos commented 2 years ago

I had the same issue trying to use the pretrained model. In my case it generated silence only.

sharvil commented 2 years ago

Here are the exact steps I use to generate audio from the pretrained model:

git clone https://github.com/lmnt-com/diffwave
cd diffwave
pip install .
wget https://lmnt.com/assets/diffwave/diffwave-ljspeech-22kHz-1000578.pt
wget https://lmnt.com/assets/diffwave/22kHz/ljspeech/reference_0.wav
python -m diffwave.preprocess .
python -m diffwave.inference diffwave-ljspeech-22kHz-1000578.pt -s reference_0.wav.spec.npy -f

The result is placed in output.wav and should sound reasonable. Can you give these steps a shot and see if it works for you?

The most common cause for silence / unstable audio is incorrect input scaling. I've updated the preprocessing script to behave correctly with newer versions of torchaudio; maybe that was the issue you were running into.
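For anyone doing this in Python rather than via the CLI, here is a minimal sketch of the transform that diffwave.preprocess applies, based on the updated script; power=1.0 and normalized=True differ from torchaudio's MelSpectrogram defaults, so they're easy to miss when rolling your own preprocessing:

import numpy as np
import torch
import torchaudio
import torchaudio.transforms as TT

def wav_to_diffwave_spec(filename):
    # Load and clamp the waveform, as diffwave.preprocess does.
    audio, sr = torchaudio.load(filename)
    audio = torch.clamp(audio[0], -1.0, 1.0)

    mel = TT.MelSpectrogram(
        sample_rate=sr,
        n_fft=1024,
        hop_length=256,
        win_length=256 * 4,
        f_min=20.0,
        f_max=sr / 2.0,
        n_mels=80,
        power=1.0,        # amplitude spectrogram, not power
        normalized=True,  # matches the updated preprocessing script
    )
    with torch.no_grad():
        spec = mel(audio)
        # Convert to dB and rescale into [0, 1].
        spec = 20 * torch.log10(torch.clamp(spec, min=1e-5)) - 20
        spec = torch.clamp((spec + 100) / 100, 0.0, 1.0)
    return spec  # shape [n_mels, frames]; unsqueeze(0) before predict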