lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

High pitched voices when scaling fft size up to 4096 #11

Closed egaebel closed 3 years ago

egaebel commented 3 years ago

Let me start by saying that this repo is fantastic. I've successfully synthesized voices and would like to experiment with scaling up fft size and other audio parameters.

I'm running with the following:

n_fft: 4096
hop_samples: 256
sample_rate: 32000

I'm able to train and the loss decreases substantially, but the sampled voices sound much higher pitched than when training with n_fft = 1024. It sounds as if the audio is being time-compressed somewhere during training, which shifts the pitch up.

Are there any modifications that need to be made to make this work? For reference I'm training on the ljspeech dataset.

Thank you!

sharvil commented 3 years ago

There's probably a mismatch between the sampling rate of the .wav files and what's specified in params.py. My best guess is that you forgot to run preprocess.py again after changing the parameters. Can you resample the .wav files, re-run the preprocessing script, and try again?

egaebel commented 3 years ago

Excellent catch; I had realized shortly after posting that this was the issue. I changed some of the code to load and resample the wav with librosa (code below), since it appears to be faster than torchaudio's resampler, and now training is chugging along. However, after this change I'm only getting silence in the audio samples generated during training. I wondered whether crop_mel_frames was now too small, since each frame covers less time, but cranking it up to 256 (using almost all my GPU memory in the process) still produces silence.

if self.use_torchaudio:
    # torchaudio.load_wav returns samples at the file's native rate
    signal, signal_sample_rate = torchaudio.load_wav(audio_filename)
else:
    # librosa resamples to self.sample_rate on load
    signal, signal_sample_rate = librosa.core.load(
        audio_filename, sr=self.sample_rate
    )
    # add a channel dimension to match torchaudio's (1, n_samples) shape
    signal = torch.unsqueeze(torch.tensor(signal), 0)

sharvil commented 3 years ago

I'm guessing that's because in your code, signal is a float sequence in the range [-1.0, 1.0] but dataset.py expects the signal to be an int16 sequence in the range [-32768, 32767]. IIRC, librosa rescales the input file whereas torchaudio doesn't.

If that's the issue, try removing the division by 32767.5 in dataset.py.
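The scale mismatch can be sketched numerically. The helper below is hypothetical (not code from this repo) and shows the alternative fix of rescaling librosa's output up to int16 range instead of editing dataset.py; either approach removes the double normalization:

```python
import numpy as np

def rescale_librosa_to_int16_range(signal: np.ndarray) -> np.ndarray:
    """Scale librosa's [-1.0, 1.0] float output back up to the
    int16 range [-32768, 32767] that dataset.py expects."""
    return signal * 32767.5

# Dividing an already-normalized signal by 32767.5 (what dataset.py
# does to torchaudio's int16-range output) leaves amplitudes around
# 3e-5, which is effectively silence.
```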

egaebel commented 3 years ago

That would make a ton of sense. I'm running something else on my GPUs at the moment, but I'll give that a try early next week and report back. Thank you for your help!

egaebel commented 3 years ago

That did it. Training is proceeding really well and the samples are very clear. Thank you!