lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

How is the value of crop_mel_frames chosen? #22

Closed (jthickstun closed this issue 2 years ago)

jthickstun commented 2 years ago

In params.py we set crop_mel_frames=62, with the comment "Probably an error in the paper." I wasn't able to find any discussion of this parameter in the DiffWave paper (is this the paper that the comment refers to?) and I'm curious where it comes from. Could someone clarify where this crop length comes from? Apologies if I have overlooked something obvious.

sharvil commented 2 years ago

That's a good question. The crop length is a computed parameter so there's no explicit mention of it in the paper. In Section 5.1, Conditioner, the authors state:

"We set FFT size to 1024, hop size to 256, and window size to 1024."

Later, in the same section under Training, they also say:

"We train DiffWave on 8 Nvidia 2080Ti GPUs using random short audio clips of 16,000 samples from each utterance."

If you take a random crop of 16000 samples and then compute its STFT with a hop size of 256, you get 16000/256 = 62.5 frames. So you either have to zero-pad the input to 16128 samples to get 63 frames or crop to 15872 samples to get 62 frames. In either case, you're operating on an audio clip that's not actually 16000 samples.
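The arithmetic above can be checked with a short Python sketch. It uses the same simple samples-per-hop count as the reply; the function name `crop_options` and its return layout are made up for illustration and are not part of the DiffWave code. Real STFT implementations may report one frame more or less depending on their padding and centering conventions.

```python
def crop_options(num_samples: int, hop_size: int) -> dict:
    """Return the frame-aligned alternatives to a clip of num_samples samples.

    Hypothetical helper: counts frames as num_samples / hop_size, as in the
    reply above, and ignores any STFT padding or centering.
    """
    exact_frames = num_samples / hop_size
    frames_cropped = num_samples // hop_size      # round down to whole frames
    frames_padded = -(-num_samples // hop_size)   # round up (ceiling division)
    return {
        "exact_frames": exact_frames,
        # (number of frames, number of audio samples that many frames cover)
        "cropped": (frames_cropped, frames_cropped * hop_size),
        "padded": (frames_padded, frames_padded * hop_size),
    }


# Values from the paper: 16,000-sample clips and a hop size of 256.
options = crop_options(16000, 256)
print(options)
```

With these inputs the cropped option is 62 frames covering 15872 samples and the padded option is 63 frames covering 16128 samples, so neither matches 16000 samples exactly.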

jthickstun commented 2 years ago

That makes sense. Thank you!