Closed: jthickstun closed this issue 2 years ago.
That's a good question. The crop length is a computed parameter so there's no explicit mention of it in the paper. In Section 5.1, Conditioner, the authors state:
We set FFT size to 1024, hop size to 256, and window size to 1024.
Later, in the same section under Training, they also say:
We train DiffWave on 8 Nvidia 2080Ti GPUs using random short audio clips of 16,000 samples from each utterance.
If you take a random crop of 16,000 samples and then compute its STFT with a hop size of 256, you get 16000/256 = 62.5 frames. So you either have to zero-pad the input to 16,128 samples to get 63 frames, or crop it to 15,872 samples to get 62 frames. In either case, you're operating on an audio clip that isn't actually 16,000 samples long.
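The arithmetic above can be sketched in a few lines (the hop size of 256 and the 16,000-sample clip length come from the paper; the pad/crop computations here are just illustrative, not taken from the DiffWave codebase):

```python
# Frame-count arithmetic for a 16,000-sample clip with hop size 256.
hop = 256
clip = 16000

# 16000 / 256 = 62.5 -- not an integer number of frames.
assert clip / hop == 62.5

# Option 1: zero-pad up to the next multiple of the hop size -> 63 frames.
padded = ((clip + hop - 1) // hop) * hop
print(padded, padded // hop)   # 16128 samples, 63 frames

# Option 2: crop down to the previous multiple -> 62 frames.
cropped = (clip // hop) * hop
print(cropped, cropped // hop)  # 15872 samples, 62 frames
```

Note that 62 frames corresponds to 15,872 samples, which matches the crop_mel_frames=62 setting discussed below.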
That makes sense. Thank you!
In params.py we set crop_mel_frames=62, with the comment "Probably an error in the paper." I wasn't able to find any discussion of this parameter in the DiffWave paper (is that the paper the comment refers to?), and I'm curious where it comes from. Could someone clarify where this crop length comes from? Apologies if I have overlooked something obvious.