Closed keunwoochoi closed 4 years ago
I guess this is coming from old days of (time domain) speech processing where the sample rate is 8kHz and 50ms are considered to be a good window size. Since 512 is also common, changing the default could indeed make sense...
The current value has been around for a while, and hasn't changed since this would be BC-breaking. I'd be ok changing it, if there is a strong reason to do so. Thoughts?
I see. I have no benchmark data, but thought the backed fft operation could be more efficient with `n_fft=2**N. But I don't have a strong opinion now - after googling I realized non-power-of-two fft could be efficient too :)
Quick run, and I don't see significant differences :)
In [1]: f = "steam-train-whistle-daniel_simon.wav"
In [2]: import torchaudio
In [3]: w, s = torchaudio.load(f)
In [4]: %timeit torchaudio.transforms.Spectrogram(n_fft=400)(w)
10.5 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit torchaudio.transforms.Spectrogram(n_fft=512)(w)
11.1 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I will close this issue for now. Please feel free to re-open if there are more elements you would like to add to the discussion :)
@faroit What would you consider a better window size today? In the recent Tacotron paper, they also used a 50-millisecond frame size; however, the Kaldi spectrogram recommends a 25-millisecond frame size. From my online readings, it sounds like 20 - 30 milliseconds is recommended for a text-to-speech application with a 50% hop length.
@PetrochukM Yes, maybe 32ms (fft = 512) would be a better fit with respect to performance as pointed out by @keunwoochoi
In
MelSpectrogram
,Spectrogram
,GriffinLim
,n_fft
defaults to 400. Is there a reason for not setting it with a power of 2?