pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Why n_fft=400 by default in Transforms? #384

Closed keunwoochoi closed 4 years ago

keunwoochoi commented 4 years ago

In `MelSpectrogram`, `Spectrogram`, and `GriffinLim`, `n_fft` defaults to 400. Is there a reason for not setting it to a power of 2?

faroit commented 4 years ago

I guess this comes from the old days of (time-domain) speech processing, where the sample rate was 8 kHz and 50 ms was considered a good window size (400 samples). Since 512 is also common, changing the default could indeed make sense...
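The arithmetic behind that guess can be sketched as follows. This is a minimal illustration, not torchaudio code: the helper name `window_length_samples` is hypothetical, and the 16 kHz rows are added only for comparison.

```python
def window_length_samples(sample_rate_hz: int, window_ms: float) -> int:
    """Return the number of samples covered by a window of `window_ms` milliseconds."""
    return round(sample_rate_hz * window_ms / 1000)


# 50 ms at 8 kHz gives the 400-sample default discussed in this thread.
print(window_length_samples(8000, 50))   # 400
# For comparison (assumed sample rate): 25 ms at 16 kHz is also 400 samples,
# and 32 ms at 16 kHz is 512 samples.
print(window_length_samples(16000, 25))  # 400
print(window_length_samples(16000, 32))  # 512
```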

vincentqb commented 4 years ago

The current value has been around for a while, and it hasn't been changed because doing so would break backward compatibility (BC). I'd be OK with changing it if there is a strong reason to do so. Thoughts?

keunwoochoi commented 4 years ago

I see. I have no benchmark data, but I thought the backend FFT operation could be more efficient with `n_fft=2**N`. But I don't have a strong opinion now - after googling I realized a non-power-of-two FFT can be efficient too :)
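A small sketch of the size question, assuming nothing about which FFT backend PyTorch uses: the helpers below (all hypothetical names) check whether a size is a power of two, find the next power of two, and list the prime factors. Sizes built only from small primes, such as 400, are generally handled well by mixed-radix FFT implementations.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def prime_factors(n: int) -> list[int]:
    """Prime factorization of n by trial division (fine for small FFT sizes)."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


print(is_power_of_two(400), next_power_of_two(400))  # False 512
print(prime_factors(400))                            # [2, 2, 2, 2, 5, 5]
print(prime_factors(512))                            # [2, 2, 2, 2, 2, 2, 2, 2, 2]
```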

vincentqb commented 4 years ago

Quick run, and I don't see significant differences :)

In [1]: f = "steam-train-whistle-daniel_simon.wav"

In [2]: import torchaudio

In [3]: w, s = torchaudio.load(f)

In [4]: %timeit torchaudio.transforms.Spectrogram(n_fft=400)(w)
10.5 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit torchaudio.transforms.Spectrogram(n_fft=512)(w)
11.1 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vincentqb commented 4 years ago

I will close this issue for now. Please feel free to re-open if there are more elements you would like to add to the discussion :)

PetrochukM commented 4 years ago

@faroit What would you consider a better window size today? In the recent Tacotron paper, they also used a 50-millisecond frame size; however, the Kaldi spectrogram recommends a 25-millisecond frame size. From my online reading, it sounds like 20-30 milliseconds is recommended for a text-to-speech application, with a hop length of 50% of the frame.
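To make the frame-size and hop-length numbers concrete, here is a rough sketch. The 16 kHz sample rate, the helper names, and the centered-padding frame count are assumptions for illustration, not something stated in this thread.

```python
def frame_params(sample_rate_hz: int, frame_ms: float, overlap: float = 0.5) -> tuple[int, int]:
    """Return (win_length, hop_length) in samples for a frame duration and overlap fraction."""
    win_length = round(sample_rate_hz * frame_ms / 1000)
    hop_length = round(win_length * (1 - overlap))
    return win_length, hop_length


def num_frames(num_samples: int, hop_length: int) -> int:
    """Frame count, assuming the signal is padded so that frames are centered."""
    return 1 + num_samples // hop_length


# Assumed sample rate of 16 kHz; a 50% hop means overlap=0.5.
print(frame_params(16000, 25))  # (400, 200)
print(frame_params(16000, 50))  # (800, 400)
print(num_frames(16000, 200))   # 81 frames for one second of audio
```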

faroit commented 4 years ago

@PetrochukM Yes, maybe 32 ms (`n_fft=512`) would be a better fit with respect to performance, as pointed out by @keunwoochoi.
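For reference, the 32 ms figure corresponds to a 16 kHz sample rate, which is an assumption here since the comment does not state one. The hypothetical helper below compares the two sizes, assuming a one-sided spectrum with `n_fft // 2 + 1` frequency bins.

```python
def stft_resolution(n_fft: int, sample_rate_hz: int) -> dict:
    """Window duration, one-sided bin count, and bin width for a given FFT size."""
    return {
        "window_ms": 1000 * n_fft / sample_rate_hz,
        "num_bins": n_fft // 2 + 1,
        "bin_width_hz": sample_rate_hz / n_fft,
    }


# Assumed sample rate of 16 kHz.
print(stft_resolution(400, 16000))
# {'window_ms': 25.0, 'num_bins': 201, 'bin_width_hz': 40.0}
print(stft_resolution(512, 16000))
# {'window_ms': 32.0, 'num_bins': 257, 'bin_width_hz': 31.25}
```

A larger `n_fft` gives finer frequency resolution at the cost of a longer analysis window, so the choice is a trade-off rather than purely a speed question.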