Open Alexey322 opened 3 years ago
Hi @Alexey322
I think the author padded the input before running the STFT (short-time Fourier transform) on all frames of the audio segment.
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size,
                  window=hann_window[str(y.device)], center=center,
                  pad_mode='reflect', normalized=False, onesided=True)
You can check the torch.stft function in the API docs for more details.
Hi @leminhnguyen.
Thanks for your reply. Why can't we just align the fragment size with the convolutions? With the v1 configuration, 29 mel frames correspond to 8192 samples, so what's the point of adding redundant data?
Why do we need to pad the audio fragment when computing its mel spectrogram?
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
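A small arithmetic sketch may clarify why that pad is there. Assuming the HiFi-GAN v1 values n_fft=1024, hop_size=256 (these are assumptions, not taken from this thread): without padding, an STFT with center=False over an 8192-sample fragment yields only 29 frames, which matches the "29 mels" observation above. Reflect-padding (n_fft - hop_size) / 2 samples on each side brings the frame count up to exactly segment_length / hop_size, so the mel length lines up with the generator's total upsampling factor:

```python
# Assumed v1-style settings (hypothetical values for illustration)
n_fft, hop_size = 1024, 256
segment = 8192  # audio fragment length in samples

# Frame count of an unpadded STFT (center=False):
frames_unpadded = (segment - n_fft) // hop_size + 1  # -> 29

# Reflect-pad (n_fft - hop_size) // 2 samples on each side,
# as in the pad call above:
pad = (n_fft - hop_size) // 2          # 384 samples per side
padded = segment + 2 * pad             # 8960 samples
frames_padded = (padded - n_fft) // hop_size + 1  # -> 32

print(frames_unpadded, frames_padded, segment // hop_size)
# -> 29 32 32
```

So the padding is not redundant data: it makes the number of spectrogram frames exactly segment // hop_size, so each mel frame corresponds to one hop of audio and the vocoder's upsampled output length matches the input waveform.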