jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Pad audio fragment #90

Open Alexey322 opened 3 years ago

Alexey322 commented 3 years ago

Why do we need to pad the audio fragment when computing its mel spectrogram?

y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
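For reference, here is a minimal pure-Python sketch of what that `reflect` padding does to a fragment. The helper `reflect_pad` is hypothetical (it is not part of the repository), and the values `n_fft = 1024` and `hop_size = 256` are assumptions that seem to match the v1 configuration; only the 8192-sample fragment size comes from this thread.

```python
def reflect_pad(samples, left, right):
    """Mirror the samples around each edge without repeating the edge value.

    Hypothetical helper that imitates mode='reflect' on a plain list.
    """
    if left >= len(samples) or right >= len(samples):
        raise ValueError("padding must be smaller than the input length")
    head = samples[1:left + 1][::-1]     # mirrored start of the fragment
    tail = samples[-right - 1:-1][::-1]  # mirrored end of the fragment
    return head + samples + tail


# Assumed values, not taken from this thread.
n_fft, hop_size = 1024, 256
pad = (n_fft - hop_size) // 2  # same expression as in the line above

fragment = list(range(8192))   # stand-in for 8192 audio samples
padded = reflect_pad(fragment, pad, pad)

print(pad, len(padded))
print(reflect_pad([0, 1, 2, 3, 4], 2, 2))
```

Under these assumptions each side grows by 384 samples, so the fragment passed to the STFT is 8960 samples long instead of 8192.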

leminhnguyen commented 3 years ago

Hi @Alexey322

I think the author pads the audio so that the STFT (short-time Fourier transform) covers every frame of the input audio segment, including the ones at the edges.

spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
                      center=center, pad_mode='reflect', normalized=False, onesided=True)

See the torch.stft API documentation for more details.

Alexey322 commented 3 years ago

Hi @leminhnguyen.

Thanks for your reply. Why can't we just align the fragment size with the convolutions? With the v1 configuration, 29 mel frames correspond to 8192 samples, so what's the point of adding redundant data?
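
The frame counts in question can be checked with a short arithmetic sketch. It assumes `n_fft = 1024`, `hop_size = 256`, and that the STFT is called with `center=False`, so that no extra padding is added inside `torch.stft`; none of these values are stated in this thread. The function `num_frames` is a hypothetical helper, not part of the repository.

```python
def num_frames(n_samples, n_fft, hop_size, pad=0):
    """Number of STFT frames for a signal padded by `pad` samples per side.

    Assumes the STFT itself adds no padding (center=False).
    """
    padded_len = n_samples + 2 * pad
    if padded_len < n_fft:
        return 0
    return 1 + (padded_len - n_fft) // hop_size


# Assumed values; only segment_size = 8192 appears in the thread.
n_fft, hop_size, segment_size = 1024, 256, 8192
pad = (n_fft - hop_size) // 2

without_pad = num_frames(segment_size, n_fft, hop_size)
with_pad = num_frames(segment_size, n_fft, hop_size, pad)

print(without_pad, with_pad, segment_size // hop_size)
```

Under these assumptions the unpadded fragment gives the 29 frames mentioned above, while the padded one gives 32, which equals `segment_size // hop_size`. That would make each mel frame correspond to exactly `hop_size` samples, which may be the reason for the padding.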