jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License
1.92k stars 506 forks

What is the input format? #32

Closed prattcmp closed 3 years ago

prattcmp commented 3 years ago

I've seen a lot of general discussion about feeding generated mels into HiFi-GAN, and of course we can see the mel-spectrogram hparams in each config file, but nothing actually says what format the input x to Generator(x) should have. Is it (1, n_mels, frames)? Is normalization expected? Nothing I've tried works.

jik876 commented 3 years ago

The expected format is the input shape used in inference. If your shape doesn't match, you can simply add a transpose, as described in #27. No normalization is needed after generating the mel-spectrogram. If you post the details of what you have tried, it will be easier to find a solution.
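The transpose step mentioned above can be sketched as follows. This is only an illustration, assuming the generator expects the (batch, n_mels, frames) layout, which is one of the two layouts discussed in this thread, and assuming n_mels is 80. The helper name to_generator_input is hypothetical and not part of the repository.

```python
import numpy as np


def to_generator_input(mel, n_mels=80):
    """Return `mel` as a float32 array shaped (1, n_mels, frames).

    Assumption: the generator wants (batch, n_mels, frames).
    `n_mels=80` is an assumed default; use the value from your config.
    """
    mel = np.asarray(mel, dtype=np.float32)

    # Add the batch axis if a single 2-D spectrogram was passed in.
    if mel.ndim == 2:
        mel = mel[np.newaxis, ...]
    if mel.ndim != 3:
        raise ValueError(f"expected a 2-D or 3-D array, got shape {mel.shape}")

    # If the mel axis is last, i.e. (batch, frames, n_mels), swap the last two axes.
    if mel.shape[1] != n_mels:
        if mel.shape[2] == n_mels:
            mel = mel.transpose(0, 2, 1)
        else:
            raise ValueError(f"no axis of size {n_mels} in shape {mel.shape}")
    return mel


# A (frames, n_mels) spectrogram, as some Tacotron pipelines produce it.
mel = np.zeros((120, 80))
print(to_generator_input(mel).shape)
```

The result can then be converted with torch.from_numpy(...) and moved to the model's device before calling the generator. Note that the check cannot tell the two layouts apart when the number of frames happens to equal n_mels.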

prattcmp commented 3 years ago

@jik876 I’ve tried transposing. I’ve tried (1, n_mels, frames) and (1, frames, n_mels). I think the shape I’m using is correct, but all I get out is static. Is it a preprocessing problem? I use librosa to generate ground truth mel spectrograms for my Tacotron model.

I’ve read through that issue and it did not help.

Miralan commented 3 years ago

If you load the wav with librosa, it is already float32, so you do not need to divide the loaded wav by MAX_WAV_VALUE (32768). Doing so makes the input almost zero.
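The scaling issue described above can be sketched like this. It is a minimal illustration, assuming 16-bit integer PCM is the only integer format in play and that float audio is already in the range [-1, 1]. The helper to_unit_range is hypothetical, and MAX_WAV_VALUE is simply this sketch's name for 32768.

```python
import numpy as np

MAX_WAV_VALUE = 32768.0  # full scale of 16-bit PCM


def to_unit_range(wav):
    """Scale audio to roughly [-1, 1] exactly once.

    int16 PCM is divided by MAX_WAV_VALUE; float audio is assumed to be
    normalized already and is returned unchanged (as float32).
    """
    wav = np.asarray(wav)
    if wav.dtype == np.int16:
        return wav.astype(np.float32) / MAX_WAV_VALUE
    if np.issubdtype(wav.dtype, np.floating):
        return wav.astype(np.float32)
    raise TypeError(f"unsupported dtype: {wav.dtype}")


# Float audio that is already normalized, with a peak of 0.5.
float_wav = np.array([0.0, 0.5, -0.25], dtype=np.float32)

# Dividing it again shrinks the signal to nearly zero.
print(np.abs(float_wav / MAX_WAV_VALUE).max())

# The helper leaves it untouched.
print(np.abs(to_unit_range(float_wav)).max())
```

Branching on the dtype keeps the division from being applied twice when the loading library changes.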

jik876 commented 3 years ago

I'm closing this as there have been no recent updates. Please reopen it if you need further comments.