jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License

Batch synthesis noise at the end #50

ctlaltdefeat closed this issue 3 years ago

ctlaltdefeat commented 3 years ago

When doing batch synthesis (inference), I zero-pad the mel inputs so that they are the same length, which causes a harsh, buzzing sound to be generated by HiFi-GAN.

Assuming that batching is required for my application's performance purposes, what is the advised approach to dealing with this issue? I don't see support for passing in any sort of mask argument. Should I just try to heuristically cut the resulting wav audio so as to eliminate the noise at the end?

Miralan commented 3 years ago

Why not mask the rest of the wav with zeros? The output wav length should be hop_size * mel_length.
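A minimal sketch of that idea, assuming the generator output has already been squeezed to shape (batch, samples) and that you kept each mel's unpadded frame count. It slices off the tail rather than zeroing it, which gives the same audible result. The function name `trim_batch_audio` and the toy numbers are made up for illustration; use the hop_size from your own config.

```python
import numpy as np

def trim_batch_audio(audio, mel_lengths, hop_size):
    """Cut each waveform down to hop_size * mel_length samples.

    audio: array of shape (batch, samples), output for the padded batch.
    mel_lengths: unpadded frame count of each mel, in batch order.
    hop_size: number of audio samples per mel frame.
    """
    return [wav[: n_frames * hop_size] for wav, n_frames in zip(audio, mel_lengths)]

# Stand-in for generator output: 2 utterances padded to 5 frames, hop_size 4.
hop_size = 4
mel_lengths = [5, 3]
fake_audio = np.ones((2, 5 * hop_size), dtype=np.float32)

trimmed = trim_batch_audio(fake_audio, mel_lengths, hop_size)
print([len(w) for w in trimmed])  # [20, 12]
```

Everything past `hop_size * mel_length` comes from the padded frames, so dropping it removes the buzz regardless of which pad value was used.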

CookiePPP commented 3 years ago

You can also pad the spectrogram with -11.52, which should make the padded area equivalent to silence.
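A rough sketch of padding with that value instead of zero, assuming each mel is a (n_mels, frames) array of natural-log magnitudes. The -11.52 is taken from this comment; it is close to log(1e-5), which would be the floor if magnitudes are clamped at 1e-5 before the log. The helper name `pad_mels` is hypothetical.

```python
import numpy as np

# Pad value suggested above; assumed to sit at the log-mel floor.
SILENCE_VALUE = -11.52

def pad_mels(mels, pad_value=SILENCE_VALUE):
    """Stack variable-length mels of shape (n_mels, frames) into one batch.

    Returns the padded batch and the original frame counts, which are
    still needed afterwards to trim the generated audio.
    """
    lengths = [m.shape[1] for m in mels]
    n_mels = mels[0].shape[0]
    batch = np.full((len(mels), n_mels, max(lengths)), pad_value, dtype=np.float32)
    for i, m in enumerate(mels):
        batch[i, :, : m.shape[1]] = m
    return batch, lengths

# Two toy mels with 80 bins and 5 and 3 frames.
mels = [np.zeros((80, 5), dtype=np.float32), np.zeros((80, 3), dtype=np.float32)]
batch, lengths = pad_mels(mels)
print(batch.shape, lengths)  # (2, 80, 5) [5, 3]
```

Zero in the log domain corresponds to a magnitude of 1 in every mel bin, which is far from silence and would explain the buzzing. Even with this padding it is probably worth keeping `lengths` and trimming the output, since nothing guarantees the model renders the padded frames as exact silence.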

ctlaltdefeat commented 3 years ago

> Why not mask the rest of the wav with zeros? The output wav length should be hop_size * mel_length.

Thanks, you're right about that and it's the obvious solution, and I do know the hop_size so that's easy to implement.

> You can also pad the spectrogram with -11.52, which should make the padded area equivalent to silence.

That would be better than zero-padding, but I haven't tested whether a trained HiFi-GAN model would accurately convert those frames to total silence.