MPD Vs Filter-bank discriminator

jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

MIT License


Hi, thanks for sharing this great work. I have one theoretical question regarding the use of the multi-period discriminator (MPD):

As I understand the MPD, the input waveform is reshaped according to the target period p to obtain a 2D map that models different periods of the signal. With this simple reshaping, however, the subbands of the signal overlap with each other, which is visible when plotting their spectrograms. If that's correct, what do you think of using a simple filter bank to decompose the speech waveform into subbands without this overlap?
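The reshaping and the overlap described above can be illustrated with a minimal sketch. It uses NumPy rather than the repository's own code, and the helper names (`reshape_by_period`, `dominant_bin`), the reflect padding, and the test tone are assumptions chosen for illustration. Each column of the reshaped map is the waveform subsampled by p with no low-pass filter applied first, so a tone above the reduced Nyquist limit shows up at a folded (aliased) frequency.

```python
import numpy as np

def reshape_by_period(x, period):
    """Fold a 1-D waveform into a (frames, period) map.

    Assumption: the tail is reflect-padded so the length becomes a
    multiple of `period`; this is an illustrative choice.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % period  # samples needed to reach a multiple of `period`
    if pad:
        x = np.pad(x, (0, pad), mode="reflect")
    return x.reshape(-1, period)

def dominant_bin(signal):
    """Index of the strongest non-DC bin of the real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    return int(np.argmax(spectrum[1:]) + 1)

# Hypothetical test signal: a pure tone at bin 300 of a 1024-sample frame.
n, tone_bin, period = 1024, 300, 4
t = np.arange(n)
x = np.cos(2 * np.pi * tone_bin * t / n)

folded = reshape_by_period(x, period)  # shape (256, 4)
column = folded[:, 0]                  # x[0], x[4], x[8], ...: decimation by 4, no filter

print(folded.shape)          # (256, 4)
print(dominant_bin(x))       # 300
print(dominant_bin(column))  # 44, because 300 mod 256 = 44: the tone has aliased
```

In this sketch the column has only 256 samples, so its spectrum covers bins 0 to 128; the original tone at bin 300 cannot be represented there and lands at bin 44 instead. A filter bank would band-limit the signal before subsampling, which is the difference the question is asking about.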

I would really appreciate it if you have already run experiments on this, or at least if you could explain the difference between having multiple periods of the signal and having multiple subbands.

Thanks for your interest. Since we haven't experimented with the setting you mention, it's difficult to say what the results would be. Our intent is for the neural networks to learn the periodic patterns of the raw waveform as it is. The data for a specific period contains various frequency components, along with their harmonics. The method we proposed lets the networks train on those patterns exactly as they are, rather than on specific frequencies. If a model is trained on decomposed frequencies, I think it will be looking at patterns that differ from those of the audio that needs to be generated. I think this is the key difference between training on periodic patterns of the raw waveform as it is and training on decomposed frequencies. Several works have achieved good results using subbands, and I think it is a very good approach to improving efficiency.
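One way to make the remark about "various frequency components" concrete is the standard decimation identity from signal processing. The notation here is an assumption introduced for illustration and does not come from the thread: x[n] is the waveform with spectrum X, p is the period, and y_k is the k-th column of the reshaped map.

```latex
y_k[n] = x[pn + k], \qquad k = 0, 1, \dots, p-1

Y_k\!\left(e^{j\omega}\right)
  = \frac{1}{p} \sum_{m=0}^{p-1}
    e^{\,j(\omega - 2\pi m)k/p}\,
    X\!\left(e^{\,j(\omega - 2\pi m)/p}\right)
```

Read this way, each column is a sum of p stretched and shifted copies of the full spectrum, so content from every band of the original waveform is folded into the same column. A filter bank would instead band-limit the signal before subsampling, so each output would carry mostly one band. Whether one input representation trains a better discriminator than the other is not settled by this identity; as noted above, that experiment was not run.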


MPD Vs Filter-bank discriminator #35