thuhcsi / VAENAR-TTS

The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.
MIT License

Can I use a GAN-based network to replace the flow-based prior P(Z|X)? #9

Closed: seekerzz closed this issue 3 years ago

seekerzz commented 3 years ago

If I understand this paper and FlowSeq correctly, the normalizing flow is used to model the prior P(Z|X), i.e. the dependence of the latent Z on the text X, and it is trained to match the posterior P(Z|X, Y). Since a GAN can also model a distribution, can I use a GAN-based network to replace the flow-based prior P(Z|X)?

light1726 commented 3 years ago

Hi @seekerzz! The issue is that the prior needs to do two things efficiently: sample z's and evaluate the probabilities of given z's. A GAN can do the sampling, but it cannot do the probability evaluation.
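
To make the two requirements concrete, here is a minimal sketch of a toy 1-D affine "flow". It is not the VAENAR-TTS prior; the class name `AffineFlowPrior` and the fixed `mu` and `sigma` are hypothetical stand-ins for values a network would predict from the text X. It only shows that an invertible map gives both a sampler and an exact log-probability through the change-of-variables formula.

```python
import math
import random


class AffineFlowPrior:
    """Toy 1-D 'flow' prior: z = mu + sigma * eps, with eps ~ N(0, 1).

    Hypothetical stand-in for a conditional flow prior P(Z|X); in a real
    model mu and sigma would be predicted from the text X by a network.
    """

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def sample(self, rng):
        # Forward direction: push base noise through the invertible map.
        eps = rng.gauss(0.0, 1.0)
        return self.mu + self.sigma * eps

    def log_prob(self, z):
        # Reverse direction: invert the map, then apply change of variables:
        # log p(z) = log N(eps; 0, 1) - log|dz/deps|.
        eps = (z - self.mu) / self.sigma
        log_base = -0.5 * (eps * eps + math.log(2.0 * math.pi))
        return log_base - math.log(abs(self.sigma))


rng = random.Random(0)
prior = AffineFlowPrior(mu=0.5, sigma=2.0)
z = prior.sample(rng)
print(z, prior.log_prob(z))
```

A GAN generator would typically provide only the `sample` half of this interface, which is the limitation described above.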

seekerzz commented 3 years ago

Thank you so much for the quick reply!😁 Here is my rough understanding: to approximate P(Z|X, Y), we can train another predicted distribution P(Z|X) to get close to it.

  1. If we want to model the distribution explicitly, we need to compute the probability under P(Z|X, Y) and also the probability under P(Z|X), and use the KL divergence to bring them close. Thus, P(Z|X, Y) is modelled as a Gaussian so that its probability is simple to compute, and the probability under P(Z|X) is computed by running the normalizing flow in reverse.
  2. Maybe I misunderstood your point, but I suppose that evaluating the probabilities of z's serves the idea above (getting close to P(Z|X, Y)). My thinking is that we can also use a GAN to get close to P(Z|X, Y): sample some noise, combine it with X, and generate G(Z|X) to fool a discriminator that distinguishes samples of P(Z|X, Y) from the generated G(Z|X), so that the two distributions become close. Would this work, or have I made a mistake? Thank you!

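
As a rough sketch of point 1 (not code from the repository), the KL term can be estimated by sampling z from the Gaussian posterior and averaging log q(z) - log p(z), where log p(z) comes from evaluating the flow in reverse. The affine flow, all function names, and the numbers below are assumptions chosen so that the estimate can be checked against the closed-form Gaussian KL.

```python
import math
import random


def gaussian_log_prob(z, mu, sigma):
    # Log-density of the (assumed Gaussian) posterior q(z | x, y).
    return -0.5 * (((z - mu) / sigma) ** 2 + math.log(2.0 * math.pi)) - math.log(sigma)


def flow_prior_log_prob(z, shift, scale):
    # Toy affine flow z = shift + scale * eps, eps ~ N(0, 1), run in reverse:
    # recover eps from z, then subtract the log-determinant log(scale).
    eps = (z - shift) / scale
    return -0.5 * (eps * eps + math.log(2.0 * math.pi)) - math.log(scale)


def monte_carlo_kl(q_mu, q_sigma, shift, scale, n_samples, rng):
    # KL(q || p) is approximated by the mean of log q(z) - log p(z), z ~ q.
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mu, q_sigma)
        total += gaussian_log_prob(z, q_mu, q_sigma) - flow_prior_log_prob(z, shift, scale)
    return total / n_samples


def closed_form_kl(q_mu, q_sigma, p_mu, p_sigma):
    # Exact KL between two 1-D Gaussians, used here only as a sanity check.
    return (math.log(p_sigma / q_sigma)
            + (q_sigma ** 2 + (q_mu - p_mu) ** 2) / (2.0 * p_sigma ** 2)
            - 0.5)


rng = random.Random(0)
estimate = monte_carlo_kl(0.3, 0.8, 0.0, 1.0, 20000, rng)
exact = closed_form_kl(0.3, 0.8, 0.0, 1.0)
print(estimate, exact)
```

The estimate needs `flow_prior_log_prob`, which is exactly the piece a GAN generator does not provide; the proposal in point 2 would replace this objective with a discriminator loss.
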
light1726 commented 3 years ago

Hmm, on first reading your question I was too focused on the computation of the KL.

I think your idea is doable. GANs are good at generating high-quality samples. However, if you sample from the GAN and close the distance to P(Z|X, Y), it would be a reverse KL computation.

But that doesn't mean it's a bad choice, since for TTS we care more about generation quality than about NLL or ELBO.
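
For a concrete feel of why the direction matters, here is a small sketch comparing the two KL directions for 1-D Gaussians. It assumes "reverse" refers to taking the expectation under the generator's samples rather than the posterior's; the numbers are made up, and a real GAN objective is not literally a KL divergence.

```python
import math


def gaussian_kl(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form KL(N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2)) in 1-D."""
    return (math.log(sigma_b / sigma_a)
            + (sigma_a ** 2 + (mu_a - mu_b) ** 2) / (2.0 * sigma_b ** 2)
            - 0.5)


# Hypothetical numbers: a narrow "posterior" and a broad "generator".
post_mu, post_sigma = 0.0, 0.5
gen_mu, gen_sigma = 0.0, 2.0

# Expectation under posterior samples (the direction used in the ELBO).
forward_kl = gaussian_kl(post_mu, post_sigma, gen_mu, gen_sigma)
# Expectation under generator samples (the direction discussed above).
reverse_kl = gaussian_kl(gen_mu, gen_sigma, post_mu, post_sigma)

print(forward_kl, reverse_kl)
```

In this example the reverse direction penalizes the broad generator much more heavily for placing mass where the posterior has little, so the two objectives can prefer different fits.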

Anyway, I think you can give it a shot. Good luck!

seekerzz commented 3 years ago

Many thanks to you😁😊