huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

About the prior loss and MAS algorithm #18

Open · cantabile-kwok opened this issue 2 years ago

cantabile-kwok commented 2 years ago

Great work! I've been studying the paper and the code recently, and there's something that confuses me a lot.

In my understanding, the encoder outputs a Gaussian distribution with a different mu for each phoneme, and the DPM decoder recovers the mel-spectrogram y from these Gaussians, so y itself is no longer Gaussian. But I gather from Eq. (14) and the code that when calculating the prior loss you are actually computing the log-likelihood of y under the Gaussian with mean mu. Also, when applying MAS for duration modeling, you perform the same kind of likelihood computation to get the soft alignment (denoted log_prior in the code). So I wonder why this is reasonable. I also compared the Glow-TTS code: there, the Gaussian likelihood with mean mu is evaluated on z, the latent variable obtained from the mel-spectrogram through the normalizing flow. That seems more reasonable to me, since z is Gaussian by construction.
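
To be concrete, here is a minimal sketch of the likelihood computation I am referring to (my own simplification, not the repository code): the log-density of each mel frame under the unit-variance Gaussian centered at each phoneme's mu, which gives the log_prior matrix used both for the prior loss and for MAS.

```python
import math
import torch

def gaussian_log_prior(mu, y):
    """log N(y_j ; mu_i, I) for every (phoneme i, mel frame j) pair.

    mu: [n_phonemes, n_feats]  per-phoneme Gaussian means from the encoder
    y:  [n_frames, n_feats]    target mel-spectrogram frames
    returns a [n_phonemes, n_frames] matrix of log-likelihoods ("log_prior")
    """
    n_feats = mu.shape[-1]
    const = -0.5 * n_feats * math.log(2 * math.pi)
    sq_dist = torch.cdist(mu, y, p=2) ** 2  # squared distance for every pair
    return const - 0.5 * sq_dist
```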

cantabile-kwok commented 2 years ago

I think this loss is useful for two reasons:

  1. It is necessary for MAS, since Grad-TTS uses this same likelihood as the soft alignment matrix for the MAS algorithm (see the sketch after this list). My experiments show that if ground-truth durations are used, removing the prior loss does not degrade quality.
  2. It helps convergence, as it pushes mu close to y in the first place.
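
For concreteness, here is a rough NumPy sketch of the monotonic alignment dynamic program that consumes this log-likelihood matrix (my own simplification, not the repository's implementation):

```python
import numpy as np

def monotonic_alignment_search(log_prior):
    """Hard monotonic alignment maximizing the summed log-likelihood.

    log_prior: [n_phonemes, n_frames] pairwise log N(y_j ; mu_i, I)
    returns:   [n_phonemes, n_frames] binary alignment matrix
    (assumes n_frames >= n_phonemes)
    """
    n_phonemes, n_frames = log_prior.shape
    Q = np.full((n_phonemes, n_frames), -np.inf)
    Q[0, 0] = log_prior[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_phonemes)):
            stay = Q[i, j - 1]                            # keep the same phoneme
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance to the next phoneme
            Q[i, j] = max(stay, move) + log_prior[i, j]
    # backtrack from the last (phoneme, frame) cell
    alignment = np.zeros_like(log_prior)
    i = n_phonemes - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[i, j] = 1.0
        if j > 0 and i > 0 and Q[i - 1, j - 1] > Q[i, j - 1]:
            i -= 1
    return alignment
```
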
li1jkdaw commented 1 year ago

@cantabile-kwok Thank you very much for this question! It is indeed a subtle point. Actually, we need the output of the encoder mu to have the following properties:

1) mu should be a reasonable speech representation, since we condition the score matching network s_theta(x_t, mu, t) on this mu; we want mu to carry important information about the target speech (e.g. mu should be aligned well with the input text). This corresponds to point 1 in your previous comment.

2) mu should be close to the target mel-spectrogram y, because the reverse diffusion starts generation from N(mu, I); this is exactly point 2 in your previous comment (see the sampling sketch below). Note that this second point is not strictly necessary, but it is beneficial in terms of the number of reverse diffusion steps needed for good quality (see Table 1 in our paper).
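
To illustrate the second point, here is a rough sketch of an ODE-style sampler (with a hypothetical score_model callable and a linear noise schedule; it is not the exact repository code) showing that generation starts from N(mu, I) and is conditioned on the same mu:

```python
import torch

@torch.no_grad()
def reverse_diffusion(score_model, mu, n_steps=50, beta_min=0.05, beta_max=20.0):
    """ODE-style reverse diffusion sketch that starts from N(mu, I).

    score_model(x, mu, t) is a hypothetical stand-in for s_theta(x_t, mu, t);
    it is conditioned on the same mu that defines the terminal distribution.
    """
    h = 1.0 / n_steps
    x = mu + torch.randn_like(mu)                      # x_1 ~ N(mu, I)
    for i in range(n_steps):
        t = 1.0 - (i + 0.5) * h                        # go from t=1 down to t=0
        beta_t = beta_min + (beta_max - beta_min) * t  # linear noise schedule
        t_batch = torch.full((x.shape[0],), t, device=x.device)
        score = score_model(x, mu, t_batch)
        x = x - 0.5 * beta_t * h * (mu - x - score)    # one Euler step of the reverse dynamics
    return x
```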

So, in contrast with Glow-TTS, where the analogue of our encoder loss L_enc has a clear probabilistic interpretation (it is one of the terms in the log-likelihood optimized during training), in Grad-TTS the encoder should simply output a mu with the two properties mentioned above. You can treat the encoder output as a Gaussian distribution (leading to a weighted L2 loss between mu and y), or you can optimize any other distance between mu and y, and it may also work well. This is one of the differences between Glow-TTS and Grad-TTS: in our model the choice of the encoder loss L_enc does not affect the diffusion loss L_diff (they are, in a sense, "independent"), while in Glow-TTS there is a single NLL loss, with the analogue of our encoder loss being one of its terms and having a clear probabilistic interpretation (namely, the log of the prior).
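
As a small illustration (a sketch, not our training code), with a unit-variance Gaussian the encoder loss on the aligned mu reduces to an L2 distance plus a constant, and it could in principle be replaced by another distance without touching L_diff:

```python
import math
import torch

def encoder_loss(mu_aligned, y, kind="gaussian_nll"):
    """Sketch of L_enc computed on the aligned encoder output mu.

    mu_aligned, y: [batch, n_frames, n_feats]
    With a unit-variance Gaussian this is an L2 distance up to a constant;
    swapping in e.g. an L1 distance changes only this term, not L_diff.
    """
    if kind == "gaussian_nll":
        n_feats = y.shape[-1]
        const = 0.5 * n_feats * math.log(2 * math.pi)
        return (0.5 * (y - mu_aligned) ** 2).sum(-1).mean() + const
    return (y - mu_aligned).abs().sum(-1).mean()  # an alternative distance
```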