cantabile-kwok opened this issue 2 years ago (status: Open)
I think this loss is good for `mu` to be close to `y` in the first place.

@cantabile-kwok Thank you very much for this question! It is indeed a very subtle point. Actually, we need the output of the encoder mu to have the following properties:
1) mu should be a reasonable speech representation, since we condition the score matching network s_theta(x_t, mu, t) on it. We want mu to carry important information about the target speech (e.g. mu should be well aligned with the input text). This corresponds to point 1 in your previous comment.
2) mu should be close to the target mel-spectrogram y, because the reverse diffusion starts generation from N(mu, I). This is exactly point 2 in your previous comment.

Note that the second property is not strictly necessary, but it is beneficial in terms of the number of reverse diffusion steps sufficient for good quality (see Table 1 in our paper).
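To make the second property concrete, here is a minimal sketch (plain Python, not code from the Grad-TTS repository) of drawing the reverse-diffusion starting point from N(mu, I) and measuring how far it lands from the target. The helper names `sample_terminal` and `mean_squared_distance`, and the toy values of `y`, `mu_good` and `mu_bad`, are assumptions made up for illustration.

```python
import random


def sample_terminal(mu, rng):
    """Draw x_T ~ N(mu, I): add unit-variance Gaussian noise to every entry of mu."""
    return [[m + rng.gauss(0.0, 1.0) for m in frame] for frame in mu]


def mean_squared_distance(a, b):
    """Average squared difference over all frames and feature bins."""
    total = sum((p - q) ** 2 for fa, fb in zip(a, b) for p, q in zip(fa, fb))
    count = sum(len(fa) for fa in a)
    return total / count


# Toy "mel-spectrogram": 20 frames with 4 feature bins each (hypothetical values).
y = [[0.1 * j + f for f in range(4)] for j in range(20)]
mu_good = [[v + 0.1 for v in frame] for frame in y]  # encoder output close to y
mu_bad = [[v + 3.0 for v in frame] for frame in y]   # encoder output far from y

rng = random.Random(0)
start_good = sample_terminal(mu_good, rng)
start_bad = sample_terminal(mu_bad, rng)

# In expectation the distance is mean_squared_distance(mu, y) + 1, so a mu
# closer to y gives a starting point that is closer to y as well.
print(mean_squared_distance(start_good, y), mean_squared_distance(start_bad, y))
```

The closer the starting sample already is to y, the less work is left for the reverse diffusion, which is one way to read the remark about the number of steps.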
So, in contrast with Glow-TTS, where the analogue of our encoder loss L_enc has a clear probabilistic interpretation (it is one of the terms used to calculate the log-likelihood optimized during training), in Grad-TTS the encoder should just output mu having the two properties mentioned above. You can consider the encoder output to be a Gaussian distribution (leading to a weighted L_2 loss between mu and y), or you can just optimize any other distance between mu and y, and it may also work well. This is one of the differences between Glow-TTS and Grad-TTS: in our model the choice of encoder loss L_enc does not affect the diffusion loss L_diff (they are sort of "independent"), while in Glow-TTS there is a single NLL loss, with the analogue of our encoder loss being one of its terms with a clear probabilistic interpretation (i.e. the log of the prior).
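The link between the Gaussian view and the L_2 loss can be checked numerically. For an identity covariance, -log N(y; mu, I) = 0.5 * ||y - mu||^2 + 0.5 * n * log(2 * pi), where n is the number of entries, so the negative log-likelihood and the squared error differ only by a scale and a constant. The sketch below is plain Python written for illustration, not the repository's loss code. The function names `gaussian_nll` and `squared_error` are hypothetical, and the identity covariance is an assumption taken from the N(mu, I) mentioned above.

```python
import math


def squared_error(y, mu):
    """Sum of squared differences over all frames and feature bins."""
    return sum((a - b) ** 2 for fy, fm in zip(y, mu) for a, b in zip(fy, fm))


def gaussian_nll(y, mu):
    """-log N(y; mu, I), summed over all frames and feature bins."""
    n = sum(len(frame) for frame in y)
    return 0.5 * squared_error(y, mu) + 0.5 * n * math.log(2 * math.pi)


# Hypothetical toy values: 2 frames with 2 feature bins each.
y = [[0.0, 1.0], [2.0, 3.0]]
mu_a = [[0.5, 1.0], [2.0, 2.0]]
mu_b = [[0.0, 1.0], [2.0, 3.5]]

# The constant term cancels, so comparing two encoder outputs by NLL is the
# same as comparing them by (half) the squared error.
nll_gap = gaussian_nll(y, mu_a) - gaussian_nll(y, mu_b)
l2_gap = 0.5 * (squared_error(y, mu_a) - squared_error(y, mu_b))
print(nll_gap, l2_gap)
```

With a non-identity diagonal covariance, each squared difference would be divided by its variance, which gives the weighted L_2 form.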
Great work! I've been studying the paper and the code recently, and there is something that confuses me.

In my understanding, the encoder outputs Gaussian distributions with a different `mu` for each phoneme, and the DPM decoder recovers the mel-spectrogram `y` from these Gaussians. Hence `y` is not Gaussian anymore. But I gather from Eq. (14) and the code that when you calculate the prior loss, you are actually calculating the log-likelihood of `y` under the Gaussian distribution with mean `mu`. Also, when applying MAS for duration modeling, you perform the same kind of likelihood computation to get the soft alignment (denoted as `log_prior` in the code). So I wonder why this is reasonable. I also compared with the Glow-TTS code: there, `z` is used to evaluate the Gaussian likelihood with mean `mu`, where `z` is the latent variable obtained from the mel-spectrogram by the normalizing flow. That seems more reasonable to me for now, as `z` is Gaussian by itself.
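For readers following along, the likelihood computation described in the question can be sketched as below. This is a simplified plain-Python illustration, not the repository's implementation: it scores every (phoneme, frame) pair with log N(y_j; mu_i, I) and then runs a standard monotonic alignment dynamic program over those scores. The function names and the toy inputs are hypothetical, and the identity covariance is an assumption.

```python
import math


def log_prior_matrix(mu, y):
    """log N(y_j; mu_i, I) for every text position i (rows) and mel frame j (columns)."""
    n_feats = len(mu[0])
    const = -0.5 * n_feats * math.log(2 * math.pi)
    return [
        [const - 0.5 * sum((a - b) ** 2 for a, b in zip(y_j, mu_i)) for y_j in y]
        for mu_i in mu
    ]


def monotonic_alignment(log_prior):
    """For each frame j, return the text index i on the best monotonic path.

    Assumes there are at least as many frames as text positions.
    """
    n_text, n_frames = len(log_prior), len(log_prior[0])
    neg_inf = float("-inf")
    q = [[neg_inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_prior[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                             # same text position
            move = q[i - 1][j - 1] if i > 0 else neg_inf   # advance by one
            q[i][j] = log_prior[i][j] + max(stay, move)
    # Backtrack from the last text position at the last frame.
    path = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return path


# Hypothetical toy input: 2 text positions, 4 frames, 1 feature bin.
mu = [[0.0], [5.0]]
y = [[0.1], [0.2], [4.9], [5.1]]
alignment = monotonic_alignment(log_prior_matrix(mu, y))
print(alignment)  # → [0, 0, 1, 1]
```

Here the Gaussian log-density acts as a similarity score between frames and encoder outputs, so the alignment depends only on which `mu` each frame is closest to.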