Sorry for the dense calculation of the MLE loss...
I'll let you know when I clean up the clutter in the code. For now, I'll explain the loss term by term.
The original line I implemented was:
l_mle = 0.5 * math.log(2 * math.pi) \
    + (torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2) - torch.sum(logdet)) \
    / (torch.sum(y_lengths // hps.model.n_sqz) * hps.model.n_sqz * hps.data.n_mel_channels)
It can be decomposed as
l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2)
l_mle_jacob = -torch.sum(logdet)
l_mle_sum = l_mle_normal + l_mle_jacob
denom = torch.sum(y_lengths // hps.model.n_sqz) * hps.model.n_sqz * hps.data.n_mel_channels
l_mle = 0.5 * math.log(2 * math.pi) + l_mle_sum / denom
1) l_mle_normal is the negative log-likelihood of the normal distribution N(z | y_m, y_logs) (except for the constant term 0.5*log(2pi)), where y_m and y_logs are the mean and the logarithm of the standard deviation of the prior distribution. Please see Equation 2 in the paper. (A quick numerical check of this term follows after step 4.)
l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2)
2) l_mle_jacob denotes the negative log-determinant of the Jacobian of the flows. Please see Equation 1 in the paper.
l_mle_jacob = -torch.sum(logdet)
3) l_mle_sum denotes the total negative log-likelihood of the model, and denom is a denominator that averages the total negative log-likelihood across the batch, time steps and mel channels (our model forces the mel-spectrogram lengths y_lengths to be a multiple of n_sqz).
l_mle_sum = l_mle_normal + l_mle_jacob
denom = torch.sum(y_lengths // hps.model.n_sqz) * hps.model.n_sqz * hps.data.n_mel_channels
4) Add the constant term, 0.5*log(2pi), that was excluded in step 1.
l_mle = 0.5 * math.log(2 * math.pi) + l_mle_sum / denom
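As a quick numerical check of the step-1 term (this is my own sketch, not code from the repository), it can be compared against torch.distributions.Normal, whose log_prob includes the 0.5*log(2pi) constant for every element:

import math
import torch

# Random stand-ins for z, y_m and y_logs, shaped (batch, mel channels, frames)
z = torch.randn(4, 80, 100)
y_m = torch.randn_like(z)
y_logs = 0.1 * torch.randn_like(z)

l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2)

# Summed NLL from the library distribution, minus the per-element constant term
nll = -torch.distributions.Normal(y_m, torch.exp(y_logs)).log_prob(z).sum()
constant = 0.5 * math.log(2 * math.pi) * z.numel()
print(torch.allclose(l_mle_normal, nll - constant, rtol=1e-4))  # expected: True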
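Putting the four steps together, the whole computation could be wrapped in a helper roughly like the one below. This is just a sketch that mirrors the decomposition above; the function name and signature are my own choices, not necessarily what the cleaned-up code will look like.

import math
import torch

def mle_loss_sketch(z, y_m, y_logs, logdet, y_lengths, n_sqz, n_mel_channels):
    # 1) NLL under the prior N(y_m, exp(y_logs)), without the constant term
    l_mle_normal = torch.sum(y_logs) + 0.5 * torch.sum(torch.exp(-2 * y_logs) * (z - y_m)**2)
    # 2) Negative log-determinant of the Jacobian of the flows
    l_mle_jacob = -torch.sum(logdet)
    l_mle_sum = l_mle_normal + l_mle_jacob
    # 3) Average over batch, time steps and mel channels
    denom = torch.sum(y_lengths // n_sqz) * n_sqz * n_mel_channels
    # 4) Add back the constant term so the value is the exact average negative log-likelihood
    return 0.5 * math.log(2 * math.pi) + l_mle_sum / denom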
Thanks for your detailed explanation. I think you could ignore the constant term; it does not contribute to backpropagation. Btw, I found another paper, AlignTTS, that has the same idea of implicitly learning the duration of each character, but with a different approach.
Yes, the constant term is ignored in backpropagation. I just left it in for the exact calculation of the log-likelihood. And I saw AlignTTS, which also proposes an alignment search algorithm similar to Glow-TTS. I think it is clever, thanks for the heads-up! Btw, I hope you enjoy the interesting characteristics of our model, such as manipulating the latent representation of speech :)
Just wanted to say amazing work! I love the controllability of length and expressiveness. I wanted to try a few ideas of my own using your repository as a codebase, but I've run into a strange phenomenon. It's related to the loss function, so maybe you could help me understand the cause. The strange thing is that the value of the l_mle (g0) loss depends on the value range of the mel spectrograms.
Orange - LJSpeech wavs transformed into mel spectrograms using the default parameters. Mel-spectrogram values range from 0.5 to -11.5.
Pink - My data transformed the same way as LJSpeech.
Blue - My data transformed to mel spectrograms with different STFT parameters and then scaled to the 0.5 to -11.5 range.
Gray - My data transformed to mel spectrograms with different STFT parameters. Values range from 0 to 0.76 (the same results if multiplied by -1).
From what I was able to check, in the case of data in the range of 0 to 0.76 the values differ in the following way (a toy sketch after the list illustrates how rescaling is absorbed by the Jacobian term):
l_mle_jacob - is bigger for mel spectrograms with smaller absolute values. I think it makes sense, because the Jacobian is calculated from the weights, and they have to be bigger to produce the same output values.
l_mle_normal - about the same.
denom - obviously the same.
l_mle - with a different proportion of l_mle_sum and denom, l_mle no longer normalizes to 1. I think it's a problem, because the balance between g0 and g1 is disturbed and the alignment gets worse.
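As a sanity check on my side, here is a toy script (assuming an idealized flow that exactly undoes the rescaling, so not your model) showing that rescaling the data by a factor s shifts the averaged negative log-likelihood by log(s) per dimension, with the shift coming entirely from the Jacobian term:

import math
import torch

torch.manual_seed(0)
x = 3.0 * torch.randn(10000)       # stand-in "mel values" with a wide range
s = 0.1                            # hypothetical factor that shrinks them into a narrow range

def avg_nll(y, scale):
    # idealized flow: z = y / scale maps the data back to N(0, 1)
    z = y / scale
    log_prob = -0.5 * z**2 - 0.5 * math.log(2 * math.pi)
    logdet = -math.log(scale) * y.numel()   # log|det dz/dy| summed over all elements
    return -(log_prob.sum() + logdet) / y.numel()

print(avg_nll(x, 3.0))             # reference level
print(avg_nll(s * x, 3.0 * s))     # shifted by log(s) ~= -2.3, purely from the Jacobian term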
Also, I find it quite strange that the grad norm keeps increasing on both the Blue and Gray curves. The only thing they have in common is non-default mel-spectrogram STFT parameters.
I am wondering what the loss values look like. Could you share some pictures of the loss during training?