seekerzz opened this issue 3 years ago
Hi @seekerzz, `t` is always 1 in our setting.
Thanks for your reply!
Have you tried the multi-speaker situation? I used the code for LibriTTS training. However, the performance is bad and the KL is high (at the 10^3 level). I also added the initialization of `mu` and `logvar` from the flowseq repo (to make them output around 0), but this doesn't help.
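Concretely, what I mean by that initialization is something like the following minimal sketch (layer names and dimensions are placeholders of my own, not the repo's or flowseq's actual code):

```python
import torch.nn as nn

# Zero-initialize the final projection layers so that mu and logvar are ~0
# at the start of training, i.e. the posterior starts close to a standard normal.
def zero_init_head(linear: nn.Linear) -> nn.Linear:
    nn.init.zeros_(linear.weight)
    nn.init.zeros_(linear.bias)
    return linear

mu_projection = zero_init_head(nn.Linear(256, 128))      # hidden_dim -> latent_dim
logvar_projection = zero_init_head(nn.Linear(256, 128))
```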
I tried to first train the posterior (using only the mel loss) and then the prior (using only the KL loss), but it still won't converge. I also checked whether the posterior P(Z|X,Y) and the decoder P(Y|Z,X) simply discard the information in X (behaving like an encoder-decoder of Y alone), but the decoder alignment shows that the information from X is used.
Thus, this makes me wonder why the prior fails to learn from the posterior.
By the way, here is my training curve; note that I did not train the length predictor (I just used the ground-truth length).
Can you share the synthesized samples? And where did you apply the speaker information, e.g., speaker embedding?
Thanks for the quick reply!😁 I add the speaker embedding to the text embedding (since I think Z can be viewed as a style mapping from text X to mel Y, so adding the speaker information to X seems more intuitive). However, the synthesized samples are still very bad after about 40 epochs on LibriTTS, for example the predicted and the ground-truth mels. However, if I only train the posterior, the predicted mel is quite OK.
I read another flow-based TTS, Glow-TTS, and found that they condition Z on the speaker information. Maybe I should try their conditioning method.🤔
Thanks for sharing. So if I understand correctly, you add the speaker embedding to the text embedding right after the text encoder, so that both the posterior and the prior encoder take the speaker-dependent hidden representation X, am I right? If so, isn't that different from Glow-TTS's conditioning method as they explain it?
> To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension. The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning.

I quoted this from Section 4 of the Glow-TTS paper.
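To make the difference concrete, a toy sketch of that kind of global conditioning inside an affine coupling layer could look like the following (an illustration with assumed dimensions and a plain MLP, not Glow-TTS's actual WaveNet-based coupling block):

```python
import torch
import torch.nn as nn

# The speaker embedding enters the coupling layer as a global condition,
# broadcast over time, and is used when predicting the affine parameters.
class SpeakerConditionedCoupling(nn.Module):
    def __init__(self, channels: int, spk_dim: int, hidden: int = 192):
        super().__init__()
        self.half = channels // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + spk_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * self.half),  # per-channel scale and shift
        )

    def forward(self, z: torch.Tensor, spk: torch.Tensor):
        # z: (batch, time, channels), spk: (batch, spk_dim)
        z_a, z_b = z[..., :self.half], z[..., self.half:]
        cond = torch.cat([z_a, spk.unsqueeze(1).expand(-1, z.size(1), -1)], dim=-1)
        log_s, b = self.net(cond).chunk(2, dim=-1)
        z_b = z_b * torch.exp(log_s) + b      # affine transform of the other half
        logdet = log_s.sum(dim=(1, 2))        # contribution to the flow's log-det
        return torch.cat([z_a, z_b], dim=-1), logdet
```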
Yes! I am going to try their conditioning method. If it succeeds, I will share the result.😊
Ah, I see. I think it should work if you adopt the same approach. Looking forward to seeing it!
@seekerzz hey, have you made any progress?
Hello! I just found what might be a mistake in the code!
In `VAENAR.py`:
But in `posterior.py`:
I'm trying to train the multi-speaker version again to see the results.😁
(Curious about why LJSpeech still works, haha)
Great! Hope to get a clear sample soon. That's intended, since we are not interested in the alignment from the posterior, so you should get no error from it when you use the same code for the multi-speaker setting.
Hello, I mean the positions of `mu` and `logvar` are swapped.
Ah, sorry for the misunderstanding. Yes, you're right, it should be switched. But the reason it still works is that the two heads are identical, just wrongly named (reversed). In other words, `mu_projection` in the current implementation predicts the log-variance, and `logvar_projection` predicts the mean. I will retrain the model with this fix when I have capacity for it. Thanks for the report!
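For anyone reading along, a minimal sketch of what the mismatch amounts to (a placeholder module, not the repository's actual `posterior.py`):

```python
import torch
import torch.nn as nn

# The two projection heads have identical architecture, so training still
# works even if the returned tensors are reversed relative to how the caller
# unpacks (mu, logvar); only the names become misleading.
class PosteriorHead(nn.Module):
    def __init__(self, hidden_dim: int = 256, latent_dim: int = 128):
        super().__init__()
        self.mu_projection = nn.Linear(hidden_dim, latent_dim)
        self.logvar_projection = nn.Linear(hidden_dim, latent_dim)

    def forward(self, h: torch.Tensor):
        # Correct order: return (mu, logvar). The reported bug is equivalent
        # to returning these two tensors reversed.
        return self.mu_projection(h), self.logvar_projection(h)
```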
Thanks for your reply! I understand now that the two simply have each other's variable names!
My main problem with multi-speaker training is that the prior cannot converge. The posterior and decoder can be trained easily within about 20 epochs; although the decoder attention looks a little noisy, it is correct.
So I decided to train the prior only (with the posterior and decoder frozen). However, the prior does not converge to the Z learned by the posterior (the KL divergence stays at the 2×10^3 level).
The predicted logvar from the posterior is very small compared to the single-speaker (LJSpeech) case, and the samples (I mean `samples, eps = self.posterior.reparameterize(mu, logvar, self.n_sample)`) are nearly equal to `mu`. Thus the log-probs are very high (they even become positive, compared to the negative values for LJSpeech).
I don't know whether this can be a problem for the flow-based model.🤔
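To illustrate what I mean by the log-probs becoming positive: with a very small predicted variance, the Gaussian log-density near the mean is large and positive. A standalone toy check (independent of the repo's code):

```python
import torch
from torch.distributions import Normal

# When the posterior's std is very small, log N(x; mu, sigma) evaluated at
# x ≈ mu is -0.5*log(2*pi*sigma^2), which turns positive once
# sigma^2 < 1/(2*pi) ≈ 0.159.
mu = torch.zeros(1)
for std in (1.0, 0.3, 0.01):
    print(std, Normal(mu, torch.full((1,), std)).log_prob(mu).item())
# std=1.0  -> ≈ -0.919  (negative)
# std=0.3  -> ≈  0.285  (positive)
# std=0.01 -> ≈  3.686  (strongly positive)
```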
@seekerzz Could you share any synthesized samples?
Hi, I have met the same problem when I added a VQ encoder after the posterior and prior encoders. The KL was at the 1e+4 level and wouldn't converge. Did you finish the job?
Hello, thanks for sharing the PyTorch-based code! However, I have a question about the `_initial_sample` function in `model/prior.py`. `epsilon` is sampled from N(0, t) (t is the temperature), so how is its log-prob calculated? For a normal distribution with mean 0, the density is $p(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$, so after taking the log, $\log p(\epsilon) = -\frac{1}{2}\log(2\pi) - \log\sigma - \frac{\epsilon^2}{2\sigma^2}$. Can you explain why $\sigma$ is used as 1 instead of $t$ here?
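To make the question concrete, here is a standalone toy snippet of what I mean (my own illustration, not the repo's `_initial_sample`):

```python
import math
import torch

# epsilon is drawn with standard deviation t (the temperature), but its
# log-prob appears to be evaluated under a standard normal (sigma = 1).
t = 0.8
eps = torch.randn(4, 8) * t

logprob_sigma_1 = -0.5 * (math.log(2 * math.pi) + eps ** 2)                      # sigma = 1
logprob_sigma_t = -0.5 * (math.log(2 * math.pi) + (eps / t) ** 2) - math.log(t)  # sigma = t

# The two only agree when t == 1, which matches the answer above:
# "t is always 1 in our setting".
```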