keonlee9420 / VAENAR-TTS

PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

For model/prior.py `_initial_sample`, why is the prob calculated as if from N(0, 1)? #2

Open · seekerzz opened this issue 3 years ago

seekerzz commented 3 years ago

Hello, thanks for sharing the PyTorch-based code! However, I have a question about the `_initial_sample` function in `model/prior.py`. `epsilon` is sampled from $N(0, t)$ (where $t$ is the temperature), so how should its logprob be calculated? For a normal distribution, $p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma^2}\right)$, so after taking the log (with mean 0), $\log p(x) = -\frac{1}{2}\log(2\pi) - \log\sigma - \frac{x^2}{2\sigma^2}$. Can you explain why $\sigma$ is taken as 1 instead of $t$ here?
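
For reference, a minimal sketch of the pattern the question describes, with assumed function name and shapes (not the repo's exact code): `epsilon` is scaled by the temperature, while the log-density is evaluated with $\sigma = 1$.

```python
import math
import torch

def initial_sample(shape, temperature=1.0):
    """Sketch of the questioned pattern (assumed names, not the repo's exact code)."""
    # epsilon is drawn from N(0, t^2 * I) by scaling a standard normal sample...
    epsilon = torch.randn(shape) * temperature
    # ...but its log-density is evaluated under a standard normal, i.e. sigma = 1:
    # log N(eps; 0, 1) = -0.5 * (log(2*pi) + eps^2), summed over the last dim
    logprob = -0.5 * (math.log(2.0 * math.pi) + epsilon ** 2)
    return epsilon, logprob.sum(dim=-1)
```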

keonlee9420 commented 3 years ago

Hi @seekerzz, t is always 1 in our setting.

seekerzz commented 3 years ago

Thanks for your reply! Have you tried the multi-speaker setting? I used the code for LibriTTS training, but the performance is bad and the KL stays high (at the 10^3 level). I also added the initialization of mu and logvar from the flowseq repo (to make them output values around 0), but this didn't help. I tried training the posterior first (using only the mel loss) and then the prior (using only the KL loss), but it still didn't converge; that loss split is sketched below. I also checked whether the posterior P(Z|X,Y) and the decoder P(Y|Z,X) simply discard the information of X (acting like an encoder-decoder of Y alone), but the decoder alignment shows that the information of X is used. This leaves me wondering why the prior fails to learn from the posterior.
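
A minimal sketch of that two-stage loss split, with hypothetical names for the predicted/target mels and the log-probabilities (not the repo's exact training code):

```python
import torch.nn.functional as F

def vae_losses(mel_pred, mel_target, posterior_logprob, prior_logprob, kl_weight=1.0):
    """Hypothetical sketch (names assumed): the ELBO separates into a
    reconstruction (mel) term from the decoder p(Y|Z,X) and a KL term
    between q(Z|X,Y) and p(Z|X). "Train the posterior first" then means
    optimizing the mel loss alone; "then the prior" means the KL alone."""
    mel_loss = F.l1_loss(mel_pred, mel_target)
    # Monte Carlo estimate of KL(q || p) from log-probs of posterior samples.
    kl_loss = (posterior_logprob - prior_logprob).mean()
    return mel_loss + kl_weight * kl_loss, mel_loss, kl_loss
```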

seekerzz commented 3 years ago

By the way, this is my training curve: [training-curve image]. I did not train the length predictor (I just used the ground-truth length).

keonlee9420 commented 3 years ago

Can you share the synthesized samples? And where did you apply the speaker information, e.g., speaker embedding?

seekerzz commented 3 years ago

Thanks for the quick reply! 😁 I added the speaker embedding to the text embedding (since I think Z can be viewed as a style mapping from text X to mel Y, adding the speaker information to X seemed more intuitive). However, the synthesized samples are still very bad after about 40 epochs on LibriTTS. For example, the predicted and the ground-truth mels: [predicted mel image] [ground-truth mel image]. However, if I train only the posterior, the predicted mel is quite OK: [image]

I read another flow-based TTS, Glow-TTS, and found that they condition the speaker information on Z. Maybe I should try their merging method; a sketch of the conditioning I used is below. 🤔
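
For clarity, a minimal sketch of the conditioning described above, with assumed module and parameter names (not the actual training code): the speaker embedding is added to the text-encoder output, so both q(Z|X,Y) and p(Z|X) see speaker-dependent representations X.

```python
import torch
import torch.nn as nn

class SpeakerToText(nn.Module):
    """Hypothetical sketch (names assumed): add a learned speaker embedding
    to the text-encoder output X before it reaches the prior and posterior."""
    def __init__(self, n_speakers, d_model):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, text_hidden, speaker_id):
        # text_hidden: (B, T_text, d_model); speaker_id: (B,)
        # Broadcast the speaker embedding over the text time axis.
        return text_hidden + self.spk_emb(speaker_id).unsqueeze(1)
```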

keonlee9420 commented 3 years ago

Thanks for sharing. So if I understood correctly, you add the speaker embedding to the text embedding right after the text encoder, so that both the posterior and prior encoders take the speaker-dependent hidden representations X, am I right? If so, is it different from Glow-TTS' conditioning method as they explained?

> To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension. The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning.

I quoted it from Section 4 of the Glow-TTS paper.
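
For contrast, a rough sketch of that Glow-TTS-style global conditioning, with assumed names and shapes (not Glow-TTS' actual code): the speaker embedding g enters every affine coupling layer as a bias, rather than only at the encoder input.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Hypothetical sketch of coupling-layer global conditioning
    (names and shapes assumed, not Glow-TTS' exact code)."""
    def __init__(self, channels, spk_dim):
        super().__init__()
        assert channels % 2 == 0
        # Maps the first half of z to scale/bias for the second half.
        self.net = nn.Conv1d(channels // 2, channels, kernel_size=3, padding=1)
        # Projects the speaker embedding g into the coupling layer.
        self.cond = nn.Linear(spk_dim, channels)

    def forward(self, z, g):
        # z: (B, channels, T), g: (B, spk_dim)
        z0, z1 = z.chunk(2, dim=1)
        # Add g as a global (time-invariant) conditioning bias.
        h = self.net(z0) + self.cond(g).unsqueeze(-1)
        log_s, b = h.chunk(2, dim=1)
        z1 = z1 * torch.exp(log_s) + b  # affine transform of the second half
        return torch.cat([z0, z1], dim=1)
```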

seekerzz commented 3 years ago

Yes! I am going to try their conditioning method. If it succeeds, I will share the results. 😊

keonlee9420 commented 3 years ago

Ah, I see. I think it should work if you adopt the same approach. Looking forward to seeing it!

keonlee9420 commented 3 years ago

@seekerzz hey, have you made any progress?

seekerzz commented 3 years ago

Hello! I just found what might be a mistake in the code! In VAENAR.py: [image] But in posterior.py: [image] I'm trying to train the multi-speaker version again to see the results. 😁 (Curious why LJSpeech still works, haha)

keonlee9420 commented 3 years ago

Great! I hope to see clear samples soon. That's intended, since we are not interested in the alignment from the posterior, so you should get no error from it when you use the same code in the multi-speaker setting.

seekerzz commented 3 years ago

Hello, I mean that the positions of mu and logvar are swapped.

keonlee9420 commented 3 years ago

Ah, sorry for the misunderstanding. Yes, you're right, they should be switched. But the reason it's still working is that they are the same modules, just wrongly named (reversed): mu_projection in the current implementation predicts logvar, and logvar_projection predicts mu. I will retrain the model with this fix when I have room for it. Thanks for the report! A sketch of why the swap is harmless is below.
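
A minimal sketch of why the swapped names are harmless, with assumed module names (not the repo's exact code): both projections are unconstrained linear layers, so each simply learns whichever role its output is used for.

```python
import torch
import torch.nn as nn

class PosteriorHead(nn.Module):
    """Hypothetical sketch of the naming mix-up (assumed names)."""
    def __init__(self, hidden_dim, latent_dim):
        super().__init__()
        self.mu_projection = nn.Linear(hidden_dim, latent_dim)
        self.logvar_projection = nn.Linear(hidden_dim, latent_dim)

    def forward(self, h):
        # If the caller unpacks these in the opposite order, e.g.
        #   logvar, mu = head(h)   # caller expects (mu, logvar)
        # then mu_projection is effectively trained to predict logvar and
        # vice versa. Both are free linear layers, so training still
        # converges; only the variable names are reversed.
        return self.mu_projection(h), self.logvar_projection(h)

def reparameterize(mu, logvar):
    # z = mu + eps * sigma, with sigma = exp(0.5 * logvar)
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * logvar)
```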

seekerzz commented 3 years ago

Thanks for your reply! I understand now that the two can simply take each other's variable names!

My main problem with multi-speaker training is that the prior cannot converge. The posterior and decoder can be trained easily within about 20 epochs: [image] Although the decoder attention seems a little noisy, it is correct: [image]

So I decided to train the prior only (with the posterior and decoder frozen). However, the prior cannot converge to the Z learned by the posterior (the KL divergence stays at the 2×10^3 level). The logvar predicted by the posterior is very small compared to the single-speaker (LJSpeech) case, so the samples (I mean `samples, eps = self.posterior.reparameterize(mu, logvar, self.n_sample)`) are nearly equal to mu. Thus the logprobs are very high (they even become positive, compared to the negative values for LJSpeech): [image] I don't know whether this can be a problem for the flow-based model. 🤔
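
A small numeric illustration of that effect (plain math, not repo code): as the posterior's logvar shrinks, the per-dimension log-density at samples near mu grows like −log σ and turns positive.

```python
import math

def gaussian_logprob_at_mean(sigma):
    # Per-dimension log N(x = mu; mu, sigma^2) = -0.5 * log(2*pi) - log(sigma)
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma)

print(gaussian_logprob_at_mean(1.0))   # ~ -0.92: negative, as in the LJSpeech case
print(gaussian_logprob_at_mean(0.01))  # ~ +3.69: tiny logvar pushes the logprob positive
```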

wizardk commented 2 years ago

@seekerzz Could you share any synthesized samples?

whh07141 commented 2 years ago

Hi, I have met the same problem when I added a VQ encoder after the posterior and prior encoders. The KL was at the 1e+4 level and didn't converge. Did you finish the job?
