jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

Questions about Loss_dur #13

Open Liujingxiu23 opened 3 years ago

Liujingxiu23 commented 3 years ago

As in the paper, Loss_dur is the negative variational lower bound of Equation 7:

[image: Equation 7 from the paper]

Loss_dur = -log(p/q) = logq - logp

Is logq(x) computed as follows:

[image: the logq(x) formula]

In the code:

[image: code screenshot]

Why, in the computation of logq, is it `- logdet_tot_q` and not `+ logdet_tot_q`?
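Since the code screenshot doesn't survive in text form, here is a self-contained toy sketch of the shape of the computation being asked about (`ToyAffineFlow` is hypothetical, standing in for the duration predictor's posterior flows; this is not the repository's code):

```python
import math
import torch

# Toy stand-in for one posterior flow: z -> a*z + b, which contributes
# log|a| per masked element to the accumulated log-determinant.
class ToyAffineFlow:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def __call__(self, z, mask):
        logdet = torch.sum(torch.full_like(z, math.log(abs(self.a))) * mask, [1, 2])
        return (self.a * z + self.b) * mask, logdet

e_q = torch.randn(2, 2, 5)   # noise: [batch, 2 channels (u, v), time]
mask = torch.ones(2, 1, 5)   # all frames valid in this toy
flows = [ToyAffineFlow(2.0, 0.1), ToyAffineFlow(0.5, -0.3)]

logdet_tot_q = 0.0
z_q = e_q
for flow in flows:
    z_q, logdet_q = flow(z_q, mask)
    logdet_tot_q = logdet_tot_q + logdet_q

# The line in question: Gaussian log-density of e_q MINUS logdet_tot_q.
logq = torch.sum(-0.5 * (math.log(2 * math.pi) + e_q**2) * mask, [1, 2]) - logdet_tot_q
print(logq.shape)  # one value per batch item
```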

nartes commented 3 years ago

@Liujingxiu23 did you manage to train the model from scratch? What value do you get for the dur loss in your logs? I get a negative value pretty soon, and synthesis does indeed have duration prediction issues.

Liujingxiu23 commented 3 years ago

@nartes Yes, I can train from scratch successfully, and the synthesized wavs are good. Though I do not understand the loss computation, the synthesized wavs show that the duration model is right.

You can take the duration loss plot in https://github.com/jaywalnut310/vits/issues/14 as a reference. At convergence, the duration loss is about 1.1.

nartes commented 3 years ago

@Liujingxiu23 did you train the ljs or the vctk setup? I'm trying the vctk one at the moment. I've turned fp16_run off (did you use it in your case?). What batch size did you use?

nartes commented 3 years ago

[tensorboard screenshot] Yeah, from around 4k steps it goes negative.

Liujingxiu23 commented 3 years ago

@nartes I did not train LJ or VCTK; I trained on a Chinese dataset. I only revised the code related to language and used the default model config, with `"fp16_run": true` as default.

nartes commented 3 years ago

Okay, what segment_size and batch_size? What GPU VRAM size do you use? I have 8GiB, and by default this model is too big. Also, I get some runtime errors from pytorch with fp16; that's why I turned it off.

nartes commented 3 years ago

Tried another config; at 4k steps, dur is not negative. Will train for 50k steps to see further. [spectrogram image] This is a spectrogram of both models saying "good evening"; scratch.ogg from the current model sounds a little bit like that. Will see the result in 8 hours from now.

nartes commented 3 years ago

[tensorboard screenshot] Here is the current tensorboard.

Liujingxiu23 commented 3 years ago

I changed only the sampling rate to 16k and left all other parameters at their defaults: `"sampling_rate": 16000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "n_mel_channels": 80`.

Training vits really consumes resources and is somewhat time-consuming. I use 2 V100 (32G) GPUs and a batch size of 64.

Others mentioned in other issues that the synthesized wavs achieve good quality at 180k steps (G_180000.pth). My tests on Chinese match this statement. But even at the very beginning of training, for example at G_10000.pth, you can synthesize the test wavs to see whether the model is training normally. At G_10000.pth, though the synthesized wavs are not very good, they are totally understandable, and the duration is right.

Liujingxiu23 commented 3 years ago

@nartes I suggest you keep the model config and training config at their defaults and check whether anything is wrong in the data preprocessing. My duration loss:

[duration loss plot]
nartes commented 3 years ago

@Liujingxiu23 well, just checked the current result. [tensorboard screenshot]

The dur loss still goes negative at some points. I tried listening to the audio: complete gibberish. [spectrogram screenshot]

I don't have two 32GiB GPUs, only an 8GiB one, so I will try to enable fp16; maybe it is important. I'm also thinking about switching to an 8kHz sampling rate at this stage, which should allow a bigger batch size. By default, the model runs only with batch_size==1 for me, which makes little sense for SGD because of the high gradient variance.

nartes commented 3 years ago

Started an 8kHz ljs_nosdp run, batch_size==1, fp16 == false. [tensorboard screenshot]

[screenshot] Will check the result in 8 hours from now.

nartes commented 3 years ago

[tensorboard screenshots]

Looks a bit better, yet the generated sound is nonsense. Maybe the FFT parameters need fine-tuning as well, since they may be too coarse for an 8192 Hz sample rate.

Will stop for now on this model. Looks like a waste of time.

Liujingxiu23 commented 3 years ago

I guess batch_size=1 may raise some problems. Given your limited GPU resources, maybe you can change the dataloader code. The current dataloader loads the whole wav, but training only uses a "segment". You could instead sample the segment in the dataloader and push only that segment (see the sketch at the end of this comment). I tried this approach in other E2E TTS training; on one 16G GPU, batch_size=64 can be used.

Why set the sample rate to 8192? Is 8192 your target sample rate?

What is your "upsample_initial_channel"? Changing it to 256 or 128 may make the model smaller.
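A minimal sketch of the segment-sampling idea (class and variable names here are hypothetical, with a `torchaudio` loader assumed; this is not the repository's dataloader):

```python
import random
import torch
import torchaudio
from torch.utils.data import Dataset

# Crop a random segment inside __getitem__ so each batch item is only
# segment_size samples instead of a full-length wav.
class SegmentAudioDataset(Dataset):
    def __init__(self, wav_paths, segment_size=8192):
        self.wav_paths = wav_paths
        self.segment_size = segment_size

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        wav, _sr = torchaudio.load(self.wav_paths[idx])
        wav = wav[0]  # mono
        if wav.size(0) > self.segment_size:
            # Random crop: only this segment is pushed to the GPU.
            start = random.randint(0, wav.size(0) - self.segment_size)
            wav = wav[start:start + self.segment_size]
        else:
            # Pad short clips up to segment_size.
            wav = torch.nn.functional.pad(wav, (0, self.segment_size - wav.size(0)))
        return wav
```

Note that in VITS itself the crop would also have to stay aligned with the text/mel features, so this is only the audio half of such a change.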

p0p4k commented 1 year ago

> As in the paper, Loss_dur is the negative variational lower bound of Equation 7.
>
> Loss_dur = -log(p/q) = logq - logp
>
> Why, in the computation of logq, is it `- logdet_tot_q` and not `+ logdet_tot_q`?

Hi @Liujingxiu23, I understand why there is no "+" there. Do you want me to explain it?

Subuday commented 10 months ago

Hi @p0p4k, yeah, it would be nice if you explained it. Honestly, I don't clearly understand the formulas behind logq and nll in general.

Liujingxiu23 commented 6 months ago

@p0p4k though a long time has passed, would you please explain it? Thanks a lot.

BorisovMaksim commented 3 months ago

@Liujingxiu23 @Subuday I think this minus is due to the change of variables in the normalizing flow.

General case

Let $y = f_{\theta}(z)$, where $f_{\theta}$ is bijective and differentiable.

We assume that $p_{\theta}(y \mid c) = N(y; \mu_{\theta}(c), \sigma_{\theta}(c))$.

Then, by the change-of-variables rule:

$$p_{\theta}(z \mid c) = p_{\theta}(y \mid c) \left| \det \frac{\partial y}{\partial z} \right| = N(y; \mu_{\theta}(c), \sigma_{\theta}(c)) \left| \det \frac{\partial y}{\partial z} \right|$$

Typically, $f_{\theta} = f_{k} \circ f_{k-1} \circ \cdots \circ f_{1}$, where each $f_{i}$ is bijective and differentiable. In this model we have $k = 4$ flows. Taking logs and applying the rule to each step:

$$\log p_{\theta}(z \mid c) = \log p_{\theta}(y \mid c) + \sum_{i=1}^{k} \log \left| \det \frac{\partial y_{i}}{\partial z_{i}} \right|$$

Current case

Since our flow transforms the noise $e_q$ into $[u, v]$, we have

$$\log q_{\theta}(e_q \mid d, c) = \log q_{\theta}(u, v \mid d, c) + \sum_{i=1}^{k} \log \left| \det \frac{\partial (u, v)_{i}}{\partial (e_q)_{i}} \right|$$

$$\log q_{\theta}(u, v \mid d, c) = \log q_{\theta}(e_q \mid d, c) - \sum_{i=1}^{k} \log \left| \det \frac{\partial (u, v)_{i}}{\partial (e_q)_{i}} \right|$$

Hence the minus sign: the code computes $\log q_{\theta}(u, v \mid d, c)$ from the Gaussian density of the sampled noise $e_q$, so the accumulated log-determinant `logdet_tot_q` is subtracted.
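To make the sign concrete, here is a small numeric check using a toy affine flow (not the model's flows): if $y = a e + b$ with $e \sim N(0, 1)$, then $\log q(y) = \log N(e; 0, 1) - \log|a|$, which matches the analytic density of $y \sim N(b, |a|)$.

```python
import math
import torch

# Toy check: an affine map y = a*e + b is a bijective, differentiable flow
# with log|det dy/de| = log|a|, so log q(y) = log N(e; 0, 1) - log|a|.
torch.manual_seed(0)
a, b = 1.7, -0.4
e = torch.randn(1000)   # noise, playing the role of e_q
y = a * e + b           # flow output, playing the role of [u, v]

# Flow-based density: Gaussian log-density of the noise MINUS the logdet.
logq_flow = -0.5 * (math.log(2 * math.pi) + e**2) - math.log(abs(a))

# Analytic density of y ~ N(b, |a|) for comparison.
logq_true = torch.distributions.Normal(b, abs(a)).log_prob(y)

print(torch.allclose(logq_flow, logq_true, atol=1e-5))  # True
```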