Liujingxiu23 opened this issue 3 years ago
@Liujingxiu23 did you manage to train the model from scratch? What is the value of dur loss in your logs? I get a negative value pretty soon. Synthesis does have duration prediction issues indeed.
@nartes Yes, I can train from scratch successfully, and the synthesized wavs are good. Though I do not fully understand the loss computation, the synthesized wavs show that the duration model is right.
You can take the duration loss picture in https://github.com/jaywalnut310/vits/issues/14 as a reference. At convergence, the duration loss is about 1.1.
@Liujingxiu23 did you train the ljs or vctk setup? I'm trying the vctk one at the moment. I've turned fp16_run off (did you use it in your case?), and what batch size did you use?
Yeah, it goes negative at around 4k steps.
@nartes I did not train LJ or VCTK, I trained on a Chinese dataset. I only revised the code related to language and used the default model config, with "fp16_run": true as default.
Okay, what segment_size and batch_size? What GPU VRAM size do you use? I have 8GiB and by default this model is too big. Also, I get some runtime errors from pytorch with fp16, that's why I turned it off.
Tried another config, 4k steps, dur loss is not negative. Will train for 50k steps to see further. This is a spectrogram of both models saying "good evening"; scratch.ogg from the current model sounds a little bit like that. Will see the result in 8 hours from now.
Here is the current tensorboard.
I changed only the sampling rate to 16k and left all other parameters as default: "sampling_rate": 16000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "n_mel_channels": 80.
Training VITS really consumes resources and is a little time consuming. I use 2 V100 GPUs (32G) and a batch size of 64.
Others mentioned in other issues that the synthesized wavs achieve good quality at 180k steps (G_180000.pth). My tests on Chinese match this statement. But at the very beginning of training, for example at G_10000.pth, you can synthesize test wavs to check whether the model is training normally (a rough inference sketch is below). At G_10000.pth, though the synthesized wavs are not very good, they are totally understandable, and the duration is right.
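For a quick check like that, something along these lines should work; this is paraphrased from memory of the repo's inference.ipynb, so the exact config/checkpoint paths and argument values are assumptions:

```python
import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

# Load hyperparameters and build the generator (paths are placeholders)
hps = utils.get_hparams_from_file("./configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("./logs/ljs_base/G_10000.pth", net_g, None)

# Convert text to a symbol sequence, interspersing blanks if the config asks for it
text_norm = text_to_sequence("good evening", hps.data.text_cleaners)
if hps.data.add_blank:
    text_norm = commons.intersperse(text_norm, 0)
x = torch.LongTensor(text_norm).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0].cpu().numpy()
```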
@nartes I guess you may keep the model config and training config as default and check whether anything is wrong in data preprocessing. My duration loss:
@Liujingxiu23 well, just checked a current result.
Still, dur loss gets negative at some points. Tried to listen to the audio: complete gibberish.
I don't have 2x32GiB GPUs, only an 8GiB one. So I will try to enable fp16, maybe it is important. Also, I am thinking about switching to an 8kHz sampling rate at this stage, which should allow a bigger batch size. By default the model runs only with batch_size==1 for me, which makes little sense for SGD because of the high variance.
Started 8KHz, ljs_nosdp, batch_size==1, fp16 == false
Will check the result in 8hours from now.
Looks a bit better, yet the generated sound is nonsense. Maybe I need to fine-tune the FFT parameters as well, because they can be too rough for an 8192 Hz sample rate.
Will stop for now on this model. Looks like a waste of time.
I guess batch_size=1 may raise some problems. Given the limited GPU resources, maybe you can change the dataloader code. The current dataloader loads the whole length of the wav, but in training only a "segment" is used. You could instead sample the segment in the dataloader and only push the "segment" (see the sketch at the end of this comment). I tried this approach in other E2E TTS training; on one GPU (16G), batch_size=64 can be used.
Why set sample_rate=8192? Is 8192 your target sample rate?
What is your “ "upsample_initial_channel"?Change it to 256 / 128 may get the model smaller?
As in the paper, Loss_dur is the negative variational lower bound of equation 7:
Loss_dur = -log(p/q) = logq - logp
Is logq(x) computed as follows in the code? Why, in the computation of logq, is it "- logdet_tot_q" and not "+ logdet_tot_q"?
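For reference, the computation being asked about looks roughly like this; it is a paraphrase from memory of the StochasticDurationPredictor in models.py, so the variable names and masking are approximate, and the tensors below are placeholders:

```python
import math
import torch

# Placeholders: in the model, e_q is noise sampled for the posterior, x_mask masks
# padded frames, and logdet_tot_q is accumulated by the posterior flows while
# mapping e_q -> (u, v).
e_q = torch.randn(2, 2, 50)
x_mask = torch.ones(2, 1, 50)
logdet_tot_q = torch.zeros(2)

# log N(e_q; 0, I) summed over masked positions, then corrected by the flow's log-determinant
logq = torch.sum(-0.5 * (math.log(2 * math.pi) + e_q ** 2) * x_mask, [1, 2]) - logdet_tot_q
```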
Hi @Liujingxiu23, I understand why there is no "+" there. Do you want me to explain it?
Hi @p0p4k, yeah, it would be nice if you explained it. Honestly, I don't clearly understand the logq and NLL formulas in general.
@p0p4k though a long time has passed, would you please explain it? Thanks a lot.
@Liujingxiu23 @Subuday I think this minus is due to the change of variables in the normalizing flow.
Let $y = f_{\theta}(z)$, where $f_{\theta}$ is bijective and differentiable.
We assume that $p_{\theta}(y \mid c) = N(y; \mu_{\theta}(c), \sigma_{\theta}(c))$.
Then, by the change-of-variables rule:
$$p_{\theta}(z \mid c) = p_{\theta}(y \mid c) \left| \det \frac{\partial y}{\partial z} \right| = N(y; \mu_{\theta}(c), \sigma_{\theta}(c)) \left| \det \frac{\partial y}{\partial z} \right|$$
Typically, $f_{\theta} = f_{k} \circ f_{k-1} \circ \dots \circ f_{1}$, where the $f_{i}$ are bijective and differentiable. In this model we have $k=4$ flows.
$$\log p_{\theta}(z \mid c) = \log p_{\theta}(y \mid c) + \sum_{i=1}^{k} \log\left| \det \frac{\partial y_{i}}{\partial z_{i}} \right|$$
Since our flow transforms noise $e_q$ to $[u, v]$, we have
$$\log q_{\theta}(e_q \mid d, c) = \log q_{\theta}(u, v \mid d, c) + \sum_{i=1}^{k} \log\left| \det \frac{\partial [u, v]_{i}}{\partial (e_q)_{i}} \right|$$
$$\log q_{\theta}(u, v \mid d, c) = \log q_{\theta}(e_q \mid d, c) - \sum_{i=1}^{k} \log\left| \det \frac{\partial [u, v]_{i}}{\partial (e_q)_{i}} \right|$$
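So the "- logdet_tot_q" in the code is exactly the last line: the density of the noise under the standard normal, minus the log-determinant accumulated by the forward flow. Here is a tiny numeric check of that sign convention using a single affine "flow"; it is only an illustration, not the model's actual flow:

```python
import math
import torch

# One-dimensional affine flow: u = a * e + b, so |det du/de| = |a|
a, b = 2.0, 0.5
e = torch.randn(1)
u = a * e + b
logdet = math.log(abs(a))

# log-density of e under the standard normal
log_q_e = -0.5 * (math.log(2 * math.pi) + e.item() ** 2)

# Change of variables: q_U(u) = q_E(e) / |du/de|  =>  log q_U(u) = log q_E(e) - logdet
log_q_u = log_q_e - logdet

# Check against the pushed-forward density evaluated directly:
# U = a*E + b with E ~ N(0, 1) means U ~ N(b, a^2)
log_q_u_direct = -0.5 * (math.log(2 * math.pi * a ** 2) + (u.item() - b) ** 2 / a ** 2)
assert abs(log_q_u - log_q_u_direct) < 1e-6
```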