p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License

Duration Discriminator problem #68

Closed yiwei0730 closed 11 months ago

yiwei0730 commented 11 months ago

In train_ms.py, lines 408-411, are the last two inputs to the duration discriminator net in reversed order? I saw another vits2 codebase with the order reversed, but here logw_ is the MAS result and logw is the dp result, which I think is correct. Can you help me answer this?

Duration Discriminator

        if net_dur_disc is not None:
            # logw_ comes from MAS (the "real" durations), logw from the duration predictor (the "fake" ones)
            y_dur_hat_r, y_dur_hat_g = net_dur_disc(
                hidden_x.detach(), x_mask.detach(), logw_.detach(), logw.detach()
            )

->

another Duration Discriminator

        if net_dur_disc is not None:
            # here the last two arguments are swapped: logw (predictor output) before logw_ (MAS output)
            y_dur_hat_r, y_dur_hat_g = net_dur_disc(
                hidden_x.detach(), x_mask.detach(), logw.detach(), logw_.detach()
            )
p0p4k commented 11 months ago

logw is fake (coming from the dp); logw_ is 'real' (coming from MAS, during training only). The discriminator takes in (real, fake) and gives real's probability (the probability that the real one is real) and fake's probability (the probability that the fake one is real).
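
For reference, a minimal sketch of how those two output lists would typically feed LSGAN-style losses, assuming y_dur_hat_r / y_dur_hat_g are lists of discriminator outputs; dur_disc_loss and dur_gen_loss are illustrative names, not necessarily the exact helpers used in this repo:

    import torch

    def dur_disc_loss(y_dur_hat_r, y_dur_hat_g):
        # y_dur_hat_r / y_dur_hat_g: discriminator outputs for the real (MAS)
        # and fake (predicted) durations, as described above
        loss = 0.0
        for d_r, d_g in zip(y_dur_hat_r, y_dur_hat_g):
            loss = loss + torch.mean((1.0 - d_r) ** 2)  # real should be judged real
            loss = loss + torch.mean(d_g ** 2)          # fake should be judged fake
        return loss

    def dur_gen_loss(y_dur_hat_g):
        # the duration predictor is rewarded when its fake durations fool the discriminator
        return sum(torch.mean((1.0 - d_g) ** 2) for d_g in y_dur_hat_g)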

yiwei0730 commented 11 months ago

Yes, you are right. I saw the other duration discriminator ordering in the bert-vits2 repo, so it confused me. Thanks for your reply!!

p0p4k commented 11 months ago

They are all copies of my code 🤣 that reversed order is from a mistake I made in the initial commit.

deyituo commented 10 months ago

hh, I used bert-vits2 to train a base model on a Genshin + Star Rail dataset and found that loss_dur_gen goes above 40 and causes the loss to become NaN. But when training on Paimon only, the loss is smaller. I finally found this thread; maybe this is the issue.

yiwei0730 commented 10 months ago

> hh, I used bert-vits2 to train a base model on a Genshin + Star Rail dataset and found that loss_dur_gen goes above 40 and causes the loss to become NaN. But when training on Paimon only, the loss is smaller. I finally found this thread; maybe this is the issue.

Yes, they pass the prediction in as the real one. Their system is great, but it still has some small bugs.

deyituo commented 10 months ago

I find that the duration discriminator loss is still large, em...

deyituo commented 10 months ago

The VITS2 paper says to train the duration discriminator for just 30k steps, en...

JohnHerry commented 10 months ago

> hh, I used bert-vits2 to train a base model on a Genshin + Star Rail dataset and found that loss_dur_gen goes above 40 and causes the loss to become NaN. But when training on Paimon only, the loss is smaller. I finally found this thread; maybe this is the issue.

> Yes, they pass the prediction in as the real one. Their system is great, but it still has some small bugs.

Hi yiwei0730, could you share the other bugs you found in Bert-VITS2? I have run into wrong phoneme pronunciations when using it. I had assumed the BERT features were disturbing the phoneme features so that some phonemes are uttered incorrectly, but now that you point out there are bugs, I am interested in this.

yiwei0730 commented 10 months ago

They use an SDP/DP ratio to train, but I'm not sure that is correct. Being able to choose the duration ratio at inference may be a good idea, though. Also, in 2.1 they still haven't fixed the logw_ problem. Their preprocessing, if you focus on Chinese, may also have some issues. I'm not sure whether there are problems in the model code itself; I haven't read all of it yet.

JohnHerry commented 10 months ago

> They use an SDP/DP ratio to train, but I'm not sure that is correct. Being able to choose the duration ratio at inference may be a good idea, though. Also, in 2.1 they still haven't fixed the logw_ problem. Their preprocessing, if you focus on Chinese, may also have some issues. I'm not sure whether there are problems in the model code itself; I haven't read all of it yet.

I tried v1.0.1 because I do not need multilingual support. Yes, I have the same doubt about training the SDP and DP together by summing them with a ratio; I do not know if that is the reason for the bad pronunciations. It seems they keep adding new features to the project, but do no model optimization or improvement at all. I am now trying to fix the logw_ problem and disable the add_blank config in Bert-VITS2; hopefully that will help.

deyituo commented 10 months ago

@JohnHerry Maybe you should start with at least v2.0.2-fix, v1.x is not good enough in quality

deyituo commented 10 months ago

After I removed the duration discriminator and trained on the multi-speaker / single-speaker corpora, the loss curves look more like other projects' logs.

genshin+starrail: [training loss screenshots]

paimon: [training loss screenshots]

JohnHerry commented 10 months ago

> @JohnHerry Maybe you should start with at least v2.0.2-fix, v1.x is not good enough in quality

When I started, the newest version was v2.0.1. I compared most of the important files: commons.py, losses.py, models.py, modules.py, transforms.py, and there is no significant difference from v1.0.1; the changes are about multilingual support, which I do not care about. So I used v1.0.1 as my base.

JohnHerry commented 10 months ago

> After I removed the duration discriminator and trained on the multi-speaker / single-speaker corpora, the loss curves look more like other projects' logs.

Yes, removing the DD makes the training process more stable, but since it is an important point in the VITS2 paper, I did not disable it. I followed the paper: train the acoustic model for about 700-800K steps first, then enable the DD and train them together. After the DD training, the bad pronunciation problem got better, but it still exists.
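
For illustration, one way to express that two-stage schedule around the snippet quoted at the top of this issue; the wrapper function and dd_start_step are hypothetical, not code from either repo:

    def run_dur_disc(global_step, net_dur_disc, hidden_x, x_mask, logw_, logw,
                     dd_start_step=800_000):
        # skip the duration discriminator until the acoustic model has had
        # its ~700-800K warm-up steps (dd_start_step is a hypothetical knob)
        if net_dur_disc is None or global_step < dd_start_step:
            return None
        return net_dur_disc(
            hidden_x.detach(), x_mask.detach(), logw_.detach(), logw.detach()
        )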

p0p4k commented 10 months ago

@JohnHerry are you using sdp and dur_disc OR deterministic dp and dur_disc?

JohnHerry commented 10 months ago

> @JohnHerry are you using sdp and dur_disc OR deterministic dp and dur_disc?

I was using Bert-VITS2 v1.0.1. It builds the duration prediction as DP = sdp * sdp_ratio + dp * (1 - sdp_ratio), so yes, it trains and infers with those two duration predictors together; strange but useful. sdp_ratio = 1 makes the result more natural, sdp_ratio = 0 makes the result more stable. As we know, the SDP will sometimes generate very bad results.
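
As an aside, a minimal sketch of that mixing as described above; sdp_logw, dp_logw and mix_log_durations are illustrative names, not identifiers from Bert-VITS2:

    import torch

    def mix_log_durations(sdp_logw, dp_logw, sdp_ratio=0.5):
        # sdp_ratio = 1.0 -> pure stochastic predictor (more natural),
        # sdp_ratio = 0.0 -> pure deterministic predictor (more stable)
        return sdp_logw * sdp_ratio + dp_logw * (1.0 - sdp_ratio)

    # usage at inference, following the usual VITS duration path:
    # logw = mix_log_durations(sdp_logw, dp_logw, sdp_ratio=0.2)
    # w = torch.exp(logw) * x_mask * length_scale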

p0p4k commented 10 months ago

I see, interesting approach. Thanks for sharing; I think sdp_ratio could be made into a learnable parameter during training itself.

JohnHerry commented 10 months ago

> I see, interesting approach. Thanks for sharing; I think sdp_ratio could be made into a learnable parameter during training itself.

Oh, wonderful if it can be made learnable, but I think it is not an easy task. In my understanding, good systems like VITS2 and StyleTTS2 use exactly this kind of two-stage training: the first stage trains a roughly usable acoustic model, and the second stage trains a better model on top of it. VITS2 first trains for 800K steps and then learns a better DP with the DD; StyleTTS2 first trains the acoustic model for many epochs and then trains the DDPM-based speech style predictor. As for sdp_ratio, I think that if it is to be trainable, the training should start at a point where the SDP and DP are both roughly usable, so the training pipeline may become more complex.
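
Following up on the learnable-ratio idea, a minimal sketch of an sdp_ratio kept in (0, 1) via a sigmoid over a learnable logit; this module is hypothetical and not part of Bert-VITS2 or this repo:

    import torch
    import torch.nn as nn

    class LearnableDurationMix(nn.Module):
        """Blend SDP and DP log-durations with a single learnable ratio."""

        def __init__(self, init_ratio=0.5):
            super().__init__()
            # store the ratio as a logit so the sigmoid keeps it in (0, 1)
            self.ratio_logit = nn.Parameter(torch.logit(torch.tensor(init_ratio)))

        def forward(self, sdp_logw, dp_logw):
            ratio = torch.sigmoid(self.ratio_logit)
            return sdp_logw * ratio + dp_logw * (1.0 - ratio)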