logw is fake (it comes from the DP); logw_ is "real" (it comes from MAS, and exists only during training).
The discriminator takes in (real, fake) and returns real's prob (the probability that the real one is real) and fake's prob (the probability that the fake one is real).
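For illustration, a minimal PyTorch sketch of the convention described above; the function and tensor names are assumptions, not the exact Bert-VITS2 API:

```python
import torch

def dur_disc_loss(dur_disc, hidden, x_mask, logw_, logw):
    # logw_ = "real" durations from MAS; logw = "fake" durations from the DP.
    prob_real = dur_disc(hidden, x_mask, logw_.detach())
    prob_fake = dur_disc(hidden, x_mask, logw.detach())
    # LSGAN-style objective: push the real score to 1 and the fake score to 0.
    loss_real = torch.mean((1.0 - prob_real) ** 2)
    loss_fake = torch.mean(prob_fake ** 2)
    return loss_real + loss_fake
```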
Yes, you are right. I saw another Duration Discriminator in the bert-vits2 repo, which confused me. Thanks for your reply!!
They are all copies of mine 🤣, from when I made a mistake in the initial commit.
hh, I used bert-vits2 to train a base model on a Genshin+StarRail dataset and found that loss_dur_gen reached 40+, causing the loss to go NaN. But when training on Paimon only, the loss is smaller. Finally I found this thread; maybe this is the issue.
Yes, they feed the prediction in as the real one. Their system is nearly perfect, but it still has small bugs.
I find that the duration discriminator loss is still large, em...
The VITS2 paper says to train it for just 30k steps, en...
Hi, yiwei0730, could you share the other bugs in Bert-VITS2, please? I experienced wrong phoneme pronunciation when using it. I suspected that the BERT features were disturbing the phoneme features, so that some phonemes were uttered incorrectly. But now you point out that there are some bugs, so I am interested in this.
They use an SDP/DP ratio to train, but I'm not sure this is correct. On the other hand, being able to choose the duration ratio at inference may be a good idea. Also, as of 2.1 they still haven't found the logw_ problem. If you focus on Chinese, their preprocessing may also have some errors. I'm not sure whether there are problems in the model code itself; I haven't read all of it yet.
I tried v1.0.1 because I do not need multi-lingual support. Yes, I have the same doubt about training the SDP and DP together by summing them with a ratio; I do not know if that is the reason for those bad pronunciations. It seems they keep adding new features to the project, but with no model optimization or improvement at all. I am now trying to fix the logw_ problem and disable the add_blank config in Bert-VITS2; I hope that will be helpful.
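For context, add_blank in VITS-style repos intersperses a blank token (id 0) between phoneme ids before the text encoder, and disabling it in the config skips this step. A minimal sketch of the helper, mirroring commons.intersperse from VITS:

```python
def intersperse(lst, item):
    # [a, b, c] -> [item, a, item, b, item, c, item]
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

phoneme_ids = [12, 47, 33]
print(intersperse(phoneme_ids, 0))  # [0, 12, 0, 47, 0, 33, 0]
```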
@JohnHerry Maybe you should start with at least v2.0.2-fix; the quality of v1.x is not good enough.
After I removed the duration discriminator and trained on the multi/single-speaker corpora, the losses look more like other projects' logs.
[loss curves: Genshin+StarRail]
[loss curves: Paimon]
When I started, the newest version was v2.0.1. I compared most of the important files: commons.py, losses.py, models.py, modules.py, transforms.py, and there was no important difference from v1.0.1; the changes were about multi-lingual support, which I do not care about. So I used v1.0.1 as my base.
Yes, removing the DD makes the training process more stable, but since it is an important point in the VITS2 paper, I did not disable it. Following the paper, I trained the acoustic model for about 700-800K steps, then enabled the DD and trained them together. After the DD training, the bad pronunciation problem got better, but it still exists.
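A minimal sketch of that two-stage schedule; the step count and function names are illustrative, not the actual Bert-VITS2 training loop:

```python
DD_START_STEP = 800_000  # roughly the 700-800K steps mentioned above

def train_generator_step(batch):
    ...  # acoustic-model losses (reconstruction, KL, GAN, duration)

def train_dur_disc_step(batch):
    ...  # real/fake duration loss, as sketched near the top of the thread

def training_loop(loader):
    for step, batch in enumerate(loader):
        train_generator_step(batch)
        if step >= DD_START_STEP:  # stage 2: enable the duration discriminator
            train_dur_disc_step(batch)
```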
@JohnHerry are you using sdp and dur_disc OR deterministic dp and dur_disc?
I was using Bert-VITS2 v1.0.1. It computes the duration as DP = sdp * sdp_ratio + dp * (1 - sdp_ratio). Yes, it trains and infers with those two DPs together; strange, but useful. sdp_ratio = 1 makes the result more natural, sdp_ratio = 0 makes the result more stable. As we know, the SDP sometimes generates very bad results.
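In code, the blend is just a weighted sum of the two log-duration predictions (a sketch of the v1.x behavior described above; tensor names are illustrative):

```python
def blended_logw(logw_sdp, logw_dp, sdp_ratio=0.5):
    # sdp_ratio = 1.0 -> pure SDP (more natural, occasionally unstable)
    # sdp_ratio = 0.0 -> pure deterministic DP (more stable, flatter prosody)
    return logw_sdp * sdp_ratio + logw_dp * (1.0 - sdp_ratio)
```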
I see, interesting approach. Thanks for sharing. I think sdp_ratio can be made into a learnable parameter during training itself.
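A hedged sketch of that idea: keep an unconstrained scalar parameter and squash it into (0, 1) with a sigmoid so the learned ratio stays valid. This is only an illustration of the suggestion, not code from either repo:

```python
import torch
import torch.nn as nn

class LearnableSdpRatio(nn.Module):
    def __init__(self, init_ratio=0.5):
        super().__init__()
        init = torch.tensor(float(init_ratio))
        # store the logit so that sigmoid(raw) == init_ratio at initialization
        self.raw = nn.Parameter(torch.log(init / (1.0 - init)))

    def forward(self, logw_sdp, logw_dp):
        ratio = torch.sigmoid(self.raw)  # constrained to (0, 1)
        return logw_sdp * ratio + logw_dp * (1.0 - ratio)
```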
Oh, it would be wonderful if it were made learnable, but I think it is not an easy task. In my understanding, good projects like VITS2 and StyleTTS2 use exactly two-stage training: in the first stage they train a roughly usable acoustic model, and in the second stage they train a better model. VITS2 first trains for 800K steps and then learns a better DP with the DD; StyleTTS2 first trains the acoustic model for many epochs and then trains the DDPM-based speech style predictor. As for sdp_ratio, I think that if it were trainable, the training should start at some point where the SDP and DP are already roughly usable, so the training pipeline would become more complex.
In train_ms.py lines 408-411, is the order of the last two inputs to the duration discriminator net reversed? I saw other VITS2 code with the order reversed, but here logw_ is the MAS result and logw is the DP result, which I think is correct. Can you help me answer this?
Duration Discriminator -> another Duration Discriminator