heesuju closed this issue 1 year ago
I think you just need to train longer; once your duration predictor loss is around 0.2 and your mel loss is between 12-15, you can expect good results.
Thank you for your help!
I tried training again with the same dataset after changing the following:
The mel loss did get lower, and the audio quality improved considerably, sounding more natural.
Could you tell me if I'm understanding this correctly?
Again, thank you for your help!!
-- I am trying to implement vits2, and it would be helpful if you could try that model and give me some feedback as well, if possible. Thanks.
Thank you! That certainly cleared a lot of things up for me. I've been wanting to try vits2, especially with all the improvements to multi-speaker training. I will absolutely try it as soon as it's out! Best of luck, and thank you for all your help!
Hello, I'm currently training a Korean multi-speaker model with 5 speakers. The batch size is 50, with 8 hours of data at a sampling rate of 22050 Hz. However, I have no idea how to interpret the following results in TensorBoard.
From what I can understand, vits uses a GAN objective, so it has a generator and a discriminator. If the total discriminator loss goes down, the discriminator is getting better at spotting generated audio, which should push the generator to improve.
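For reference, vits uses a least-squares GAN formulation for its adversarial losses. Below is a minimal, plain-Python sketch of how the two losses pull against each other (the scores are hypothetical stand-ins for real discriminator outputs, not the repository's actual code):

```python
# Least-squares GAN losses in the style used by VITS (sketch):
# the discriminator is trained to push D(real) -> 1 and D(fake) -> 0,
# while the generator is trained to push D(fake) -> 1.

def mean(xs):
    return sum(xs) / len(xs)

def discriminator_loss(d_real, d_fake):
    # Small when the discriminator separates real (-> 1) from fake (-> 0).
    return mean([(r - 1.0) ** 2 for r in d_real]) + mean([f ** 2 for f in d_fake])

def generator_loss(d_fake):
    # Small when the generator fools the discriminator (D(fake) -> 1).
    return mean([(f - 1.0) ** 2 for f in d_fake])

# Hypothetical scores from a confident discriminator: its own loss is tiny
# while the generator's adversarial loss is large -- the kind of divergence
# described above, where the discriminator overpowers the generator.
d_real = [0.9, 0.95]   # scores on real audio (assumed values)
d_fake = [0.1, 0.05]   # scores on generated audio (assumed values)
print(discriminator_loss(d_real, d_fake))  # small
print(generator_loss(d_fake))              # large
```

So a falling discriminator loss alongside a rising generator adversarial loss is exactly what these formulas produce when the discriminator is winning; on its own it doesn't tell you whether the mel reconstruction is improving.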
However, whenever I train a new model, the two losses diverge, with the discriminator overpowering the generator. This did not seem to be a problem in another issue posted here: https://github.com/jaywalnut310/vits/issues/13#issuecomment-904314885
I was hoping there would be a more definite answer for the following questions:
Any help would be appreciated. Thank you.