p0p4k / vits2_pytorch

Unofficial VITS2 TTS implementation in PyTorch
https://arxiv.org/abs/2307.16430
MIT License

Add duration predictor training #10

Closed ductho9799 closed 1 year ago

ductho9799 commented 1 year ago

Hello p0p4k! Your repository is awesome. I trained VITS2 with your code on my private data. I have implemented duration predictor training code. You can test it.

p0p4k commented 1 year ago

I have implemented duration predictor training code. You can test it.

Hi, I will check and review the code ASAP.

I trained VITS2 with your code on my private data.

How are the results? Can you share some samples? No need to share the weights, just wav samples if possible, to see the output quality. Thanks!

ductho9799 commented 1 year ago

I haven't had time to experiment with LJSpeech yet; I have only tested on my private Vietnamese dataset. On that dataset, the results of VITS2 after training the duration predictor are better than those of VITS. Here are some samples created by VITS, VITS2, and a human ground-truth recording:

VITS: vits
VITS2: vits2
Human: gt

p0p4k commented 1 year ago

Thanks for the samples. They do sound good. Can I ask if you transferred VITS-1 weights to VITS-2 or trained VITS-2 from scratch?

ductho9799 commented 1 year ago

I trained VITS-2 from scratch. Here is my config: vits-2-configs.json. I trained it on 4x RTX 3090 (24 GB VRAM).

p0p4k commented 1 year ago

Interesting! Can I add your samples to the README of this repo? I would still advise adding the discriminator and training the model. Also, it would be great if you could turn on the other flags and check for any improvement in the output. Thanks!

ductho9799 commented 1 year ago

Thanks for your suggestions. I'm planning to train VITS-2 on the LJSpeech dataset next week. I will send you the LJSpeech checkpoint and generated samples.

p0p4k commented 1 year ago

Hi, I updated the code with 2 discriminators; please check it if you are interested.
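For context, the duration discriminator described in the VITS2 paper is trained with a least-squares GAN objective: durations obtained from monotonic alignment search (MAS) are treated as real, and the duration predictor's outputs as fake. A minimal sketch of that objective follows; the function names are illustrative, not this repo's exact API:

```python
import torch

def duration_disc_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Discriminator side: push scores for real (MAS) durations toward 1
    # and scores for predicted durations toward 0 (least-squares GAN).
    return torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)

def duration_gen_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Duration-predictor side: make predicted durations look "real".
    return torch.mean((1.0 - d_fake) ** 2)
```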

ductho9799 commented 1 year ago

Thank you so much for adding the new discriminators. I will test and train with them, and I'll share the results with you as soon as possible.

egorsmkv commented 1 year ago

@ductho9799 Hello. What was improved in the speech? I'm curious whether it was just pronunciation or other characteristics of the voice.

ductho9799 commented 1 year ago

@p0p4k @egorsmkv Hello, I trained a version of VITS-2 on the LJSpeech dataset. I have shared the weights, config, and audio samples in VITS-2. Can you help me evaluate the quality of VITS-2 on the LJSpeech dataset?

I trained VITS-2 for 390 epochs and the duration predictor for 200 epochs.

p0p4k commented 1 year ago

@ductho9799 Please change the access permissions on your Drive file. Thanks.

ductho9799 commented 1 year ago

Yes, please try it again.

p0p4k commented 1 year ago

Thanks for sharing the checkpoints! The samples don't sound bad! Can you train the latest code with the duration discriminator and the HiFi-GAN discriminator (multi-period discriminator) with nosdp?
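For anyone reproducing this, here is a minimal sketch of flipping those switches in the JSON config; the key names ("use_sdp", "use_duration_discriminator") and file paths are assumptions based on this discussion, so check the configs/ directory of your checkout for the exact keys:

```python
import json

# Load an existing config (placeholder path).
with open("configs/vits-2-configs.json") as f:
    cfg = json.load(f)

# Assumed keys: "use_sdp" toggles the stochastic duration predictor
# (nosdp = deterministic duration predictor) and
# "use_duration_discriminator" enables adversarial duration training.
cfg["model"]["use_sdp"] = False
cfg["model"]["use_duration_discriminator"] = True

# Save the modified config for a new training run (placeholder path).
with open("configs/vits-2-nosdp-durdisc.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

In VITS-style training the multi-period discriminator is already part of the standard GAN setup, so it may not need a separate flag.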

p0p4k commented 1 year ago

I am booting up a cloud GPU right now to train as well. I want to check whether the duration discriminator is working or not (no NaN or Inf values, etc.).

ductho9799 commented 1 year ago

Thanks for sharing the checkpoints! The samples don't sound bad! Can you train the latest code with the duration discriminator and the HiFi-GAN discriminator (multi-period discriminator) with nosdp?

I can train this config over the weekend.

p0p4k commented 1 year ago

If the training works well, I will share the checkpoints so you can continue training from them; otherwise, I will try to fix the code before the weekend.

p0p4k commented 1 year ago

@ductho9799 The checkpoints are in the README on the main page. Good luck!

kingkong135 commented 11 months ago

I trained VITS-2 from scratch. Here is my config: vits-2-configs.json. I trained it on 4x RTX 3090 (24 GB VRAM).

@ductho9799 Can you share your symbols.py file? I trained on the InfoRe dataset but the results are not good, even though I used the same config as you. :(( The full config, model, and train.log are in drive. Can you give me some advice? Thank you very much.

ngocson1804 commented 6 months ago

@ductho9799 have you tried with an external embedding extractor?

HuuHuy227 commented 6 months ago

@ductho9799 have you tried with an external embedding extractor?

Did you mean Bert-VITS2?

TalapMukhamejan commented 1 month ago

@ductho9799 Can you please share your symbols? While trying to run inference, I am getting this error:

RuntimeError: Error(s) in loading state_dict for SynthesizerTrn: size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([184, 192]) from checkpoint, the shape in current model is torch.Size([178, 192]).
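The mismatch means the checkpoint was trained with 184 text symbols, while the symbols.py in the current checkout defines only 178, so the two text-embedding tables disagree in size. Here is a quick sanity check, assuming the usual VITS-style checkpoint layout with a "model" state dict; the checkpoint filename is a placeholder:

```python
import torch
from text.symbols import symbols  # the symbols.py in your current checkout

ckpt = torch.load("G_latest.pth", map_location="cpu")  # placeholder path
vocab_in_ckpt = ckpt["model"]["enc_p.emb.weight"].shape[0]
print("symbol count in checkpoint:", vocab_in_ckpt)  # 184 for this checkpoint
print("symbol count in checkout:  ", len(symbols))   # 178 here; must match
```

Loading will only succeed once both counts match, i.e. with the exact symbols.py used at training time.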