idiap / coqui-ai-TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
https://coqui-tts.readthedocs.io
Mozilla Public License 2.0

Add VITS 2 model #123

Open Marioando opened 4 weeks ago

Marioando commented 4 weeks ago

Hi, I'm working on adding the VITS2 model to the Coqui framework. While testing the implementation, I found that the model trains well on a single GPU, but as soon as the second step of multi-GPU training starts, all losses are normal (i.e. loss_0 and loss_1) except loss_2, the loss of the duration discriminator layer, which becomes NaN. So here is my question: do you think I need to modify the trainer, or the batch sampler in the model? I have also made some changes to the trainer to filter out null gradients in multi-GPU, but that doesn't work. Here is what I have already tried: decreasing the learning rate for the duration discriminator, adding gradient clipping, and decreasing the batch size to 1 for testing. None of these work on multi-GPU. The model seems to learn well in a single-GPU setup.
Thanks
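
For completeness, one more guard I'm considering is skipping the duration discriminator update whenever its loss comes back non-finite on any rank. A simplified sketch (hypothetical helper names, not code from my branch; the only real name is the loss_dur_disc key from my logs):

import torch
import torch.distributed as dist

def loss_is_finite(loss: torch.Tensor) -> bool:
    """Return True only if the loss is finite on every rank."""
    ok = torch.tensor([float(torch.isfinite(loss).all())], device=loss.device)
    if dist.is_available() and dist.is_initialized():
        # if any rank saw NaN/Inf, the min across ranks drops to 0
        dist.all_reduce(ok, op=dist.ReduceOp.MIN)
    return bool(ok.item())

# in the training loop, before stepping the duration discriminator optimizer:
# if loss_is_finite(loss_dict["loss_dur_disc"]):
#     optimizer_2.step()
# else:
#     optimizer_2.zero_grad(set_to_none=True)  # drop the poisoned step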

Marioando commented 4 weeks ago

--> TIME: 2024-11-01 07:59:12 -- STEP: 199/406 -- GLOBAL_STEP: 100200
 | > loss_disc: 2.293909788131714 (2.353372510354123)
 | > loss_disc_real_0: 0.050190214067697525 (0.09111052809573301)
 | > loss_disc_real_1: 0.22900593280792236 (0.20297033208698484)
 | > loss_disc_real_2: 0.2125558704137802 (0.220549658090625)
 | > loss_disc_real_3: 0.2014939934015274 (0.22777624622960785)
 | > loss_disc_real_4: 0.2580271363258362 (0.22694031534782008)
 | > loss_disc_real_5: 0.23579958081245422 (0.23088655137836034)
 | > loss_0: 2.293909788131714 (2.353372510354123)
 | > grad_norm_0: tensor(38.8719, device='cuda:0') (tensor(168.8013, device='cuda:0'))
 | > loss_gen: 2.4380598068237305 (2.5592831391185973)
 | > loss_kl: 3.0022356510162354 (5.0805860691933145)
 | > loss_feat: 5.34114408493042 (5.2965420885900745)
 | > loss_mel: 20.770143508911133 (21.53628662722793)
 | > loss_duration: 1.849429965019226 (1.862948954404898)
 | > loss_1: 33.4010124206543 (36.335646701218515)
 | > grad_norm_1: tensor(815.8643, device='cuda:0') (tensor(1645.5496, device='cuda:0'))
 | > loss_dur_disc: nan
 | > loss_dur_disc_real_0: nan
 | > amp_scaler: 64.0 (227.05527638190944)
 | > loss_2: nan
 | > grad_norm_2: tensor(0) (tensor(0))
 | > current_lr_0: 0.0002
 | > current_lr_1: 0.0002
 | > current_lr_2: 0.0002
 | > step_time: 1.8149 (1.429261895280387)
 | > loader_time: 0.0206 (0.015218985140623162)

Marioando commented 4 weeks ago
# excerpt from train_step (autocast here is torch.cuda.amp.autocast)
if optimizer_idx == 2:
    # score the ground-truth and the predicted (log-scaled) durations
    output_prob_for_real, output_probs_for_pred = self.dur_disc(
        self.model_outputs_cache["hidden_encoded_text"],
        self.model_outputs_cache["hidden_encoded_text_mask"],
        self.model_outputs_cache["real_durations"],       # log-scaled
        self.model_outputs_cache["predicted_durations"],  # log-scaled
    )

    outputs = {
        "real_durations": self.model_outputs_cache["real_durations"],           # log-scaled
        "predicted_durations": self.model_outputs_cache["predicted_durations"], # log-scaled
    }

    # compute the discriminator loss in full precision, outside AMP
    with autocast(enabled=False):
        loss_dict = criterion[optimizer_idx](
            output_prob_for_real,
            output_probs_for_pred,
        )

    return outputs, loss_dict
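
The criterion for optimizer_idx == 2 is essentially the usual least-squares GAN discriminator loss applied to the duration scores. A simplified sketch (assuming dur_disc returns a list of score tensors for the real and the predicted durations):

import torch

def dur_disc_loss(real_scores, fake_scores):
    # least-squares discriminator loss: real durations should score 1,
    # predicted durations should score 0
    loss = 0.0
    for d_real, d_fake in zip(real_scores, fake_scores):
        d_real, d_fake = d_real.float(), d_fake.float()
        loss = loss + torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)
    return {"loss_dur_disc": loss}
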
eginhard commented 3 weeks ago

Cool, would be happy to add Vits 2! Are you basing it on the initial work from @p0p4k in https://github.com/coqui-ai/TTS/pull/3355?

Impossible to say why something isn't working without seeing any code. But I'm fine with merging something that just works with one GPU for now; it could be improved later. The original Vits had some issues with multi-GPU as well (#103). Are these the same issues?

Marioando commented 3 weeks ago

I'll try to fix it before a PR. I can say that VITS2 is a massive improvement on VITS, at least to my ears; the model seems to be much more robust. In my implementation, a VITS model trained with Coqui can be trained as VITS2 by re-initializing the duration predictor and text encoder at the beginning of training (see the sketch below), which lets me compare the models. I didn't use the prototype from @p0p4k; it was much easier to start from the original VITS in Coqui. I'm currently busy trying to add @p0p4k's pflow implementation, and that is my priority, but I will work on this model as soon as possible. Thank you for your time! I think the Coqui framework makes experimenting with TTS much faster. We appreciate your work maintaining this repo! Thank you.
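
The warm start looks roughly like this (a rough sketch; Vits2 and the submodule names are placeholders for whatever the final PR uses):

import torch

model = Vits2(config)  # placeholder: the new model class and its config
checkpoint = torch.load("vits_checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"], strict=False)  # tolerate new layers

def reinit(module: torch.nn.Module) -> None:
    """Re-initialize every layer that knows how to reset itself."""
    for m in module.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()

reinit(model.duration_predictor)  # the "dp" mentioned above
reinit(model.text_encoder)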

p0p4k commented 3 weeks ago

Thanks for doing this work, guys. If you need any other paper implementation or assistance with porting to Coqui, let me know.

Marioando commented 3 weeks ago

@p0p4k How do we know when to freeze the duration discriminator in VITS2, and also when to remove the noise from MAS?

p0p4k commented 3 weeks ago

@Marioando For the duration discriminator, do you mean freeze it before we start training it, or freeze it after MAS has trained for some time and gives accurate results? The number of steps before removing the noise from MAS is probably something to find experimentally; I'd say 10k steps should be fine.
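
Something like a linear anneal should work, e.g. (a sketch; the initial scale and the 10k horizon are illustrative starting values, not from the paper):

def mas_noise_scale(global_step: int, initial_scale: float = 0.01,
                    decay_steps: int = 10_000) -> float:
    # linear decay of the noise added to the MAS alignment scores
    return max(0.0, initial_scale * (1.0 - global_step / decay_steps))

# inside the alignment search (log_p = alignment log-likelihoods):
# scale = mas_noise_scale(global_step)
# if scale > 0:
#     log_p = log_p + scale * torch.randn_like(log_p)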

Marioando commented 3 weeks ago

@p0p4k I thought the VITS2 paper said they trained the duration disc for 30k steps, but I reread the paper and it was the duration predictor. So we don't need to freeze the duration disc, just freeze the duration predictor after we get good results. Right!?

p0p4k commented 3 weeks ago

Right. I was thinking that initially MAS is still waiting for the text embeddings to get to a reasonable place to give the right ground-truth durations, so we could wait for it to stabilize first and only then begin training the duration discriminator.
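
Both schedules are just step gates in the training loop, e.g. (a sketch; the 30k freeze is the paper's schedule as read above, the warm-up threshold is a guess to tune, and the attribute name is illustrative):

DUR_DISC_START_STEP = 20_000        # begin duration-disc training once MAS is stable (guess)
DUR_PREDICTOR_FREEZE_STEP = 30_000  # stop training the duration predictor (paper)

def should_train_dur_disc(global_step: int) -> bool:
    # skip the duration discriminator until MAS gives reliable durations
    return global_step >= DUR_DISC_START_STEP

def maybe_freeze_duration_predictor(model, global_step: int) -> None:
    if global_step >= DUR_PREDICTOR_FREEZE_STEP:
        for p in model.duration_predictor.parameters():
            p.requires_grad = False
        model.duration_predictor.eval()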

Marioando commented 3 weeks ago

@eginhard I have made a PR for VITS2. Here is some audio from the model: vits2_audio_samples.zip.tar.gz. It's still not perfect, but I think we can improve it. Concerning the multi-GPU training issues: would computing loss_1 together with loss_2 help, i.e. using only two optimizers?

Marioando commented 3 weeks ago

I trained the model using d-vectors.
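
For the record, the speaker conditioning was set through the standard Coqui d-vector options. A sketch, assuming the VITS2 config keeps the same VitsArgs fields as VitsConfig; the file path is a placeholder:

from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig()
config.model_args.use_d_vector_file = True
config.model_args.d_vector_file = ["speakers_d_vectors.json"]  # placeholder path
config.model_args.d_vector_dim = 512  # dimensionality of the d-vectors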