Add VITS 2 model - Githubissues

Marioando commented 4 weeks ago

Hi, I'm working on adding vits2 model to coqui framework, while testing the implementation, I found out that the model does train well on single gpu, but as soon as the second step in multigpu training all loss are normal , i.e loss0 and loss1, exept loss2 which is the loss of the duration discriminator layer( it became nan). So here is my question , do you think I need to modify the trainer or modify the batch sampler in the model. Also I have made some change to the trainer to filter out null gradient in multigpu but doesnt work . Here is what I have already tryed : decrease learning rate for the duration discriminator, add gradient clipping, decrease batch size to 1 for testing none work on multigpu. The model seems to learn well on single gpu setup.
Thanks

Marioando commented 4 weeks ago

--> TIME: 2024-11-01 07:59:12 -- STEP: 199/406 -- GLOBAL_STEP: 100200 | > loss_disc: 2.293909788131714 (2.353372510354123) | > loss_disc_real_0: 0.050190214067697525 (0.09111052809573301) | > loss_disc_real_1: 0.22900593280792236 (0.20297033208698484) | > loss_disc_real_2: 0.2125558704137802 (0.220549658090625) | > loss_disc_real_3: 0.2014939934015274 (0.22777624622960785) | > loss_disc_real_4: 0.2580271363258362 (0.22694031534782008) | > loss_disc_real_5: 0.23579958081245422 (0.23088655137836034) | > loss_0: 2.293909788131714 (2.353372510354123) | > grad_norm_0: tensor(38.8719, device='cuda:0') (tensor(168.8013, device='cuda:0')) | > loss_gen: 2.4380598068237305 (2.5592831391185973) | > loss_kl: 3.0022356510162354 (5.0805860691933145) | > loss_feat: 5.34114408493042 (5.2965420885900745) | > loss_mel: 20.770143508911133 (21.53628662722793) | > loss_duration: 1.849429965019226 (1.862948954404898) | > loss_1: 33.4010124206543 (36.335646701218515) | > grad_norm_1: tensor(815.8643, device='cuda:0') (tensor(1645.5496, device='cuda:0')) | > loss_dur_disc: nan | > loss_dur_disc_real_0: nan | > amp_scaler: 64.0 (227.05527638190944) | > loss_2: nan | > grad_norm_2: tensor(0) (tensor(0)) | > current_lr_0: 0.0002 | > current_lr_1: 0.0002 | > current_lr_2: 0.0002 | > step_time: 1.8149 (1.429261895280387) | > loader_time: 0.0206 (0.015218985140623162)

Marioando commented 4 weeks ago

if optimizer_idx == 2:

            output_prob_for_real, output_probs_for_pred = self.dur_disc(
                self.model_outputs_cache['hidden_encoded_text'],
                self.model_outputs_cache['hidden_encoded_text_mask'],
                self.model_outputs_cache['real_durations'],  # logscaled
                self.model_outputs_cache['predicted_durations'] # logscaled
            )

            outputs = {
                "real_durations": self.model_outputs_cache['real_durations'],  # logscaled
                "predicted_durations": self.model_outputs_cache['predicted_durations']  # logscaled
            }

            with autocast(enabled=False):
                loss_dict = criterion[optimizer_idx](
                    output_prob_for_real,
                    output_probs_for_pred,
                )

            return outputs, loss_dict

eginhard commented 3 weeks ago

Cool, would be happy to add Vits 2! Are you basing it on the initial work from @p0p4k in https://github.com/coqui-ai/TTS/pull/3355?

Impossible to say why something isn't working without seeing any code. But I'm fine with merging something that just works with one GPU for now. It could be improved later. The original Vits had some issues with multi-GPU as well (#103), are these the same issues?