jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License
6.65k stars 1.22k forks

Stochastic duration prediction failed for fastspeech2 #49

Open LEECHOONGHO opened 2 years ago

LEECHOONGHO commented 2 years ago

I applied the stochastic duration predictor to the fastspeech2 model.

The duration loss falls smoothly (from 1.2 to 0.2). [image]

But at inference, the duration predictor does not work at all (noise_scale=0.333). [image]

Does anyone know the cause of this problem? The pseudocode I used is below.

# in variance adaptor
inputs = text_encoder_output + extended_speaker_embedding
sdp_mask = torch.unsqueeze(sequence_mask(text_lens, inputs.shape[-1]), 1).to(inputs.dtype)

if training:
    duration_prediction = self.duration_predictor(
        inputs, sdp_mask, torch.log(attn_hard_dur.float() + 1).unsqueeze(1)
    )
    duration_prediction = duration_prediction / torch.sum(sdp_mask)
else:
    duration_prediction = self.duration_predictor(
        inputs, sdp_mask, reverse=True, noise_scale=0.333
    )
    duration_prediction = duration_prediction.squeeze(1)

duration_rounded = torch.clamp(
    torch.round(torch.exp(duration_prediction) - 1) * d_control,
    min=1,
)

# loss
duration_loss = torch.sum(duration_prediction.float())
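
One thing that may be worth checking in the pseudocode above is whether the log transform ends up applied twice. In the upstream VITS models.py, the stochastic duration predictor is trained on raw durations (w = attn.sum(2)), and its reverse pass returns log-durations that are decoded with torch.exp(logw). The sketch below uses plain Python and an idealized, noise-free predictor that simply returns the log of whatever it was trained on. That predictor and all function names here are assumptions for illustration, not the issue author's code.

```python
import math

def issue_target(d):
    # Training target used in the pseudocode above: log(d + 1).
    return math.log(d + 1.0)

def idealized_reverse(w_train):
    # ASSUMPTION: a predictor trained on w returns log(w) in reverse mode,
    # with no sampling noise (hypothetical stand-in for the real module).
    return math.log(w_train)

def issue_decode(logw):
    # Decode used in the pseudocode above: exp(.) - 1, rounded, clamped to >= 1.
    return max(round(math.exp(logw) - 1.0), 1)

def matched_decode(logw):
    # Undo both logs: the predictor's internal one and the log(d + 1) target.
    return max(round(math.exp(math.exp(logw)) - 1.0), 1)

durations = [1, 3, 5, 12]
issue_out = [issue_decode(idealized_reverse(issue_target(d))) for d in durations]
matched_out = [matched_decode(idealized_reverse(issue_target(d))) for d in durations]

print(issue_out)    # [1, 1, 1, 2]
print(matched_out)  # [1, 3, 5, 12]
```

Under these assumptions, the mismatched decode collapses most durations to the clamp minimum of 1, while undoing both logs recovers them. Whether this applies depends on how the duration predictor was adapted, which the issue does not show.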

OnceJune commented 2 years ago

How is the synthesis result with the fs2 duration predictor after the same number of training steps? Also, in fs2 training the gradient from the duration predictor is passed back to the encoder, whereas VITS uses x.detach() to cut off the gradient. I think this should also be taken into consideration. https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L51
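
The effect of detaching the predictor input can be seen in a toy example. The following sketch uses two small Linear layers as hypothetical stand-ins for the text encoder and the duration predictor (they are not the VITS or fastspeech2 modules) and compares the encoder's gradient norm with and without the detach.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(4, 4)    # hypothetical stand-in for the text encoder
predictor = torch.nn.Linear(4, 1)  # hypothetical stand-in for a duration predictor
text = torch.randn(2, 4)

def encoder_grad_norm(detach_input):
    """Return the norm of the encoder weight gradient caused by the predictor loss."""
    encoder.zero_grad()
    predictor.zero_grad()
    hidden = encoder(text)
    if detach_input:
        # Cut the graph here so the predictor loss cannot reach the encoder.
        hidden = hidden.detach()
    loss = predictor(hidden).pow(2).sum()
    loss.backward()
    grad = encoder.weight.grad
    return 0.0 if grad is None else grad.norm().item()

with_grad = encoder_grad_norm(detach_input=False)
without_grad = encoder_grad_norm(detach_input=True)
print(with_grad, without_grad)
```

With the detach, only the predictor's own parameters receive gradient from this loss, so the encoder is shaped by the remaining losses alone.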

LEECHOONGHO commented 2 years ago

@OnceJune Thanks for your reply. The fastspeech2 duration predictor works well. An audio sample synthesized with ddp is below. https://user-images.githubusercontent.com/44384060/154802366-3e1a959f-8652-4adb-95f8-f234ceb09d87.mp4

I think that's a very good point. However, as mentioned in the paper, I am afraid that the loss obtained from the noise of the SDP could adversely affect the text encoder (for example, by causing mispronunciation). I'll test this out and report back if the result is good.

blx0102 commented 1 year ago

@LEECHOONGHO Hi mate, have you successfully applied SDP to fs2?

godefv commented 1 year ago

I sometimes have the same issue in VITS. My workaround is to restart training from scratch.