p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License
477 stars 85 forks

Where is L_mse? #72

Closed OedoSoldier closed 9 months ago

OedoSoldier commented 9 months ago

The paper has three losses for the duration discriminator: $L_{adv}(D)$, $L_{adv}(G)$, and $L_{mse}$. But the code only implemented $L_{adv}(D)$ and $L_{adv}(G)$:

https://github.com/p0p4k/vits2_pytorch/blob/1f4f3790568180f8dec4419d5cad5d0877b034bb/train_ms.py#L418

https://github.com/p0p4k/vits2_pytorch/blob/1f4f3790568180f8dec4419d5cad5d0877b034bb/train_ms.py#L448

$L_{mse}$ should be:

```
loss_mse = F.mse_loss(logw, logw_)
```

Then add it to loss_gen_all:

https://github.com/p0p4k/vits2_pytorch/blob/1f4f3790568180f8dec4419d5cad5d0877b034bb/train_ms.py#L446
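For illustration, a minimal sketch of the suggested change, using stand-in tensors for the real training variables (the shapes and the name `other_losses` are hypothetical, not from the repo):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in tensors for the training variables in train_ms.py: logw_ plays
# the ground-truth (MAS-derived) log-durations, logw the predictor output.
logw_ = torch.randn(4, 1, 50)
logw = logw_ + 0.1 * torch.randn_like(logw_)

# L_mse as suggested above:
loss_mse = F.mse_loss(logw, logw_)

# Fold it into the total generator loss; `other_losses` is a placeholder
# for the terms already summed into loss_gen_all.
other_losses = torch.tensor(2.5)
loss_gen_all = other_losses + loss_mse
```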

p0p4k commented 9 months ago

Did you find it inside models.py ?

OedoSoldier commented 9 months ago

Did you find it inside models.py ?

Yes, I found the l_length👌

JohnHerry commented 9 months ago

I am not very clear about the l_length from the SDP; is it an MSE loss?

OedoSoldier commented 9 months ago

I am not very clear about the l_length from the SDP; is it an MSE loss?

The SDP is flow-based and should not be trained with the DD.

JohnHerry commented 9 months ago

Did you find it inside models.py ?

Yes, I found the l_length👌

What about the l_length for the SDP? Should this line be moved outside the else branch?

```
l_length = torch.sum((logw - logw_) ** 2, [1, 2]) / torch.sum(x_mask)
```

I am not very clear about the l_length from the SDP; is it an MSE loss?

The SDP is flow-based and should not be trained with the DD.

Yes, but as the paper says, the MSE loss should be part of the SDP training loss. Is the flow loss itself a kind of MSE loss?

p0p4k commented 9 months ago

So, in the SDP, we use a normalizing flow to map the discrete frame counts (durations) to a gaussian. The l_length in the SDP is a negative log-likelihood that should be minimized to ensure the flowed values are samples from a gaussian. During inference, we give it noise and the text, and it outputs the durations (using the reverse flow).
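As a rough illustration of this idea (not the repo's StochasticDurationPredictor), here is a toy single-parameter affine "flow" trained by minimizing the gaussian negative log-likelihood of fake log-durations, with reverse-flow sampling at the end; all names and numbers are made up:

```python
import math
import torch

torch.manual_seed(0)

# Toy stand-in for the SDP idea: an invertible affine map
# z = (logw - mu) / sigma sends log-durations toward a standard gaussian.
# The real SDP stacks many conditional flows; this is only illustrative.
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

logw = torch.randn(256) * 0.5 + 1.0  # fake log-durations (as if from MAS)

def nll(logw):
    z = (logw - mu) / log_sigma.exp()
    # log-likelihood under N(0, 1) plus the log|det| of the affine flow
    log_prob = -0.5 * (z ** 2 + math.log(2 * math.pi)) - log_sigma
    return -log_prob.mean()

opt = torch.optim.Adam([mu, log_sigma], lr=0.1)
for _ in range(300):
    opt.zero_grad()
    loss = nll(logw)
    loss.backward()
    opt.step()

# Inference mirrors the SDP: sample gaussian noise, run the flow in reverse.
z = torch.randn(8)
sampled_logw = z * log_sigma.exp() + mu  # durations recovered from noise
```

After training, `mu` and `sigma` land near the sample mean and standard deviation of the fake durations, so reverse samples look like plausible log-durations.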

JohnHerry commented 9 months ago

So, in the SDP, we use a normalizing flow to map the discrete frame counts (durations) to a gaussian. The l_length in the SDP is a negative log-likelihood that should be minimized to ensure the flowed values are samples from a gaussian. During inference, we give it noise and the text, and it outputs the durations (using the reverse flow).

Does that mean we would get no benefit from adding MSE(SDP(text, noise, reverse), d) to the SDP training loss? And is the MSE mentioned in the paper just for the DP, not the SDP?

p0p4k commented 9 months ago

No, we can do that: send gaussian noise and the text to the SDP, get the duration output, and then take the MSE against the real (MAS) durations. I have not added it in the repo, but it is easy to do.
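A minimal sketch of that suggestion, using a hypothetical stand-in module (`ToySDP` is not the repo's SDP class): sample gaussian noise, run the predictor in "reverse" to get log-durations, and take the MSE against the MAS log-durations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# `ToySDP` is a hypothetical stand-in for the repo's stochastic duration
# predictor; reverse=True mimics sampling durations from noise + text.
class ToySDP(nn.Module):
    def __init__(self, channels=192):
        super().__init__()
        self.proj = nn.Conv1d(channels, 1, 1)

    def forward(self, x, noise, reverse=False):
        return self.proj(x) + 0.1 * noise

sdp = ToySDP()
x = torch.randn(4, 192, 50)        # fake text-encoder output
logw_mas = torch.randn(4, 1, 50)   # fake MAS-derived log-durations
z_d = torch.randn(4, 1, 50)        # gaussian noise fed to the reverse pass
logw_pred = sdp(x, z_d, reverse=True)

# The extra MSE term discussed above, to be added to the generator loss:
loss_sdp_mse = F.mse_loss(logw_pred, logw_mas)
```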

p0p4k commented 9 months ago

I think vits2 uses sdp in paper. 'z_d' is the noise for sdp.

JohnHerry commented 9 months ago

I think vits2 uses sdp in paper. 'z_d' is the noise for sdp.

Thanks for the help, I will give it a try when convenient. I think even when we add the MSE on the SDP-predicted logw, it should be in the second stage, when training the SDP with the DPD for the last 30K steps.
