p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License

I am trying to use VITS2's new FLOW in SVC #22

Closed Hanxibird closed 1 year ago

Hanxibird commented 1 year ago

Hello, I am extremely grateful for your contribution to VITS2. I have been working on SVC for a long time, and recently I have become interested in the FLOW of VITS2.

I have implemented a VITS2-style FLOW in my SVC model, similar to your pre-conv FLOW. However, the results were slightly different, and I am unable to determine whether it is working as intended in SVC.

I saw that you mentioned the mono-FLOW aligns with the authors' intuition. So I would like to know: is it the best method for the FLOW in VITS2? I will try training it for a few days.

p0p4k commented 1 year ago

Hi, in my opinion all 3 methods capture the essence of long-range transformer dependency. The only difference is which of the 3 can generate more complex distributions from the Gaussian posterior. Unfortunately, the best way is to train all 3 and compare the results. But in the short term, you can choose any style and train the model.
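
Roughly, a transformer-conditioned coupling layer looks something like the sketch below (a simplified, mean-only version; the class name, layer sizes, and wiring are illustrative, not the exact code in this repo):

```python
import torch
import torch.nn as nn

class TransformerCouplingLayer(nn.Module):
    """Mean-only coupling layer whose conditioning half passes through a small
    self-attention block before the pointwise convolutions, so the flow can use
    long-range context (illustrative sketch only)."""
    def __init__(self, channels=192, hidden=192, n_heads=2):
        super().__init__()
        self.half = channels // 2
        self.pre = nn.Conv1d(self.half, hidden, 1)
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.post = nn.Conv1d(hidden, self.half, 1)

    def forward(self, x, x_mask, reverse=False):
        x0, x1 = torch.split(x, [self.half, self.half], dim=1)
        h = (self.pre(x0) * x_mask).transpose(1, 2)   # [B, T, hidden]
        h, _ = self.attn(h, h, h)                     # long-range context over time
        m = self.post(h.transpose(1, 2)) * x_mask     # predicted shift for x1
        x1 = (x1 - m) if reverse else (x1 + m)        # invertible mean-only update
        return torch.cat([x0, x1 * x_mask], dim=1)
```

With mean-only coupling the Jacobian log-determinant is zero, which keeps the sketch short; the real layers also predict a scale term.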

I have implemented a VITS2-style FLOW in my SVC model, similar to your pre-conv FLOW. However, the results were slightly different, and I am unable to determine whether it is working as intended in SVC.

What do you mean by slightly different? Better or worse?

Hanxibird commented 1 year ago

Thanks for your reply. Yes, I mean better or worse. My current research is about mining the hidden information of emotion, pitch, and other speech attributes in VITS. I found that a lot of hidden information, such as pitch, mood, and loudness, is lost during propagation through the FLOW. I'm expecting the new FLOW to behave differently.

Hanxibird commented 1 year ago

The new FLOW has a more distinct spectrogram, but I can't hear the difference between the songs. (Two screenshots attached: QQ截图20230825142642, QQ截图20230825142626.)

Hanxibird commented 1 year ago

The second is the speech generated by the new FLOW; it is clearer than the former in terms of loudness and pitch, but there is little audible difference.

p0p4k commented 1 year ago

One quick suggestion I can give is to increase the number of flow blocks. Make it a really big number and then cut it down gradually and see if there is a quality difference. This gives a good estimate of how much the flow contributes to modifying the data distribution.
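
A cheap way to set up that sweep, as a rough sketch (the MeanOnlyCoupling module and the depths below are illustrative stand-ins, not this repo's actual flow layers):

```python
import torch
import torch.nn as nn

class MeanOnlyCoupling(nn.Module):
    """Tiny mean-only coupling layer, a stand-in for the real residual-flow layer."""
    def __init__(self, channels=192, hidden=192):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv1d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, half, 3, padding=1),
        )

    def forward(self, x):
        x0, x1 = x.chunk(2, dim=1)
        x1 = x1 + self.net(x0)              # invertible mean-only update
        return torch.cat([x1, x0], dim=1)   # swap halves so the next layer transforms the other half

# Start with a deliberately deep flow, then cut the depth down and retrain,
# listening for the point where quality actually starts to drop.
for n_flows in (16, 8, 4, 2):
    flow = nn.Sequential(*[MeanOnlyCoupling() for _ in range(n_flows)])
    z = flow(torch.randn(1, 192, 100))      # [batch, channels, frames]
    print(n_flows, z.shape)
```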

Hanxibird commented 1 year ago

Many thanks for your suggestion! I will try it.

p0p4k commented 1 year ago

What I believe is that a normalizing flow is a "harsh" diffusion; they are very close concepts. CNF is supposedly better than diffusion in Voicebox by Meta. So the next logical step must be to replace our residual flows with a CNF for better expressiveness. Also, my guess is that longer flows mimic diffusion, which is why increasing the flow length should help. Let me know the results! Thanks.
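
For reference, the flow-matching recipe behind that CNF direction (the one Voicebox uses) boils down to regressing a velocity field along straight paths from noise to data. A minimal, self-contained sketch follows; the VelocityNet module, shapes, and names are assumptions rather than anything from this repo or the paper:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts a velocity field v(x_t, t); a toy stand-in for a real acoustic backbone."""
    def __init__(self, channels=192, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels + 1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),
        )

    def forward(self, x_t, t):
        # broadcast the scalar time over the frame axis and feed it as an extra channel
        t_feat = t.view(-1, 1, 1).expand(-1, 1, x_t.size(-1))
        return self.net(torch.cat([x_t, t_feat], dim=1))

def flow_matching_step(model, x1):
    """One conditional flow-matching step: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # sample from the Gaussian prior
    t = torch.rand(x1.size(0), device=x1.device)   # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1)
    x_t = (1 - t_) * x0 + t_ * x1                  # point on the straight path from noise to data
    v_target = x1 - x0                             # velocity along that path
    return ((model(x_t, t) - v_target) ** 2).mean()

model = VelocityNet()
x1 = torch.randn(4, 192, 100)                      # stand-in for latent/acoustic frames
loss = flow_matching_step(model, x1)
loss.backward()
```

At inference you integrate the learned velocity from t=0 to t=1 with an ODE solver, which is where the "continuous" in CNF comes from.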

p0p4k commented 1 year ago

I am changing the flow layers a bit more since I think I made a slight error in the earlier implementation. Wait for my code until tomorrow.

Hanxibird commented 1 year ago

Thank you for your response. You're absolutely right, and this made me realize that perhaps the continuous transformation of CNF is more suitable for voice conversion compared to the FLOW used in the VITS paper. I am currently debugging and testing the new FLOW for SVC based on your method, and I might conduct research on CNF in the future. I will keep you updated if I make any discoveries. Thank you again for your invaluable assistance.

p0p4k commented 1 year ago

In VITS-2, voice conversion will not work because the TextEncoder is conditioned on "g" as well. I am thinking of a solution there; maybe an SDP-type layer that takes in "g" and gives out "noise" can be used to condition the TextEncoder, making it reversible. That way, while doing voice conversion we can invert the flow to get the TextEncoder's unconditioned output, recondition it on g, and then voice convert. What are your thoughts? Add me on discord: p0p4k to discuss further.

p0p4k commented 1 year ago

In the line z_p = self.flow(z, y_mask, g=g_src), that code is from VITS-1. The z is flowed to the complex distribution that is supposed to come out of the TextEncoder plus length regulation (upsampling based on the duration predictor). In VITS-1 this z_p is independent of the speaker conditioning g. However, that is NOT the case in VITS-2, so voice conversion in this way is not possible. If we could reverse up to the pre-conditioned layer in the TextEncoder, then it could be done, but we cannot do that in the current architecture. Hope this helps!
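
For anyone reading along, the VITS-1 voice-conversion path being referred to is roughly this (paraphrased from the original VITS inference code; names may differ slightly):

```python
def voice_conversion(self, y, y_lengths, sid_src, sid_tgt):
    g_src = self.emb_g(sid_src).unsqueeze(-1)                # source speaker embedding
    g_tgt = self.emb_g(sid_tgt).unsqueeze(-1)                # target speaker embedding
    z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g_src)
    z_p = self.flow(z, y_mask, g=g_src)                      # forward flow strips source-speaker info
    z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)    # reverse flow injects target-speaker info
    return self.dec(z_hat * y_mask, g=g_tgt)

# This works in VITS-1 because z_p is trained to match a TextEncoder prior that never
# sees g, so it is approximately speaker-independent. In VITS-2 the TextEncoder is
# itself conditioned on g, so z_p still carries source-speaker information and the
# swap above no longer yields clean conversion.
```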

Hanxibird commented 1 year ago

There may indeed be some problems with voice conversion. The voice conversion I have tried in VITS-1 is not satisfactory, and it needs to be combined with some other networks to work well. I have not carefully considered voice conversion in VITS-2, and I think the voice conversion task does not depend too heavily on duration prediction. But I think reversing up to the pre-conditioned layer in the TextEncoder is a good idea; I will try to implement this in my experiments in a few days.

Hanxibird commented 1 year ago

I've already trained a singing voice transfer model using the FLOW of VITS-2, and it does sound slightly different; it feels better. But it seems the new FLOW is also sensitive to noise. I think there are still some problems in my code, maybe in the dimensions and the number of layers. I am going to adjust the number of FLOW layers and compare against the current results. I am also considering what in the TextEncoder section could be improved by reference to VITS-2.

Hanxibird commented 1 year ago

Add me on discord: p0p4k to discuss further.

I just sent you a friend request on Discord. I'm sorry for not replying sooner; I was preparing a paper for school over the past two days.

Yaodada12 commented 6 months ago

I've already trained a singing voice transfer model using the FLOW of VITS-2, and it does sound slightly different; it feels better. But it seems the new FLOW is also sensitive to noise. I think there are still some problems in my code, maybe in the dimensions and the number of layers. I am going to adjust the number of FLOW layers and compare against the current results. I am also considering what in the TextEncoder section could be improved by reference to VITS-2.

Hi, any progress recently? After using the new flow model, how much did the SVC quality improve? Did you use VCTK to train the baseline model? How long does training take?