maum-ai / assem-vc

Official Code for Assem-VC @ICASSP2022
https://mindslab-ai.github.io/assem-vc/
BSD 3-Clause "New" or "Revised" License

Controllable and Interpretable Singing Voice Decomposition via Assem-VC #27

Open AK391 opened 2 years ago

AK391 commented 2 years ago

Just saw this paper: https://arxiv.org/abs/2110.12676. When will the repo be updated for this? Thanks!

980202006 commented 2 years ago

Are there any details about the speaker embedding? For example, what model is used to generate it, whether it is pre-trained, and what dataset it is trained on.

wookladin commented 2 years ago

@AK391 Thanks for your interest! Currently, we don't have a specific plan to release the code of that paper. We will add links to the paper and demo page to the README soon.

@980202006 We just used nn.Embedding without pre-training. Thanks!
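
For readers wondering what that looks like in code, here is a minimal sketch of a lookup-table speaker embedding in PyTorch; the class name and sizes are illustrative, not taken from this repo.

```python
# Minimal sketch of a lookup-table speaker embedding as described above:
# a plain nn.Embedding trained jointly with the rest of the model, no
# pre-trained speaker encoder. Names and sizes are illustrative only.
import torch
import torch.nn as nn

class SpeakerEmbedding(nn.Module):
    def __init__(self, n_speakers: int, dim: int = 256):
        super().__init__()
        self.table = nn.Embedding(n_speakers, dim)

    def forward(self, speaker_ids: torch.Tensor) -> torch.Tensor:
        # speaker_ids: (batch,) integer indices -> (batch, dim) embeddings
        return self.table(speaker_ids)

emb = SpeakerEmbedding(n_speakers=20)
print(emb(torch.tensor([0, 3])).shape)  # torch.Size([2, 256])
```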

980202006 commented 2 years ago

Thank you!

iehppp2010 commented 2 years ago

@wookladin

I have tried to reproduce this paper. My training MSE mel loss reaches 0.18, while the dev mel loss plateaus at 0.75. [screenshot: Decoder train/dev loss curves]

After training the 'Decoder' model, I used it to do GTA fine-tuning of the HiFi-GAN model you provide. Below is the HiFi-GAN fine-tuning loss. [screenshot: HiFi-GAN fine-tuning loss curves]

After that, I tried to control speaker identity by simply switching the speaker embedding to the target speaker, which is the way described in the paper. I used an audio clip of the CSD female speaker (seen during training) as the reference audio: https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used the speaker PAMR from the NUS-48E dataset as the target speaker: https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is: https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing

I found that the lyrics are hard to hear clearly.
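
For concreteness, a self-contained toy of the embedding-switch step described above: decode the same content features once with the source speaker id and once with the target speaker id. The tiny decoder below is a stand-in, not the actual Assem-VC decoder.

```python
# Toy illustration of conversion by switching the speaker embedding:
# keep the content features, swap only the speaker id before decoding.
# All modules and sizes are stand-ins, not the actual Assem-VC code.
import torch
import torch.nn as nn

n_speakers, spk_dim, feat_dim = 20, 256, 80
speaker_table = nn.Embedding(n_speakers, spk_dim)
toy_decoder = nn.Linear(feat_dim + spk_dim, feat_dim)

def decode(content: torch.Tensor, spk_id: torch.Tensor) -> torch.Tensor:
    # content: (batch, frames, feat_dim); spk_id: (batch,)
    spk = speaker_table(spk_id).unsqueeze(1).expand(-1, content.size(1), -1)
    return toy_decoder(torch.cat([content, spk], dim=-1))

content = torch.randn(1, 100, feat_dim)                # features from the reference singer
src_id, tgt_id = torch.tensor([3]), torch.tensor([7])  # e.g. source singer -> target singer

reconstruction = decode(content, src_id)  # same-speaker resynthesis
conversion = decode(content, tgt_id)      # identity switched to the target speaker
```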

My dataset config:
- devset: from the CSD speaker, the three songs en48/en49/en50; from the NUS-48E speakers, ADIZ's 13 and JLEE's 05.
- trainset: the remaining songs in CSD and NUS-48E.
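
A minimal sketch of materializing such a split into train/dev filelists; the directory layout and file names are assumptions, not the repo's actual filelist format.

```python
# Write out train/dev filelists for the split described above.
# Paths and naming are placeholders, not the repo's format.
from pathlib import Path

dev_files = {
    "data/CSD/en48.wav", "data/CSD/en49.wav", "data/CSD/en50.wav",
    "data/NUS48E/ADIZ/13.wav", "data/NUS48E/JLEE/05.wav",
}

all_wavs = sorted(str(p) for p in Path("data").rglob("*.wav"))
train = [w for w in all_wavs if w not in dev_files]
dev = [w for w in all_wavs if w in dev_files]

Path("filelists").mkdir(exist_ok=True)
Path("filelists/train.txt").write_text("\n".join(train) + "\n")
Path("filelists/dev.txt").write_text("\n".join(dev) + "\n")
```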

My speaker embedding dimension is 256 (is 256 perhaps too large?).

I want to know what could be the problem with my model. Also, could you share your Decoder model's train/dev loss? My Decoder model gets a considerably larger mel MSE loss on the dev set than on the train set.

wookladin commented 2 years ago

@iehppp2010 Hi. Your alignment encoder, Cotatron, doesn't seem to be working properly. As explained in the paper, we transferred Cotatron from pre-trained weights, which were trained on LibriTTS and VCTK. Did you transfer from those weights? You can find the pre-trained weights in this Google Drive link.
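
A self-contained illustration of that transfer-then-freeze pattern; the toy modules and checkpoint name below are placeholders, not the actual repo classes.

```python
# Warm-start from pre-trained weights, then freeze the alignment encoder
# while the decoder trains. Toy modules only, not the actual repo code.
import torch
import torch.nn as nn

class ToyAssemVC(nn.Module):
    def __init__(self):
        super().__init__()
        self.cotatron = nn.Linear(80, 80)                # stands in for the alignment encoder
        self.decoder = nn.GRU(80, 80, batch_first=True)  # stands in for the VC decoder

model = ToyAssemVC()

# 1) Load pre-trained weights (faked on disk here) with strict=False so that
#    decoder weights absent from the checkpoint are simply left initialized.
torch.save({"state_dict": {k: v for k, v in model.state_dict().items()
                           if k.startswith("cotatron")}}, "cotatron_pretrained.ckpt")
state = torch.load("cotatron_pretrained.ckpt", map_location="cpu")["state_dict"]
model.load_state_dict(state, strict=False)

# 2) Freeze the alignment encoder so only the decoder receives gradients.
for p in model.cotatron.parameters():
    p.requires_grad = False
```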

iehppp2010 commented 2 years ago

@wookladin Thanks for your quick reply. I did use the pre-trained weights. When I train the 'Decoder' model, the 'Cotatron' aligner is frozen. I found that the plotted alignment is not as good as with other TTS models, e.g. Tacotron 2. [screenshot: attention alignment plot]

I want to know whether I need to fine-tune the 'Cotatron' model on the singing dataset to get a better alignment. Looking forward to your reply.

wookladin commented 2 years ago

@iehppp2010 Yes. You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data. That should give better alignment and sample quality.

iehppp2010 commented 2 years ago

> @iehppp2010 Yes. You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data. That should give better alignment and sample quality.

@wookladin Thanks for your quick reply. After fine-tuning the Cotatron model, train.loss_reconstruction converges to about 0.2, while val.loss_reconstruction reaches its minimum of about 0.5 at step 3893.

[screenshot: Cotatron fine-tuning loss curves]

I used that checkpoint to train the Decoder model and to fine-tune the HiFi-GAN vocoder.

I found that when I test with audio that the fine-tuned Cotatron model has never seen, I can't get good sample quality. I suspect this is because the Cotatron model does not produce a good alignment...

So, how can I get the Cotatron model to produce better alignments on unseen singing audio? Also, could you provide more training details?

betty97 commented 2 years ago

@iehppp2010, I am also trying to reproduce the results of this paper. I have one question regarding dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds"; did you do it manually for both CSD and NUS-48E, or in some other way? Thanks!!
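
Not speaking for the authors, but one unofficial way to obtain segments in that 1-12 s range is to split on silence and then filter by duration; the thresholds, paths, and naming below are guesses, not the paper's procedure.

```python
# Split a song on silence with librosa, grow segments until they are at
# least 1 s long, and keep only clips up to 12 s. Unofficial sketch only.
import os
import librosa
import soundfile as sf

wav, sr = librosa.load("data/CSD/en001.wav", sr=22050)
intervals = librosa.effects.split(wav, top_db=40)  # non-silent (start, end) sample ranges

segments, start = [], None
for s, e in intervals:
    if start is None:
        start = s
    if (e - start) / sr >= 1.0:          # close a segment once it reaches 1 second
        segments.append((start, e))
        start = None

os.makedirs("segments", exist_ok=True)
for i, (s, e) in enumerate(segments):
    if (e - s) / sr <= 12.0:             # discard anything longer than 12 seconds
        sf.write(f"segments/en001_{i:03d}.wav", wav[s:e], sr)
```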