AK391 opened this issue 3 years ago
Are there any details about the speaker embedding? For example, what model is used to generate it, whether it is pre-trained, and what dataset is used.
@AK391 Thanks for your interest! Currently, we don't have a specific plan to release the code of that paper. We will add links to the paper and demo page to the README soon.
@980202006 We just used `nn.Embedding` without pre-training. Thanks!
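In other words, the speaker embedding is just a learned lookup table trained jointly with the rest of the model. A minimal sketch (the speaker count and dimension below are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SpeakerEmbedding(nn.Module):
    """Learned speaker lookup table, trained jointly with the rest of the model."""
    def __init__(self, num_speakers: int, embedding_dim: int):
        super().__init__()
        # Plain nn.Embedding, randomly initialized -- no pre-training.
        self.table = nn.Embedding(num_speakers, embedding_dim)

    def forward(self, speaker_ids: torch.Tensor) -> torch.Tensor:
        # speaker_ids: (batch,) integer indices -> (batch, embedding_dim)
        return self.table(speaker_ids)

# Usage: look up embeddings for a batch of speaker indices.
emb = SpeakerEmbedding(num_speakers=16, embedding_dim=256)
vec = emb(torch.tensor([0, 3, 7]))  # shape: (3, 256)
```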
Thank you!
@wookladin I have tried to reproduce this paper. My training mel MSE loss reaches 0.18, while the dev mel loss plateaus at 0.75.
After training the 'Decoder' model, I used it to do GTA fine-tuning on the HiFi-GAN model you provide. Below is the HiFi-GAN fine-tuning loss.
After that, I tried to control speaker identity by simply switching the speaker embedding to the target speaker, which is the way described in the paper (a rough sketch of how I do this swap is at the end of this comment).
I used a training audio of the CSD female speaker as the reference audio: https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used the speaker PAMR in the NUS-48E dataset as the target speaker: https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is: https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing
I found that the lyrics are hard to hear clearly.
My dataset config: for the dev set, the three CSD audios en48/en49/en50 and the NUS-48E audios ADIZ's 13 and JLEE's 05 were chosen; the train set is the other songs in CSD and NUS-48E.
My speaker embedding dimension is 256 (perhaps 256 is too large?).
What could be the problem with my model? And can you share your Decoder model's train/dev loss? My Decoder model gets a relatively larger mel MSE loss on the dev set than on the train set.
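For clarity, here is roughly how I do the embedding swap at inference (the function and module names are from my own script, not from the released code):

```python
import torch

@torch.no_grad()
def convert_speaker(cotatron, decoder, vocoder, ref_mel, ref_text, target_speaker_id):
    """Rough sketch of my inference script: re-synthesize a reference singing clip
    with another speaker's embedding. All module names are my own wrappers."""
    # 1) Extract alignment/linguistic features from the reference audio.
    features = cotatron(ref_mel, ref_text)

    # 2) Look up the *target* speaker's embedding instead of the source speaker's.
    spk_emb = decoder.speaker_embedding(torch.tensor([target_speaker_id]))

    # 3) Decode a mel-spectrogram conditioned on the swapped embedding,
    #    then run it through the GTA fine-tuned HiFi-GAN vocoder.
    mel_out = decoder(features, spk_emb)
    return vocoder(mel_out)
```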
@iehppp2010 Hi. I think your alignment encoder, Cotatron, isn't working properly. As explained in the paper, we transferred Cotatron from pre-trained weights, which were trained on LibriTTS and VCTK. Did you transfer from those weights? You can find the pre-trained weights in this Google Drive link.
@wookladin Thanks for your quick reply. I did use the pre-trained weights. When I train the 'Decoder' model, the 'Cotatron' aligner is frozen. I found that the plotted alignment is not as good as with other TTS models, e.g. Tacotron 2.
Do I need to fine-tune the 'Cotatron' model on the singing dataset to get better alignment? Looking forward to your reply.
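For reference, by "frozen" I mean something like this generic PyTorch snippet (the placeholder modules just stand in for my own wrappers, not the repo's classes):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real Cotatron aligner and Decoder.
cotatron = nn.GRU(80, 256, batch_first=True)
decoder = nn.Linear(256, 80)

# Freeze the pre-trained aligner so only the Decoder is updated.
for param in cotatron.parameters():
    param.requires_grad = False
cotatron.eval()  # also stops dropout / running-stat updates inside the aligner

# Give the optimizer only the Decoder's (trainable) parameters.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)
```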
@iehppp2010 Yes. You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data. It will give better alignment and sample quality.
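Roughly, the transfer just means loading the pre-trained checkpoint and continuing training on the singing data. A sketch (the checkpoint name, forward signature, and loss below are placeholders, not the actual repo code):

```python
import torch
import torch.nn.functional as F

def finetune_cotatron(cotatron, singing_loader, ckpt_path="cotatron_libritts_vctk.ckpt"):
    """Sketch only: load the LibriTTS+VCTK checkpoint, then keep training the
    reconstruction objective on the singing data (CSD + NUS-48E)."""
    state = torch.load(ckpt_path, map_location="cpu")
    cotatron.load_state_dict(state["state_dict"], strict=False)

    # Usually a lower learning rate than the original speech training run.
    optimizer = torch.optim.Adam(cotatron.parameters(), lr=1e-5)
    for text, mel in singing_loader:
        optimizer.zero_grad()
        mel_pred, alignment = cotatron(text, mel)   # assumed forward signature
        loss = F.mse_loss(mel_pred, mel)            # stands in for loss_reconstruction
        loss.backward()
        optimizer.step()
```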
@wookladin Thanks for your quick reply. After I fine-tuned the Cotatron model, train.loss_reconstruction converges to about 0.2, while val.loss_reconstruction reaches its minimum of about 0.5 at step 3893.
I use that checkpoint to train the Decoder model and fine-tune the HiFi-GAN vocoder.
I found that when testing with audio the fine-tuned Cotatron model has never seen, I can't get good sample quality. I guess the reason is that the Cotatron model doesn't produce a good alignment...
So, how can I get the Cotatron model to produce better alignment on unseen singing audio? Also, could you provide more training details?
@iehppp2010, I am also trying to reproduce the results of this paper. I have one doubt regarding the dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds" -- did you do it manually for both CSD and NUS-48E, or in some other way? Thanks!!
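One automatic approach I was considering is silence-based splitting followed by merging chunks up to 12 seconds; this is purely my own guess, not necessarily what the authors did:

```python
import librosa

def split_singing(path, sr=22050, top_db=40, min_len=1.0, max_len=12.0):
    """Guess at an automatic split: cut on silence, then merge neighbouring
    chunks so each segment ends up between 1 and 12 seconds."""
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)  # (start, end) in samples

    segments, cur = [], None
    for start, end in intervals:
        if cur is None:
            cur = [start, end]
        elif (end - cur[0]) / sr <= max_len:
            cur[1] = end                  # keep merging into the current segment
        else:
            segments.append(tuple(cur))
            cur = [start, end]
    if cur is not None:
        segments.append(tuple(cur))

    # Drop segments shorter than the minimum length.
    return [(s, e) for s, e in segments if (e - s) / sr >= min_len]
```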
Just saw this paper: https://arxiv.org/abs/2110.12676. When will the repo be updated for this? Thanks!