bensonbs opened 5 months ago
If you want to train from scratch, I think you need a very large dataset. If you want to fine-tune (note that the language of your dataset must be one of the XTTS pretraining languages), I think at least 30 minutes of audio is enough.
Additionally, I want to ask why some losses increase during training. Should I choose the last checkpoint or the best checkpoint?
Remember that we use a GAN loss, so the discriminator loss and the generator loss are optimized in opposite directions. The best model is chosen from the evaluation test loss, so you can pick that one.
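For intuition, here is a minimal least-squares GAN sketch (illustrative only, not the repo's actual loss code) showing why the two losses pull against each other, so one of them rising during training is expected:

import torch
import torch.nn.functional as F

# d_real / d_fake are discriminator scores for real and generated audio.
def discriminator_loss(d_real, d_fake):
    # the discriminator pushes real scores toward 1 and fake scores toward 0
    return F.mse_loss(d_real, torch.ones_like(d_real)) + F.mse_loss(d_fake, torch.zeros_like(d_fake))

def generator_loss(d_fake):
    # the generator pushes the same fake scores toward 1, the opposite direction
    return F.mse_loss(d_fake, torch.ones_like(d_fake))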
Hi, how about multiple speakers? Is 30 minutes enough?
I'm trying to train from scratch; I'll let you know when I succeed.
Thanks, I've used 2000 samples; it works after fine-tuning, but the quality is not very high.
Is this for multiple speakers? Could you provide examples of the generated voices?
My XTTS-v2 model produces a cracking sound when generating higher-pitched female voices after being fine-tuned.
Therefore, I attempted to fine-tune HiFi-GAN using a dataset of approximately 10,000 multi-speaker samples (the same dataset used for fine-tuning the XTTS-v2 model), but it was unsuccessful.
Do you train on a new language? I added Vietnamese for the GPT part and used that same dataset for HiFi-GAN too, and it's working.
More data will improve voice quality
I fine-tuned on Chinese data. If I train for a longer period, will it improve?
GT audio: https://mork.ro/DqTvY
Test audio: https://mork.ro/D2hZO
I found that the audio files generated in the "synthesis" folder are already unrecognizable in terms of the sentences' content. Is this normal?
If you use the pretrained model, the generated audio should match the content of the sentences.
Whether using a pre-trained or fine-tuned xtts model, the differences between the audio in the 'wav' and 'synthesis' folders in the Chinese dataset are greater than in the English dataset, to the extent that the content becomes unintelligible.
This does not happen with Vietnamese; the synthesis and raw wav are almost the same. Can you send me some synthesis files and the corresponding raw audio?
Raw wav: https://mork.ro/JdAyO#
synthesis: https://mork.ro/cJ0ad#
I don't know Chinese, so does the synthesized audio read the sentence correctly?
I feel that the model is indeed trying to synthesize the same sentences, but the pronunciation is not standard, making the content unrecognizable.
So I think the problem is in the GPT part of your XTTS model, because with my language the pronunciation of the synthesized audio is correct.
I used the pre-trained model self.xtts_checkpoint = "XTTS-v2/model.pth" to test whether the issue lies within the GPT part. I commented out the following two lines in test.py:
# self.hifigan_generator = self.load_hifigan_generator() <----
# # model.hifigan_decoder.waveform_decoder = self.hifigan_generator <----
This ensures that XTTS uses the pre-trained HiFi-GAN. I used wav/094.wav as the speaker_reference and generated the same text content as in metadata.csv for 094.wav. Then I compared the output with synthesis/094.wav generated by generate_latents.py.
In theory, both generate_latents.py and test.py (with the HiFi-GAN loading lines commented out) should produce the same results, since they both use the pre-trained XTTS and the pre-trained HiFi-GAN. However, I noticed that the synthesis/094.wav generated by generate_latents.py has mispronunciations, while the xtts_finetune_hifigan.wav generated by test.py (with the HiFi-GAN loading lines commented out) sounds correct.
Is my approach to this test correct?
No, test.py uses the GPT's inference function while generate_latents.py uses forward, so the outputs are not the same.
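Roughly speaking, forward is teacher forcing and inference is autoregressive decoding. A minimal sketch with a hypothetical decoder gpt (not the actual XTTS GPT API):

import torch

def extract_latents(gpt, text_tokens, gt_audio_tokens):
    # forward / teacher forcing: the ground-truth audio tokens are fed in,
    # so the latents stay aligned with the real waveform (the generate_latents.py path)
    with torch.no_grad():
        return gpt(text_tokens, gt_audio_tokens)

def generate(gpt_step, text_tokens, start_token, max_len=100):
    # inference / autoregressive: each token is produced from the model's own
    # previous predictions, as at synthesis time (the test.py path), so the
    # output can drift from the ground truth
    tokens = [start_token]
    for _ in range(max_len):
        tokens.append(gpt_step(text_tokens, tokens))
    return tokens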
Could you explain in detail the difference between Inference and Forward? Additionally, could you speculate on what might be causing the issue? Thanks!
Maybe you could check whether you are actually using the pretrained HiFi-GAN weights.
Notice this line: self.model.model_g.load_state_dict(hifigan_state_dict, strict=False). Make sure you are loading the pretrained weights (otherwise you are training from scratch).
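One way to check this (a sketch, assuming a standard PyTorch module and reusing the names from the line above) is to inspect what load_state_dict with strict=False actually matched:

# strict=False silently skips mismatched keys, so look at the returned key lists
result = self.model.model_g.load_state_dict(hifigan_state_dict, strict=False)
print("missing keys:", result.missing_keys)        # weights left randomly initialized
print("unexpected keys:", result.unexpected_keys)  # checkpoint entries that were never used
# if most generator weights are listed as missing, you are effectively training from scratch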
May I ask whether this requires many epochs to get clear-sounding output, even when fine-tuning?
Hello @tuanh123789, may I ask which language you are training with LJ Speech? Is it possible to train an XTTS model for Vietnamese?
LJSpeech is an English dataset. Yes, you can train an XTTS model for Vietnamese, provided you have the XTTS GPT part trained on Vietnamese, so it can be used to generate the data for training the XTTS HiFi-GAN.
Could you explain the XTTS GPT training in more detail? Thanks.
This model is trained in 3 stages: DVAE, GPT, and HiFi-GAN. This repo is for training the HiFi-GAN for XTTS. To train for Vietnamese, you first need to train stage 2 (GPT) on Vietnamese. I haven't made that script public yet.
Isn't the XTTS training process supposed to be done sequentially in the following order, where the output of the previous stage is the input of the next stage: 1. fine-tuning the DVAE model, 2. fine-tuning the GPT-2 model, 3. end-to-end fine-tuning of the whole system using HiFi-GAN?
Am I right?
I would also love to know the answer to this :)
The HiFi-GAN is a separate part, not end-to-end. The output from the GPT part is used to train the HiFi-GAN. Fine-tuning the DVAE is not necessary.
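So the data flow is roughly: GPT latents are extracted first, then used as the HiFi-GAN training input. A rough sketch with hypothetical names (not the repo's actual scripts):

import torch
import torch.nn.functional as F

def build_latent_dataset(gpt, dataset):
    # stage 2 -> stage 3 handoff: teacher-forced GPT latents per utterance
    # (roughly the role generate_latents.py plays here)
    pairs = []
    for text_tokens, waveform in dataset:
        with torch.no_grad():
            latents = gpt(text_tokens, waveform)  # conditioned on ground-truth audio
        pairs.append((latents, waveform))
    return pairs

def hifigan_generator_step(generator, discriminator, latents, waveform):
    # stage 3: the vocoder learns to map GPT latents back to the waveform
    fake = generator(latents)
    d_fake = discriminator(fake)
    adv_loss = F.mse_loss(d_fake, torch.ones_like(d_fake))
    return adv_loss  # plus mel / feature-matching losses in practice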