tuanh123789 / Train_Hifigan_XTTS

This is an implementation for training the HiFi-GAN part of the XTTSv2 model using Coqui/TTS.

How many hours of audio are needed to train a model? #2

Open · bensonbs opened 5 months ago

bensonbs commented 5 months ago

Additionally, I want to ask why some losses increase during training. Should I choose the last checkpoint or the best checkpoint?

tuanh123789 commented 5 months ago

If you want to train from scratch, I think you need a very large dataset. If you want to fine-tune (note that the language of your dataset must be among the XTTS pretraining languages), I think at least 30 minutes.

tuanh123789 commented 5 months ago

> Additionally, I want to ask why some losses increase during training. Should I choose the last checkpoint or the best checkpoint?

Remember that we use a GAN loss, so the discriminator loss and the generator loss are optimized in opposite directions. The best model is chosen by the evaluation test loss, so you can use that one.
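
For intuition, here is a minimal sketch of the LSGAN-style losses that HiFi-GAN uses (illustrative only, not this repo's actual loss code):

```python
import torch

# Minimal sketch of LSGAN-style adversarial losses (as used in HiFi-GAN);
# illustrative only, not this repo's actual loss code.

def discriminator_loss(d_real, d_fake):
    # The discriminator is pushed toward 1 on real audio and 0 on generated audio.
    return torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)

def generator_loss(d_fake):
    # The generator is pushed to make the discriminator output 1 on its
    # audio -- the exact opposite objective, so as one network improves,
    # the other network's loss tends to go up.
    return torch.mean((d_fake - 1) ** 2)
```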

hscspring commented 5 months ago

> If you want to train from scratch, I think you need a very large dataset. If you want to fine-tune (note that the language of your dataset must be among the XTTS pretraining languages), I think at least 30 minutes.

Hi, how about multiple speakers? Is 30 minutes enough?

tuanh123789 commented 5 months ago

> Hi, how about multiple speakers? Is 30 minutes enough?

I'm trying to train from scratch; I'll let you know when I succeed.

hscspring commented 5 months ago

> I'm trying to train from scratch; I'll let you know when I succeed.

Thanks. I've used 2,000 samples; it works after fine-tuning, but the quality is not very high.

bensonbs commented 5 months ago

> Thanks. I've used 2,000 samples; it works after fine-tuning, but the quality is not very high.

Is this for multiple speakers? Could you provide examples of the generated voices?

My XTTS-v2 model produces a cracking sound when generating higher-pitched female voices after being fine-tuned.

Therefore, I attempted to fine-tune HiFi-GAN using a dataset of approximately 10,000 multi-speaker samples (the same dataset used for fine-tuning the XTTS-v2 model), but it was unsuccessful.

tuanh123789 commented 5 months ago

> Therefore, I attempted to fine-tune HiFi-GAN using a dataset of approximately 10,000 multi-speaker samples (the same dataset used for fine-tuning the XTTS-v2 model), but it was unsuccessful.

Did you train on a new language? I added Vietnamese for the GPT and used that dataset for the HiFi-GAN too, and it works.

tuanh123789 commented 5 months ago

> Thanks. I've used 2,000 samples; it works after fine-tuning, but the quality is not very high.

More data will improve voice quality

bensonbs commented 5 months ago

I fine-tuned on Chinese data. If I train for a longer period, will it improve?


GT audio: https://mork.ro/DqTvY

Test audio: https://mork.ro/D2hZO

bensonbs commented 5 months ago

I found that in the audio files generated in the "synthesis" folder, the sentence content is already unrecognizable. Is this normal?

tuanh123789 commented 5 months ago

If you use the pretrained model, the generated audio should match the sentence content.

bensonbs commented 5 months ago

Whether I use the pre-trained or the fine-tuned XTTS model, the differences between the audio in the 'wav' and 'synthesis' folders are greater for the Chinese dataset than for the English dataset, to the extent that the content becomes unintelligible.

tuanh123789 commented 5 months ago

This does not happen with Vietnamese; the synthesized and raw wavs are almost the same. Can you send me some synthesized files and the corresponding raw audio?

bensonbs commented 5 months ago

Raw wav: https://mork.ro/JdAyO#

Synthesis: https://mork.ro/cJ0ad#

tuanh123789 commented 5 months ago

I don't know Chinese, so I can't judge: do the synthesized audio files read the sentences correctly?

bensonbs commented 5 months ago

I feel that the model is indeed trying to synthesize the same sentences, but the pronunciation is not standard, making the content unrecognizable.

tuanh123789 commented 5 months ago

> I feel that the model is indeed trying to synthesize the same sentences, but the pronunciation is not standard, making the content unrecognizable.

So I think the problem is in the GPT part of your XTTS model, because in my language the pronunciation of the synthesized audio is correct.

bensonbs commented 5 months ago

I used the pre-trained model self.xtts_checkpoint = "XTTS-v2/model.pth" to test if the issue lies within the GPT part. I commented out the following two lines in test.py:

# self.hifigan_generator = self.load_hifigan_generator()  <----
# model.hifigan_decoder.waveform_decoder = self.hifigan_generator  <----

This ensures that XTTS uses the pre-trained HiFi-GAN.

I used wav/094.wav as the speaker_reference and generated the same text content as in metadata.csv for 094.wav. Then, I compared the output with synthesis/094.wav generated by generate_latents.py.

In theory, both generate_latents.py and test.py (with the HiFi-GAN loading lines commented out) should produce the same results, since they both use the pre-trained XTTS and the pre-trained HiFi-GAN.

However, I noticed that the synthesis/094.wav generated by generate_latents.py has mispronunciations, while the xtts_finetune_hifigan.wav generated by test.py (with the HiFi-GAN loading lines commented out) sounds correct.

Is my approach to this test correct?

tuanh123789 commented 5 months ago

> In theory, both generate_latents.py and test.py (with the HiFi-GAN loading lines commented out) should produce the same results, since they both use the pre-trained XTTS and the pre-trained HiFi-GAN. [...] Is my approach to this test correct?

No. test.py uses the GPT's inference function, while generate_latents.py uses its forward function, so the outputs are not the same.
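
In other words, forward is teacher-forced while inference is autoregressive. A toy illustration of the difference (placeholder model and names, not the actual XTTS GPT API):

```python
import torch

# Toy contrast between the two code paths (placeholder model, not the
# actual XTTS GPT API).
torch.manual_seed(0)
step = torch.nn.Linear(8, 8)  # stand-in for one GPT decoding step

def forward_teacher_forced(gt_codes):
    # forward: every step sees the ground-truth previous code, so the
    # produced latents stay aligned with the real recording.
    return [step(c) for c in gt_codes]

def inference_autoregressive(prompt, n_steps):
    # inference: every step feeds the model's own previous output back in,
    # so sampling noise and small errors compound and the result can drift.
    outputs, current = [], prompt
    for _ in range(n_steps):
        current = step(current)
        outputs.append(current)
    return outputs

gt_codes = [torch.randn(8) for _ in range(4)]
aligned = forward_teacher_forced(gt_codes)
drifted = inference_autoregressive(gt_codes[0], n_steps=4)
```

With teacher forcing, the latents stay time-aligned with the ground-truth waveform, which is what you want when generating HiFi-GAN training data.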

bensonbs commented 5 months ago

Could you explain in detail the difference between Inference and Forward? Additionally, could you speculate on what might be causing the issue? Thanks!

hscspring commented 5 months ago

Maybe you could check whether you are actually using the pretrained HiFi-GAN weights. Notice this line: self.model.model_g.load_state_dict(hifigan_state_dict, strict=False). With strict=False, mismatched keys are silently skipped, so make sure the pretrained weights are really being loaded (otherwise, you are training from scratch).
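
One way to verify this (a sketch built on the line above; in PyTorch, load_state_dict returns the keys it skipped):

```python
# strict=False silently skips mismatched keys, but load_state_dict
# returns them, so you can check what was actually loaded:
result = self.model.model_g.load_state_dict(hifigan_state_dict, strict=False)
print("missing keys   :", result.missing_keys)     # left randomly initialized
print("unexpected keys:", result.unexpected_keys)  # in the checkpoint but unused
# If missing_keys covers most of the generator, the pretrained weights
# never went in and you are effectively training from scratch.
```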

ScottishFold007 commented 5 months ago

May I ask whether this requires many epochs to get a clear sound, even when fine-tuning?

ScottishFold007 commented 5 months ago

Doesn't the training process of XTTS have to proceed sequentially through the following stages, with the output of the previous stage being the input of the next?

1. Fine-tune the DVAE model
2. Fine-tune the GPT-2 model
3. End-to-end fine-tune the whole system using HiFi-GAN

Am I right?

vcstack commented 5 months ago

Hello @tuanh123789, may I ask which language the LJ Speech setup you are using is training for? And is it possible to train a Vietnamese XTTS model?

tuanh123789 commented 5 months ago

> Hello @tuanh123789, may I ask which language the LJ Speech setup you are using is training for? And is it possible to train a Vietnamese XTTS model?

LJSpeech is an English dataset. Yes, you can train a Vietnamese XTTS model, provided that you have the XTTS GPT part trained on Vietnamese, so you can use it to generate the training data for the XTTS HiFi-GAN.

vcstack commented 5 months ago

> LJSpeech is an English dataset. Yes, you can train a Vietnamese XTTS model, provided that you have the XTTS GPT part trained on Vietnamese, so you can use it to generate the training data for the XTTS HiFi-GAN.

Could you explain the XTTS GPT training in more detail? Thanks.

tuanh123789 commented 5 months ago

This model is trained in three stages: VAE, GPT, and HiFi-GAN. This repo is for training the HiFi-GAN for XTTS. To train for Vietnamese, you first need stage 2: training the GPT on Vietnamese. I have not published that script yet.

C00reNUT commented 2 months ago

> Doesn't the training process of XTTS have to proceed sequentially through the following stages, with the output of the previous stage being the input of the next? 1. Fine-tune the DVAE model 2. Fine-tune the GPT-2 model 3. End-to-end fine-tune the whole system using HiFi-GAN. Am I right?

I would also love to know the answer to this :)

tuanh123789 commented 2 months ago

> Doesn't the training process of XTTS have to proceed sequentially through the following stages, with the output of the previous stage being the input of the next? 1. Fine-tune the DVAE model 2. Fine-tune the GPT-2 model 3. End-to-end fine-tune the whole system using HiFi-GAN. Am I right?

The HiFi-GAN is a separate part, not end-to-end. The output of the GPT part is used to train the HiFi-GAN. Fine-tuning the DVAE is not necessary.
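
As a rough sketch of that data flow (placeholder names, not this repo's actual scripts):

```python
# Rough sketch of the staged pipeline described above (placeholder names,
# not this repo's actual scripts).

def prepare_hifigan_data(gpt, dataset):
    # Stage 2 feeds stage 3: run the already-trained GPT in teacher-forced
    # mode over each utterance and keep the latents it produces.
    pairs = []
    for text_tokens, audio_codes, waveform in dataset:
        latents = gpt.forward(text_tokens, audio_codes)  # no sampling
        pairs.append((latents, waveform))
    return pairs

def train_hifigan(hifigan, pairs):
    # Stage 3: the vocoder learns to map GPT latents back to waveforms.
    # The DVAE only produces the audio codes consumed upstream, so it
    # does not have to be fine-tuned for this stage.
    for latents, waveform in pairs:
        hifigan.training_step(latents, waveform)
```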