tuanh123789 / Train_Hifigan_XTTS

This repository implements training for the HiFi-GAN part of the XTTSv2 model using Coqui/TTS.



In this repo, I'm using the LJSpeech dataset for experimentation, but you can swap in a different dataset as long as it follows the same format as LJSpeech. Make sure your dataset includes both audio and transcripts.
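Before swapping in your own data, it can help to check that it really follows the LJSpeech layout. The sketch below assumes the standard LJSpeech structure (a pipe-delimited `metadata.csv` whose first field is the clip id and second field is the transcript, plus a `wavs/` folder holding `<id>.wav`). The function name `check_ljspeech_layout` is hypothetical and is not part of this repository.

```python
import csv
from pathlib import Path


def check_ljspeech_layout(root):
    """Return a list of problems found in an LJSpeech-style dataset folder.

    Assumes the standard LJSpeech layout: root/metadata.csv with
    'id|transcript|normalized transcript' rows and audio in root/wavs/<id>.wav.
    """
    root = Path(root)
    metadata = root / "metadata.csv"
    if not metadata.is_file():
        return [f"missing {metadata}"]

    problems = []
    with metadata.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: transcripts may contain quote characters that are plain text.
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) < 2 or not row[1].strip():
                problems.append(f"line {line_no}: expected 'id|transcript'")
                continue
            wav = root / "wavs" / f"{row[0]}.wav"
            if not wav.is_file():
                problems.append(f"line {line_no}: missing audio {wav.name}")
    return problems
```

An empty list means every transcript row has a matching audio file; anything else lists the rows to fix before generating latents.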

Note that this repository currently supports training both the HiFi-GAN decoder and the speaker encoder. If you find this useful, please give me a star. Thank you!

Download dataset and XTTSv2 checkpoint:

Generate GPT Latents

Unlike a conventional HiFi-GAN, which takes mel spectrograms as input, the XTTSv2 decoder converts GPT latents into waveforms.

Run the script below to generate GPT latents into the "Ljspeech_latents" folder. You can change the output folder in generate_latents.py.

python generate_latents.py
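After the script finishes, a quick sanity check is to confirm that every audio clip has a corresponding latent file. The sketch below is only an illustration: it assumes that latents are saved one file per clip with the same base name as the audio file, which this README does not state, so verify it against generate_latents.py first. The helper name `find_missing_latents` is hypothetical.

```python
from pathlib import Path


def find_missing_latents(wav_dir, latent_dir):
    """Return the sorted clip ids that have a .wav file but no latent file.

    Assumption (not confirmed by the repository): each latent file shares
    its base name with the clip it came from; the extension is ignored.
    """
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    latent_stems = {p.stem for p in Path(latent_dir).iterdir() if p.is_file()}
    return sorted(wav_stems - latent_stems)
```

An empty result suggests latent generation covered the whole dataset; any ids returned are clips to regenerate before training.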


After generating GPT latents, we will use them to train the model.

python train.py

No pre-trained discriminator is available, so training can only load pre-trained weights for the generator. As a result, the generated audio may sound noisy in the early epochs. Please be patient; the results improve in later epochs.

On systems with multiple GPUs, the following command enables multi-GPU training:

CUDA_VISIBLE_DEVICES="0,1" python -m trainer.distribute --script train.py
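If you prefer to start the same run from Python (for example from a job script), the command above can be assembled as shown in this sketch. It only reproduces the shell command from this README: it sets `CUDA_VISIBLE_DEVICES` and invokes `trainer.distribute` with `--script`. The function name `build_distributed_launch` is hypothetical.

```python
import os
import sys


def build_distributed_launch(gpu_ids, script="train.py"):
    """Build the command and environment for the multi-GPU launch above.

    Mirrors: CUDA_VISIBLE_DEVICES="0,1" python -m trainer.distribute --script train.py
    """
    visible = ",".join(str(gpu) for gpu in gpu_ids)
    # Copy the current environment and restrict the visible GPUs.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible)
    command = [sys.executable, "-m", "trainer.distribute", "--script", script]
    return command, env


# Usage (starts training, so it is left commented out here):
# import subprocess
# command, env = build_distributed_launch([0, 1])
# subprocess.run(command, env=env, check=True)
```

Using `sys.executable` keeps the launch inside the same Python environment that has Coqui/TTS installed.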


You can generate audio with the newly trained HiFi-GAN decoder:

python test.py



To monitor training, launch TensorBoard on the training output directory:

tensorboard --logdir [output_path]