jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
MIT License
1.83k stars · 493 forks

Minimum hours of data required for fine-tuning for a single unseen speaker #54

Open balag59 opened 3 years ago

balag59 commented 3 years ago

Thank you for your amazing work! For the TTS task, assuming the synthesizer (Tacotron 2) + vocoder has already been trained on a significant number of speakers, what would be the minimum amount of data required to fine-tune the vocoder to a new, unseen speaker? Would 5-10 hours be sufficient? An approximate figure would be helpful. To add more detail: this is for TTS in Hindi, and I plan to train Tacotron 2 + HiFi-GAN on ~150 hours of Hindi data with several hundred speakers before fine-tuning on the new unseen speaker. Thanks!

ghost commented 3 years ago

I have the exact same question.

jik876 commented 3 years ago

@balag59 Thank you. Please understand that we are a bit busy with other work. It will depend on how your dataset is composed and what quality you expect. In our internal experiments, there have been cases where we concluded that a 10-hour dataset is suitable for transfer learning. I would like to point out that this judgment may vary depending on your dataset and the quality you expect.

balag59 commented 3 years ago

> @balag59 Thank you. Please understand that we are a bit busy with other work. It will depend on how your dataset is composed and what quality you expect. In our internal experiments, there have been cases where we concluded that a 10-hour dataset is suitable for transfer learning. I would like to point out that this judgment may vary depending on your dataset and the quality you expect.

Thank you so much! This is very helpful!

balag59 commented 3 years ago

Hi, I went ahead and fine-tuned the universal model on ~10 hours of data as you suggested and trained for 100k steps (the same as the fine-tuning experiments in your paper, if I'm right). However, when running end-to-end inference with mels obtained from Tacotron 2, the result was completely unrelated to the speaker it was fine-tuned on and sounded like a degraded version of the universal model. Any idea what could be causing this? The only difference from the paper is that I do not possess the text for the speech, so I couldn't use teacher forcing for the Tacotron 2 outputs; I ended up using the mels extracted from the ground-truth audio itself. Any help on this would be greatly appreciated! Thanks!
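For reference, a minimal sketch of how such ground-truth mels can be dumped in the format the fine-tuning script expects. It assumes the mel_spectrogram helper from this repo's meldataset.py and the config_v1.json STFT parameters; the wavs/ directory and file layout are just examples, so adjust to your own config and paths.

```python
# Sketch: dump ground-truth mels as .npy files for fine-tuning.
# Assumes mel_spectrogram from this repo's meldataset.py and the
# config_v1.json parameters (22050 Hz, n_fft=1024, hop=256, 80 mels, fmax=8000).
import glob
import os

import numpy as np
import torch
from scipy.io.wavfile import read

from meldataset import mel_spectrogram

MAX_WAV_VALUE = 32768.0
os.makedirs('ft_dataset', exist_ok=True)

for wav_path in glob.glob('wavs/*.wav'):
    sr, audio = read(wav_path)                                   # int16 samples
    audio = torch.FloatTensor(audio / MAX_WAV_VALUE).unsqueeze(0)  # [1, T], in [-1, 1]
    mel = mel_spectrogram(audio, n_fft=1024, num_mels=80, sampling_rate=sr,
                          hop_size=256, win_size=1024, fmin=0, fmax=8000)
    # Saved as [num_mels, frames]; the file name must match the wav so the
    # fine-tuning dataloader can pair mel and audio.
    out_name = os.path.splitext(os.path.basename(wav_path))[0] + '.npy'
    np.save(os.path.join('ft_dataset', out_name), mel.squeeze(0).numpy())
```

With the .npy files in ft_dataset (names matching the wavs), fine-tuning from the universal checkpoint is then started with train.py and its fine-tuning flag, as described in the repo README.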

thepowerfuldeez commented 3 years ago

@balag59 Same here, but I generated the mels from the TTS model. After 50k iterations the overall audio quality increased, but the speaker identity was lost. I guess I should train a lot more.

thepowerfuldeez commented 3 years ago

[screenshot attached] @jik876 I guess it's not normal and I should grab more data?

donand commented 3 years ago

> Hi, I went ahead and fine-tuned the universal model on ~10 hours of data as you suggested and trained for 100k steps (the same as the fine-tuning experiments in your paper, if I'm right). However, when running end-to-end inference with mels obtained from Tacotron 2, the result was completely unrelated to the speaker it was fine-tuned on and sounded like a degraded version of the universal model. Any idea what could be causing this? The only difference from the paper is that I do not possess the text for the speech, so I couldn't use teacher forcing for the Tacotron 2 outputs; I ended up using the mels extracted from the ground-truth audio itself. Any help on this would be greatly appreciated! Thanks!

It cannot work if you don't use teacher forcing, since you need the generated spectrogram to be perfectly aligned with the WAV file. If you don't use teacher forcing, the spectrogram will have a different length w.r.t. the audio because of the random factors inside Tacotron 2, and the vocoder cannot learn anything. So if you don't have text, your only choice is to train HiFi-GAN with just the ground-truth spectrograms :(

I'm successfully fine-tuning HiFi-GAN with just 1.5-2 hours of data, using spectrograms generated by Tacotron 2 with forced alignment. I'm using the universal checkpoint as the starting model.
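For anyone following along, here is a rough sketch of what dumping teacher-forced mels can look like, assuming the NVIDIA Tacotron 2 code layout (hparams.py / model.py); the module names, checkpoint path, and dataloader are assumptions to adapt to your own fork. The key point is that the training-style forward pass consumes the ground-truth frames, so the output has exactly as many frames as the target:

```python
# Sketch: teacher-forced mel dump, assuming NVIDIA's Tacotron 2 code layout
# and a trained checkpoint; all names here are illustrative.
import numpy as np
import torch

from hparams import create_hparams   # assumption: NVIDIA tacotron2 repo layout
from model import Tacotron2

hparams = create_hparams()
model = Tacotron2(hparams).cuda().eval()
model.load_state_dict(torch.load('tacotron2_checkpoint.pt')['state_dict'])

with torch.no_grad():
    # `loader` is a hypothetical DataLoader with batch_size=1 that also yields
    # the utterance id, built from (text, mel) pairs as in the NVIDIA repo.
    for batch, utt_id in loader:
        x, _ = model.parse_batch(batch)   # x = (text, text_len, gt_mel, max_len, mel_len)
        # The training forward pass teacher-forces the decoder on gt_mel,
        # so mel_postnet has the same number of frames as the target.
        _, mel_postnet, _, _ = model(x)
        np.save(f'ft_dataset/{utt_id}.npy', mel_postnet[0].cpu().numpy())
```

One caveat: with a batch size larger than 1, the padded frames of shorter utterances would need to be trimmed using the output lengths before saving.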

Megh-Thakkar commented 3 years ago

> It cannot work if you don't use teacher forcing, since you need the generated spectrogram to be perfectly aligned with the WAV file. I'm successfully fine-tuning HiFi-GAN with just 1.5-2 hours of data, using spectrograms generated by Tacotron 2 with forced alignment. I'm using the universal checkpoint as the starting model.

Can you share a pipeline script for training on limited data? For alignment, are you using the Montreal Forced Aligner?

Thank you.

v-nhandt21 commented 2 years ago

> Hi, I went ahead and fine-tuned the universal model on ~10 hours of data as you suggested and trained for 100k steps (the same as the fine-tuning experiments in your paper, if I'm right). However, when running end-to-end inference with mels obtained from Tacotron 2, the result was completely unrelated to the speaker it was fine-tuned on and sounded like a degraded version of the universal model. Any idea what could be causing this? The only difference from the paper is that I do not possess the text for the speech, so I couldn't use teacher forcing for the Tacotron 2 outputs; I ended up using the mels extracted from the ground-truth audio itself. Any help on this would be greatly appreciated! Thanks!
>
> It cannot work if you don't use teacher forcing, since you need the generated spectrogram to be perfectly aligned with the WAV file. If you don't use teacher forcing, the spectrogram will have a different length w.r.t. the audio because of the random factors inside Tacotron 2, and the vocoder cannot learn anything. So if you don't have text, your only choice is to train HiFi-GAN with just the ground-truth spectrograms :(
>
> I'm successfully fine-tuning HiFi-GAN with just 1.5-2 hours of data, using spectrograms generated by Tacotron 2 with forced alignment. I'm using the universal checkpoint as the starting model.

I also tried to use the input from Tacotron 2 with teacher forcing, but it seems that the mel from Tacotron 2 has one more frame than the mel generated from the audio. Did you have this mismatch? Have you changed any code compared to the original in this repo? Thank you so much.
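In case it helps, a common workaround for such an off-by-one (not code from this repo; file names are just examples) is to trim the teacher-forced mel to the length of the mel computed from the audio, so the frames stay aligned with the waveform during fine-tuning:

```python
# Sketch: trim the Tacotron 2 mel to match the ground-truth mel length
# (a workaround for a one-frame mismatch; file names are illustrative).
import numpy as np

mel_taco = np.load('utt0001_taco.npy')  # [num_mels, T+1], teacher-forced output
mel_gt = np.load('utt0001_gt.npy')      # [num_mels, T], computed from the audio
n_frames = min(mel_taco.shape[-1], mel_gt.shape[-1])
np.save('ft_dataset/utt0001.npy', mel_taco[:, :n_frames])
```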