jik876 / hifi-gan

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Question about training time #65

Open max-padgett opened 3 years ago

max-padgett commented 3 years ago

I'm trying to run train.py on Google Colab Pro (a single 16 GB V100 GPU). Based on how long each epoch takes, training would need around 11 days, which isn't possible on Colab. Does that sound about right for this model, or is there a major bottleneck somewhere? Sorry for the basic question; I've been trying to speed it up and just want some kind of comparison.
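(For context, here is the back-of-envelope arithmetic behind an estimate like that. Every number below is an assumption, e.g. an LJSpeech-sized dataset, the batch size from config_v1.json, and a guessed per-epoch wall time; substitute your own measurements.)

```python
# Rough training-time estimate; all numbers here are assumptions.
clips = 13100                           # e.g. an LJSpeech-sized dataset
batch_size = 16                         # batch size in config_v1.json
steps_per_epoch = clips // batch_size   # ~818 optimizer steps per epoch
seconds_per_epoch = 300                 # guessed per-epoch wall time; measure yours
target_steps = 2_500_000                # training length used in the paper

epochs_needed = target_steps / steps_per_epoch
total_days = epochs_needed * seconds_per_epoch / 86400
print(f"~{epochs_needed:.0f} epochs, ~{total_days:.1f} days")  # ~3056 epochs, ~10.6 days
```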

jik876 commented 3 years ago

Thanks for your interest. How many steps do you want to train the model for? In our experiments, it took about two weeks to train the model up to 2.5M steps with two V100 GPUs. I'd like to suggest two ideas:

1. The model can achieve sufficient quality well before reaching the target training step, so it's a good idea to check the output quality repeatedly along the way.
2. We provide discriminator weights for the universal model, so conducting transfer learning from those weights is another good option.
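To make idea 2 concrete: train.py looks in --checkpoint_path for the newest checkpoint pair prefixed `g_` (generator) and `do_` (discriminators plus optimizer state) and resumes from it, so copying the universal weights into that directory before launching starts training from them. The sketch below mirrors the repo's utils.scan_checkpoint / load_checkpoint helpers; the directory name is a placeholder.

```python
# Sketch of the checkpoint discovery train.py performs at startup
# (modeled on utils.scan_checkpoint in this repo; 'cp_hifigan' is a placeholder).
import glob
import os
import torch

def scan_checkpoint(cp_dir, prefix):
    # Newest checkpoint whose filename starts with the prefix, e.g. g_02500000.
    cp_list = glob.glob(os.path.join(cp_dir, prefix + '*'))
    return sorted(cp_list)[-1] if cp_list else None

checkpoint_path = 'cp_hifigan'                   # value passed as --checkpoint_path
cp_g = scan_checkpoint(checkpoint_path, 'g_')    # universal generator weights
cp_do = scan_checkpoint(checkpoint_path, 'do_')  # universal discriminator weights

if cp_g and cp_do:
    state_g = torch.load(cp_g, map_location='cpu')
    state_do = torch.load(cp_do, map_location='cpu')
    # train.py then restores generator.load_state_dict(state_g['generator']),
    # the MPD/MSD discriminators and both optimizers from state_do, and
    # continues counting steps from state_do['steps'].
```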

max-padgett commented 3 years ago

Thank you for replying quickly. I'm doing a project based on modifying HiFi-GAN. Unfortunately, I'm strapped for time and didn't take the training time into account. So, two questions:

  1. At how many steps can discernible speech be achieved with the weights?
  2. How large does the dataset actually need to be to achieve this?

jik876 commented 3 years ago

It may vary depending on how different your training data is from the original training data, so it would be a good idea to check through experimentation. For reference, in our experiments a model fine-tuned for up to 100k steps synthesized good-quality audio. Since that was fine-tuning on the same speaker, your results may differ under your experimental conditions.
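A simple way to check the quality repeatedly during fine-tuning is to load an intermediate generator checkpoint and synthesize a sample, along the lines of the repo's inference.py. A minimal sketch, assuming it is run from the repo root; the checkpoint and mel filenames are placeholders:

```python
# Spot-check an intermediate checkpoint by vocoding one mel-spectrogram.
# Modeled on inference.py / inference_e2e.py from this repo; filenames are placeholders.
import json

import numpy as np
import torch
from scipy.io.wavfile import write

from env import AttrDict
from meldataset import MAX_WAV_VALUE
from models import Generator

with open('config_v1.json') as f:
    h = AttrDict(json.load(f))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
generator = Generator(h).to(device)

state = torch.load('cp_hifigan/g_00100000', map_location=device)  # placeholder checkpoint
generator.load_state_dict(state['generator'])
generator.eval()
generator.remove_weight_norm()

mel = torch.FloatTensor(np.load('sample_mel.npy')).to(device)  # placeholder mel
if mel.dim() == 2:
    mel = mel.unsqueeze(0)  # (num_mels, frames) -> (1, num_mels, frames)

with torch.no_grad():
    audio = generator(mel).squeeze() * MAX_WAV_VALUE

write('check_g_00100000.wav', h.sampling_rate, audio.cpu().numpy().astype('int16'))
```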