jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License

How many steps should we train to get the best results? #87

Open futureaiengineeer opened 2 years ago

futureaiengineeer commented 2 years ago

I trained my custom 10-hour 44.1 kHz dataset to 400k steps, but the model still doesn't synthesize good results. I wonder how many steps I should train the model for to get the best result?

ZJ-CAI commented 1 year ago

10 hours seems like a bit more than necessary. I used a 2-hour single-speaker dataset and got a good result at 270k steps (batch size 16).

Maybe the quality of your dataset is poor, or something else is wrong.

athenasaurav commented 1 year ago

Hello Everyone,

I'm training on a downloaded VCTK dataset (22050 Hz sampling rate) for the multi-speaker model. I have trained for 350,000 steps and the synthesis quality is still not as good as the pre-trained models in the repo. How many steps will get a similar result?

I resampled the dataset myself from 48000 Hz to 22050 Hz.

Dataset Source : https://www.kaggle.com/datasets/showmik50/vctk-dataset

athenasaurav commented 1 year ago

One update: I noticed that most of the audio files in my dataset have initial silence (before downsampling), so it remains in the 22050 Hz data as well.

I used pydub's AudioSegment.set_frame_rate to downsample the audio files, but I didn't use librosa to trim silence. Is it necessary to trim silence at the start and end of every file for better results?
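If trimming turns out to be necessary, I'm assuming something along these lines would do it (librosa.effects.trim; the top_db=30 threshold and the file names are just my guesses, not tuned values):

```python
import librosa
import soundfile as sf

# Load at the target rate (librosa resamples on load), then trim
# leading/trailing silence below the top_db threshold.
wav, sr = librosa.load("p225_001.wav", sr=22050)
trimmed, _ = librosa.effects.trim(wav, top_db=30)
sf.write("p225_001_trimmed.wav", trimmed, sr)
```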

nikich340 commented 1 year ago

The authors used 300k steps with batch size 64; start from that.

LanglyAdrian commented 1 year ago

@athenasaurav, did you end up having to remove the silence? I got to 100k steps, and when generating I get only silence. I thought maybe part of the problem is that I didn't cut the silence out of the dataset.

athenasaurav commented 1 year ago

@LanglyAdrian yes, silence does create some issues, but I started getting better results after more epochs.

LanglyAdrian commented 1 year ago

@athenasaurav can you look at this? I now doubt that the problem is the presence of silence. After 130k steps there should be at least some sounds, but here it's just silence.

athenasaurav commented 1 year ago

@LanglyAdrian I'm not sure what your problem could be. Can you share your inference code? Also, are you passing the speaker IDs from the VCTK data in the filelist?
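For reference, the multi-speaker filelists in this repo are pipe-separated `wav_path|speaker_id|text` lines, something like the following (the paths and text here are only placeholders; the `.cleaned` files store phonemized text):

```
DUMMY2/p225/p225_001.wav|0|Please call Stella.
DUMMY2/p226/p226_001.wav|1|Please call Stella.
```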

LanglyAdrian commented 1 year ago

@athenasaurav, for inference I use the code from the Colab notebook, but with my own weights.
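It's essentially the multi-speaker snippet from the repo's inference notebook, roughly like this (the checkpoint path and speaker ID below are placeholders for my own values):

```python
import torch
import commons
import utils
from models import SynthesizerTrn
from text.symbols import symbols
from text import text_to_sequence

def get_text(text, hps):
    # Convert raw text to a symbol-ID sequence, optionally interspersing blanks.
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)

hps = utils.get_hparams_from_file("./configs/vctk_base.json")

net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("/path/to/G_130000.pth", net_g, None)  # placeholder path

stn_tst = get_text("VITS is Awesome!", hps)
with torch.no_grad():
    x_tst = stn_tst.cuda().unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
    sid = torch.LongTensor([4]).cuda()  # placeholder speaker ID
    audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667,
                        noise_scale_w=0.8, length_scale=1)[0][0, 0].data.cpu().float().numpy()
```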

I'm using the original filelists and I've checked that all the wavs are in the correct places and match the information in the filelists.

Linghuxc commented 1 year ago

> Authors used 300k steps with batch = 64, start from that.

@nikich340 My batch_size is 8; do I need to train 2400k steps to get the same results? I can only train about 3k steps per hour right now, because an epoch takes about 5 minutes. That would take me about 800 hours to train, right? Is that reasonable? I am looking forward to your reply, thank you very much!
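(My rough arithmetic, assuming I would need to see the same total number of samples as the authors did:)

```python
ref_steps, ref_batch = 300_000, 64                     # authors' setting
my_batch = 8
equivalent_steps = ref_steps * ref_batch // my_batch   # 2,400,000 steps
steps_per_hour = 3_000                                 # my current throughput
hours = equivalent_steps / steps_per_hour              # ~800 hours
print(equivalent_steps, hours)
```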

athenasaurav commented 1 year ago

@Linghuxc

It doesn't work like that. I trained for around 350k steps with a batch size of 16 and got good quality; you can do the same with batch size 8. Batch size basically depends on your GPU memory.

Linghuxc commented 1 year ago

@athenasaurav

Ok, I see what you mean. Thank you very much for your answer!