Open futureaiengineeer opened 2 years ago
10 hours seems a little too long. I used a 2-hour single-speaker dataset and got a good result at 270K steps (batch size 16).
Maybe the quality of your dataset is poor, or something else is wrong.
Hello Everyone,
I'm training on a downloaded VCTK dataset (22050 Hz sampling rate) for the multi-speaker model. I have trained for 350000 steps, yet the synthesis quality is not as good as the pre-trained models in the repo. How many steps will it take to get a similar result?
I resampled the dataset myself from 48000 Hz to 22050 Hz.
Dataset Source : https://www.kaggle.com/datasets/showmik50/vctk-dataset
One update: I noticed that most of the audio files in my dataset have initial silence (before downsampling), so it remains in the 22050 Hz data as well.
I used pydub's AudioSegment.set_frame_rate to downsample the audio files, but didn't use librosa to trim silence. Is it necessary to trim the silence at the start and end of every file for better results?
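For reference, librosa's `librosa.effects.trim` cuts leading and trailing silence using a dB threshold (`top_db`) relative to the signal's peak. The core idea can be sketched in plain Python; this is a simplified fixed-amplitude version for illustration, not the library's exact algorithm, and the function name and threshold are assumptions:

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude is below threshold.

    A toy stand-in for librosa.effects.trim, which instead uses a dB
    threshold (top_db) relative to the clip's peak level.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is silence
    return samples[loud[0]:loud[-1] + 1]

print(trim_silence([0.0, 0.002, 0.5, -0.3, 0.001, 0.0]))  # [0.5, -0.3]
```

In a real pipeline you would load each wav (e.g. with `librosa.load`), trim, and write it back out, so that training examples start and end on speech rather than silence.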
Authors used 300k steps with batch = 64, start from that.
@athenasaurav, did you end up having to remove the silence? I got to 100k steps, and when generating I get only silence. I thought the problem might also be that I didn't cut the silence out of the dataset.
@LanglyAdrian Yes, silence does create some issues, but I started getting better results after more epochs.
@athenasaurav Can you look at this? I'm starting to doubt that the problem is the presence of silence. After 130k steps there should be at least some sound, but here it's just silence.
@LanglyAdrian I'm not sure what your problem could be. Can you share your inference code? Also, are you passing the speaker IDs as per the VCTK data in the filelist?
@athenasaurav For inference, I use the code from the Colab notebook, but with my own weights.
I'm using the original filelists and I've checked that all the wavs are in the correct places and match the information in the filelists.
> Authors used 300k steps with batch = 64, start from that.

@nikich340 My batch_size=8; do I need to train 2400k steps to get the same results? I can only train 3k steps per hour now, because an epoch takes about 5 minutes. That would take me about 800 hours of training, right? Is that reasonable? I am looking forward to your reply. Thank you very much!
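The arithmetic in the question can be checked directly. Assuming, as the question does, that total steps scale linearly (inversely) with batch size, a naive back-of-envelope calculation gives:

```python
# Naive linear scaling of training steps with batch size,
# reproducing the estimate in the comment above.
ref_steps, ref_batch = 300_000, 64  # authors' reported setup
my_batch = 8

scaled_steps = ref_steps * ref_batch // my_batch  # 8x more steps
steps_per_hour = 3_000                            # observed throughput
hours = scaled_steps / steps_per_hour

print(scaled_steps, hours)  # 2400000 800.0
```

Note this linear-scaling assumption is exactly what the reply below pushes back on; in practice, good quality was reported at far fewer steps than this estimate suggests.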
@Linghuxc
It doesn't work like that. I trained for around 350k steps with a batch size of 16 and got good quality. You can do the same with batch size 8. Batch size basically depends on your GPU memory.
@athenasaurav
Ok, I see what you mean. Thank you very much for your answer!
I trained my custom 10-hour 44.1 kHz dataset to 400k steps, but the model doesn't seem to synthesize good results. I wonder how many steps I should train the model for to get the best result?