Liujingxiu23 opened 2 days ago
I tried to train the model with hopsize=1024, about 23 tokens per second. I only changed the upsample_rates to [8,8,4,4] and num_samples to 71680. The training is running now, but the results seem poor: the synthesized waveform is not intelligible. What is a good config?
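(For context: the token rate of a frame-based codec is sample_rate / hop_length. The Python sketch below assumes 24 kHz audio, which the post does not state, but that assumption reproduces the reported ~23 tokens per second:)

```python
# Token rate of a frame-based codec: one token per hop.
# 24 kHz is an assumption -- the post does not state the sample rate.
sample_rate = 24000
hop_length = 1024
print(sample_rate / hop_length)  # ~23.4 tokens per second, matching "about 23"
```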
There are three key considerations to note (see the sketch after this list):
1. The downsampling process should adhere to the sampling-rate constraints.
2. When modifying the downsampling rate, the hop length and n_fft parameters should be adjusted accordingly.
3. If minimizing the number of tokens is your objective, I recommend using audio with a sampling rate of 16 kHz.
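A minimal consistency check along these lines might look like the following sketch. The specific rules encoded here (the product of the downsample factors equals the hop length, and n_fft is a multiple of the hop length) are assumptions based on common codec designs, not requirements stated by the maintainer:

```python
from math import prod

def check_codec_config(downsamples, sample_rate, hop_length, n_fft):
    """Sanity-check that downsampling, hop length, and n_fft agree.

    The exact rules are assumptions, not something this thread states
    explicitly; adjust them to match the actual model's requirements.
    """
    # Point 2: the product of the downsampling rates should equal the
    # hop length, so one token corresponds to exactly one frame.
    assert prod(downsamples) == hop_length, "downsample product != hop_length"
    # Point 1: assume the hop length should divide the sample rate so
    # the token rate is a whole number of tokens per second.
    assert sample_rate % hop_length == 0, "hop_length must divide sample_rate"
    assert n_fft % hop_length == 0, "n_fft should be a multiple of hop_length"
    return sample_rate // hop_length  # tokens per second
```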
Thank you for your reply! For the third point, yes, I just want to minimize the number of tokens to reduce the computation of the LLM part. Do you mean "sample_rate=16000 hopsize=600" may be a better choice?
There are many options. You can try the following configuration and then adjust the parameters from there:
downsamples=[8,5,4,4], sample_rate=16000, hop_length=640, n_fft=2560
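To illustrate, plugging the suggested numbers into that kind of check (purely arithmetic, no training involved) shows the configuration is internally consistent and yields 25 tokens per second:

```python
downsamples = [8, 5, 4, 4]
sample_rate = 16000
hop_length = 640
n_fft = 2560

assert 8 * 5 * 4 * 4 == hop_length  # downsample product matches hop length
print(sample_rate / hop_length)     # 25.0 tokens per second
print(n_fft / hop_length)           # 4.0 -> n_fft is 4x the hop length
```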
Thanks a lot! I will try to train using this config!