jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License
665 stars 38 forks

How to train the model at about 23 tokens/s, i.e. hopsize=1024 #35

Open Liujingxiu23 opened 2 days ago

Liujingxiu23 commented 2 days ago

I'm trying to train the model with hopsize=1024 (about 23 tokens per second). I only changed upsample_rates to [8,8,4,4] and num_samples to 71680. Training is running now, but the results don't seem good: the synthesized audio is not intelligible. What would be a good config?
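For context, the ~23 tokens per second in the title follows from sample_rate / hop_length. A minimal sketch of the arithmetic (not repository code), assuming the default 24 kHz audio:

```python
# Minimal sketch (not repository code): how the numbers in this issue relate,
# assuming the default 24 kHz sample rate.
from math import prod

sample_rate = 24000
upsample_rates = [8, 8, 4, 4]

hop_length = prod(upsample_rates)      # 8*8*4*4 = 1024, matches hopsize=1024
print(sample_rate / hop_length)        # ~23.4 tokens per second

num_samples = 71680
print(num_samples % hop_length == 0)   # True: 71680 = 70 * 1024
```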

jishengpeng commented 2 days ago

> I'm trying to train the model with hopsize=1024 (about 23 tokens per second). I only changed upsample_rates to [8,8,4,4] and num_samples to 71680. Training is running now, but the results don't seem good: the synthesized audio is not intelligible. What would be a good config?

There are three key considerations to note:

1. The downsampling process should adhere to the sampling-rate constraints.

2. When modifying the downsampling rate, the hop_length and n_fft parameters should be adjusted accordingly (a rough check is sketched after this list).

3. If minimizing the number of tokens is your objective, I recommend using audio with a sampling rate of 16 kHz.
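As a rough illustration of points 1 and 2, one could sanity-check a candidate configuration like this. This is only a sketch, not code from the repository; the n_fft ratio is an assumption taken from the configuration suggested later in this thread (n_fft = 4 × hop_length), and the num_samples value in the example call is hypothetical.

```python
# Hedged sketch, not repository code: consistency checks implied by points 1-2.
from math import prod

def check_codec_config(sample_rate, upsample_rates, hop_length, n_fft, num_samples):
    # The total upsampling factor of the decoder must equal the hop length.
    assert prod(upsample_rates) == hop_length, "upsample_rates must multiply to hop_length"
    # Assumption: n_fft is kept at a whole multiple of hop_length (4x in the config below).
    assert n_fft % hop_length == 0, "n_fft should be a multiple of hop_length"
    # The training crop should contain an integer number of frames.
    assert num_samples % hop_length == 0, "num_samples should be a multiple of hop_length"
    return sample_rate / hop_length  # resulting tokens per second

# Example with the configuration suggested below (num_samples here is hypothetical).
print(check_codec_config(16000, [8, 5, 4, 4], 640, 2560, 64000))  # 25.0 tokens/s
```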

Liujingxiu23 commented 2 days ago

Thank you for your reply! For the third point, yes, I just want to minimize the number of tokens to reduce the computation in the LLM part. Do you mean "sample_rate=16000, hopsize=600" might be a better choice?

jishengpeng commented 2 days ago

> Thank you for your reply! For the third point, yes, I just want to minimize the number of tokens to reduce the computation in the LLM part. Do you mean "sample_rate=16000, hopsize=600" might be a better choice?

There are many options. You can try the following configuration and then adjust the parameters from there,

such as downsamples=[8,5,4,4], sample_rate=16000, hop_length=640, n_fft=2560
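Reading off those numbers (a quick check, not from the repository): the downsample factors multiply to the hop length, n_fft is four hops, and the resulting rate is 25 tokens per second.

```python
from math import prod

downsamples = [8, 5, 4, 4]
sample_rate = 16000
hop_length = 640
n_fft = 2560

assert prod(downsamples) == hop_length   # 8*5*4*4 = 640
assert n_fft == 4 * hop_length           # 2560 = 4 * 640
print(sample_rate / hop_length)          # 25.0 tokens per second
```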

Liujingxiu23 commented 2 days ago

Thanks a lot! I will try to train using this config!