modelscope / FunCodec

FunCodec is a research-oriented toolkit for audio quantization and downstream applications, such as text-to-speech synthesis, music generation, etc.
https://funcodec.github.io/
MIT License

Training Funcodec: Data Sources and Recommendations for Starting From Scratch #25

Closed hertz-pj closed 8 months ago

hertz-pj commented 9 months ago

In your Funcodec paper, you mentioned that you used 25k hours of data for training the codec. Does this data include open-source datasets like Gigaspeech and WenetSpeech? If we want to train Funcodec from scratch, do you have any suggestions? Is it better to use more clean data without background noise or more data with noise?

ZhihaoDU commented 9 months ago
  1. Does this data include open-source datasets like Gigaspeech and WenetSpeech? Yes, but we filtered out some overly noisy data from the open-source datasets.

  2. If we want to train Funcodec from scratch, do you have any suggestions? There are no magic tricks; the main thing is to use as much clean data as possible. Pay attention to how often the discriminator is updated: it should be updated more than 30 times per 50 generator updates. If the count is too low, you can reduce the value of feat_match_loss_weight. You can also reconstruct some waveforms with intermediate checkpoints to evaluate training progress.

  3. Is it better to use more clean data without background noise or more data with noise? It depends on your downstream tasks. For TTS-related tasks, I recommend using more clean data without background noise. For ASR-related tasks, using more data may lead to better performance. But the most important thing is convergence: if the data is too noisy, the model cannot converge, and in that case you should filter out the noisy data.
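
The discriminator update check from point 2 could be tracked with a small counter like the one below. This is only a minimal sketch, not FunCodec code: the class `DiscUpdateMonitor`, its methods, and the `disc_updated` flag are hypothetical names, and it assumes your training loop can tell, for each generator update, whether the discriminator was also updated.

```python
from collections import deque


class DiscUpdateMonitor:
    """Hypothetical helper: counts discriminator updates over the
    last `window` generator updates (not part of FunCodec)."""

    def __init__(self, window=50, min_updates=30):
        self.window = window            # generator updates per window (50 in the thread)
        self.min_updates = min_updates  # the count should be larger than this (30 in the thread)
        self.history = deque(maxlen=window)  # 1 if the discriminator was updated, else 0

    def record(self, disc_updated):
        """Call once per generator update."""
        self.history.append(1 if disc_updated else 0)

    def count(self):
        """Discriminator updates within the current window."""
        return sum(self.history)

    def is_healthy(self):
        """True while the window is still filling, or when the
        count is larger than `min_updates`."""
        if len(self.history) < self.window:
            return True
        return self.count() > self.min_updates


# Toy usage: the discriminator is updated on every second generator update.
monitor = DiscUpdateMonitor()
for step in range(50):
    monitor.record(disc_updated=(step % 2 == 0))
print(monitor.count(), monitor.is_healthy())  # prints: 25 False
```

When `is_healthy()` returns False, that is the situation where the reply suggests reducing feat_match_loss_weight.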

hertz-pj commented 7 months ago

@ZhihaoDU Can you recommend some methods for data filtering?