Training with a Higher sample rate + steps to take for starting a training session?

hayeong0 / DDDM-VC

Official Pytorch Implementation for "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion" (AAAI 2024)

https://hayeong0.github.io/DDDM-VC-demo/

160 stars 18 forks source link

Training with a Higher sample rate + steps to take for starting a training session? #7

Closed SoshyHayami closed 3 months ago

SoshyHayami commented 4 months ago

Thanks for the wonderful work you've done.

I looked at the config.json in the ckpt folder, sounds like you trained the original model on 16khz. I know hop and win length must be tweaked if we're going to train on, say, 24khz. just to be on the safe side, can you tell me what exactly should I change and to what values for 24khz? there shouldn't be any problem with the Vocoder as there are hifigans trained on this sample rate.

Also, I'm not sure if I missed this or not, but I'd really appreciate it if you tell me, or even better, updating the readme on how to start the training, I'm a bit confused about the pre-processing steps, dataset format etc.

Anyway, I'm excited to try out this model, and would appreciate your help on these two questions.

hayeong0 commented 4 months ago

Thank you for showing interest in our work.

We trained at 16 kHz due to a fair comparison with other baseline models and computing resource constraints. However, based on our experience, training at 24 kHz can significantly improve the quality of the generated results.

To train at 24 kHz, as you mentioned, we adjust the hop and window sizes. I have used the following settings. If you have a vocoder trained at 24 kHz, you should use a compatible hop size and mel bin.
```
"sampling_rate": 24000,
"filter_length": 960,
"hop_length": 240,
"win_length": 960,
"n_mel_channels": 128, 
```
When extracting F0, we use a resolution that is four times higher than Mel. Therefore, you need to adjust the part that loads the F0 according to the hop size.

We have made the training code public, but I have been delayed in writing README it due to my job commitments.. I will finalize it soon. 😶‍🌫️

SoshyHayami commented 4 months ago

Thanks! then I'll wait until you find the time to update the readme. looking forward to train this model!

hayeong0 commented 3 months ago

@SoshyHayami I have updated the README. I hope it helps with your model training!

SoshyHayami commented 3 months ago

@SoshyHayami I have updated the README. I hope it helps with your model training!

Thank you very much! I'll start my training somewhere around next week, but I have a question if you don't mind answering, mostly about f0. do you think I should re-train the f0-vqvae on my dataset ? my data is Japanese, not English. I wonder if it can generalize well like Vocoders on other languages. and if I have to train it, should I train it on 24khz or can I simply do the pre-processing on a downsampled version of my dataset while using the original 24khz for the training?

And also can you elaborate a bit on When extracting F0, we use a resolution that is four times higher than Mel. Therefore, you need to adjust the part that loads the F0 according to the hop size. ?

Ashigarg123 commented 3 months ago

@SoshyHayami can you please point me to the trained hifigan model for libritts (trained on 24khz)?

SoshyHayami commented 3 months ago

@SoshyHayami can you please point me to the trained hifigan model for libritts (trained on 24khz)?

Sure, you can find it here

btw, if you managed to train this model, I'd appreciate it if you let me know.