hayeong0 / DDDM-VC

Official PyTorch Implementation for "DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion" (AAAI 2024)
https://hayeong0.github.io/DDDM-VC-demo/

Training from scratch? #12

Open SoshyHayami opened 2 months ago

SoshyHayami commented 2 months ago
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/DDDM-VC/train_dddmvc.py", line 271, in <module>
    main()
  File "/home/ubuntu/DDDM-VC/train_dddmvc.py", line 42, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/DDDM-VC/train_dddmvc.py", line 85, in run
    utils.load_checkpoint(path_ckpt, net_v, None)
  File "/home/ubuntu/DDDM-VC/utils.py", line 21, in load_checkpoint
    iteration = checkpoint_dict['iteration']
KeyError: 'iteration'

I'm getting this error. It sounds like your code throws an error if I don't feed it your base model, is that right?

The thing is, I don't want to fine-tune your base model; I need to train it from scratch.

There are also a couple of mistakes in your code, e.g. from hifigan.vocoder import HiFi is an invalid path and should be from vocoder.hifigan import HiFi, plus issues with the path to your hifigan, dependencies, etc.

So far I haven't been able to start a training session, but I hope it works with the regular HiFi-GAN checkpoints provided by its authors.

Ashigarg123 commented 2 months ago

Hello @SoshyHayami, as I understand it you need the HiFi-GAN and VQ-VAE checkpoints, but I don't think you necessarily need to give the checkpoint path of an already trained model. Are you sure your paths to HiFi-GAN and VQ-VAE are correct? Also, if you are loading a different HiFi-GAN checkpoint, could you show me which one? You might need to make changes to the load_checkpoint function (try printing the keys of the checkpoint dictionary to see whether 'iteration' exists).
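
A quick way to check is to load the checkpoint and print its top-level keys. A minimal sketch (not repository code; the path is a placeholder):

import torch

ckpt = torch.load("path/to/your/hifigan_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))
# A plain HiFi-GAN generator checkpoint usually only holds a 'generator' entry,
# while the traceback above shows utils.load_checkpoint expects at least an 'iteration' key.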

SoshyHayami commented 2 months ago

Yeah, upon further investigation I realized it has something to do with what you said. I used the same checkpoint I mentioned in the other thread; I think it was trained with the same config as the original HiFi-GAN repo, albeit with minor tweaks to make it 24 kHz. I used the same config to re-train HiFi-GAN, and it more or less worked there.

Should I simply manipulate the keys inside the HiFi-GAN checkpoint? I'd rather not do that, since it would become impossible for me to debug given how many things I've already touched to configure it for 24 kHz.

I hope the author releases their 24 kHz config and vocoder if they have one.

SoshyHayami commented 2 months ago

I looked at the keys; the author's vocoder checkpoint has these:

model
iteration
optimizer
learning_rate

while mine has just a single generator entry. So I think the voc_ckpt the author used is more than just a vocoder. Changing vocoders no longer seems trivial, so I may as well give up on this; there's no way I'm going to train a 16 kHz model.
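
For reference, re-wrapping a plain generator checkpoint into the key layout listed above could look roughly like this (an untested sketch; the filenames are placeholders and the key mapping is an assumption, not confirmed against the repo's utils):

import torch

gen_ckpt = torch.load("your_hifigan_generator.pth", map_location="cpu")  # plain ckpt with a 'generator' entry
wrapped = {
    "model": gen_ckpt["generator"],  # assumed to be the state dict utils.load_checkpoint reads into the model
    "iteration": 0,
    "optimizer": None,
    "learning_rate": 0.0,
}
torch.save(wrapped, "your_hifigan_generator_wrapped.pth")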

hayeong0 commented 2 months ago

Hello,

In the training code, the vocoder is loaded only to check the audio during validation, so it's okay to remove this part during training.

If you trained this model with the same mel settings as your desired 24 kHz vocoder, you can use that vocoder for validation and inference. It seems that parts such as the optimizer and iteration have been excluded from your checkpoint, but you can load the pre-trained vocoder in the following way, so please try it:


from vocoder.models import Generator

model = Generator(hps).cuda()                                     # hps: hyperparameters from your vocoder config
state_dict = load_checkpoint("your vocoder path", device="cuda")  # load the raw checkpoint dictionary
model.load_state_dict(state_dict['generator'])                    # HiFi-GAN checkpoints keep the weights under 'generator'
model.eval()
model.remove_weight_norm()
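
If load_checkpoint is not importable in your environment, a minimal stand-in (an assumption, not necessarily the repository's own helper) is just a torch.load wrapper:

import torch

def load_checkpoint(filepath, device="cuda"):
    # HiFi-GAN style checkpoints are plain torch pickles; the generator weights live under 'generator'
    return torch.load(filepath, map_location=device)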

SoshyHayami commented 2 months ago

Sure, thanks, I'll try that. In the meantime, while you're here, may I ask you to consider making a branch for the 24 kHz version that you mentioned you've trained yourself? It would be much better to try to reproduce that, if possible. Thank you!

SoshyHayami commented 2 months ago

After days of trying, I've run the pre-processing a few times, even at 16 kHz. I always end up with:

" f0_start = np.random.randint(0, max_f0_start) File "mtrand.pyx", line 747, in numpy.random.mtrand.RandomState.randint File "_bounded_integers.pyx", line 1254, in numpy.random._bounded_integers._rand_int64 ValueError: low >= high

Well guys, time to give up.

hayeong0 commented 2 months ago

Hello, @SoshyHayami

It is a NumPy error caused by an invalid range being passed to numpy.random.randint. If max_f0_start has become zero or negative, it indicates that an incorrectly calculated value is being used.

It appears that the error occurs at this part of the line: https://github.com/hayeong0/DDDM-VC/blob/9c1e57c621cc873ccf39785b61d01ea8cba6de75/data_loader.py#L50-L52

In this code, the value divided by 80 is used as the starting index for F0, because we use F0 at a resolution 4 times higher than that of the mel spectrogram. This adjustment ensures that the segment sizes match. For the mel spectrogram, we use a hop size of 320. Please adjust this part of the code accordingly.
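
For readers hitting the same error, here is a minimal sketch of the intended alignment (illustrative names and shapes, not the repository code), assuming a mel hop size of 320 and an F0 hop size of 80 (4x resolution):

import numpy as np

hop_ratio = 320 // 80                    # F0 frames per mel frame (4)
segment_mel_len = 35                     # mel frames per training segment (example value)
mel = np.random.randn(80, 120)           # (n_mels, n_mel_frames), stand-in for a real utterance
f0 = np.random.rand(120 * hop_ratio)     # F0 track with 4x as many frames as the mel

max_mel_start = mel.shape[-1] - segment_mel_len
if max_mel_start <= 0:
    # this is exactly the case that makes np.random.randint raise "low >= high"
    raise ValueError("utterance shorter than one training segment; filter or pad it")

mel_start = np.random.randint(0, max_mel_start)
f0_start = mel_start * hop_ratio         # same position expressed in F0 frames
mel_seg = mel[:, mel_start:mel_start + segment_mel_len]
f0_seg = f0[f0_start:f0_start + segment_mel_len * hop_ratio]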

Thanks.

SoshyHayami commented 2 months ago

@hayeong0

Hi, I understand it's because of the invalid range given to randint. I used the exact same pre-processing steps on a 16 kHz audio dataset, aggressively cutting the silent parts. But since I also get the "invalid value encountered in divide" warning mentioned in another issue, my best guess is that something in my data doesn't work well with the YAAPT algorithm, resulting in a wrong F0 calculation. I've used the same dataset with other models and they usually work great, so I can't pinpoint exactly where it goes wrong.
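
One way to narrow this down is to scan the extracted F0 files for empty, non-finite, or all-unvoiced tracks before training. A minimal sketch, assuming the F0 tracks were saved as .npy files under a directory of your choosing:

import glob
import numpy as np

for path in glob.glob("preprocessed/f0/**/*.npy", recursive=True):
    f0 = np.load(path)
    if f0.size == 0 or not np.isfinite(f0).all() or not (f0 > 0).any():
        print("suspicious F0 file:", path, f0.shape)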

By the way, are the numbers 80 and 1280 related to the number of mel bins and the window size, respectively? The vocoder I was going to use for the 24 kHz version uses this config:

{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 20,
    "learning_rate": 0.00005,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,

    "upsample_rates": [10,5,3,2],
    "upsample_kernel_sizes": [20,10,6,4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],

    "segment_size": 57600,
    "num_mels": 80,
    "num_freq": 1025,
    "n_fft": 2048,
    "hop_size": 300,
    "win_size": 1200,

    "sampling_rate": 24000,

    "fmin": 0,
    "fmax": 8000,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}

So I assumed all instances of 1280, whether in the utils or the extraction scripts, must be changed to 1200. (I should re-emphasize that the main F0 issue I described above happens even with the unmodified code on a 16 kHz dataset, using the checkpoints you provided, so what I'm saying here is unrelated to that.)
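
For what it's worth, the rough arithmetic under that assumption (treating 1280 as the 16 kHz window size and keeping the 4:1 mel-to-F0 ratio from the earlier comment; the 24 kHz values come from the config above):

sr_16k, hop_16k, win_16k, f0_hop_16k = 16000, 320, 1280, 80   # 20 ms frame shift, F0 at 4x the mel rate
sr_24k, hop_24k, win_24k = 24000, 300, 1200                   # from the vocoder config above
f0_hop_24k = hop_24k // 4                                     # 75, if the 4:1 ratio is kept
print(hop_16k / sr_16k, hop_24k / sr_24k)                     # 0.02 s vs 0.0125 s: the frame shift itself changes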

markrmiller commented 1 month ago

@SoshyHayami, for what it's worth, I trained the model at 16 kHz on 500-700 thousand wav files using CREPE for the pitch embeddings and had no issues (without paying any attention to silence, and with a very diverse dataset).

I'm currently trying to do a 24 kHz model, but I'm still fighting to get everything aligned (I'm trying to do it without extracting new CREPE pitch embeddings, as that would take me almost 2 days even with two 3090s on it).

What did you do with wav2vec? Just keep the same model and give it 16 kHz input?

@hayeong0 Thanks for sharing this awesome model by the way! I've been so disappointed with all of the publicly available voice conversion diffusion models. They pretty much all use the same core strategy and code, which invariably means they train on source voice to source voice, and regardless of the other innovations and tricks, the result is always poor.

Your prior mixup strategy is brilliant! I don't fully understand why it works so well, or why it seems to take far more positive style transfer than negative content transfer from it, but it does work remarkably well. I haven't had a chance to experiment much with my results yet, but from a couple of experiments, zero-shot conversion with over 4,000 speakers looked pretty darn impressive. It also nailed a couple of unique voices in the training data that always elude these other models.