openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

How to use pretrained models? #109

Closed med1844 closed 1 year ago

med1844 commented 1 year ago

Attempt on WSL2

I'm using Python 3.10.6, CUDA 11.8, torch 2.0.1, and Ubuntu 22.04.2 LTS on Windows 11 x86_64. I'm using the code in the refactor branch and trying to use the pretrained models listed in the release section (I tried most of them and none of them work). When I try to finetune on a custom dataset, it reports the following error:

| model Trainable Parameters: 66.557M
Traceback (most recent call last):
  File "run.py", line 15, in <module>
    run_task()
  File "run.py", line 11, in run_task
    task_cls.start()
  File "/home/med/projects/svs/DiffSinger/basics/base_task.py", line 242, in start
    trainer.fit(task)
...
  File "/home/med/projects/svs/DiffSinger/utils/pl_utils.py", line 712, in restore_training_state
    optimizer.load_state_dict(opt_state)
  File "/home/med/.conda/envs/diff/lib/python3.8/site-packages/torch/optim/optimizer.py", line 390, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

NOTE: I have added some print statements in pl_utils.py for more information. Thus, the line information might differ from the original code.

Here's what I have done so far:

  1. use pipelines/no_midi_preparation.ipynb to generate config.
  2. run $ CUDA_VISIBLE_DEVICES=0 python run.py --config data/name/config.yaml --exp_name 0703_name_ds1000 and immediately kill it once the tqdm bar of the training process shows up. This helps me create an experiment folder with all information required in the checkpoints/ folder.
  3. copy pretrained .ckpt file, such as model_ckpt_steps_360000.ckpt, into that newly created experiment folder.
  4. rerun that training command. Then I got the error shown above.
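
Why step 3 matters: the backtrace further down passes through restore_state_if_checkpoint_exists, which suggests the trainer restores whatever checkpoint it finds in the experiment folder, optimizer state included. A minimal Python sketch of that kind of lookup, assuming checkpoints are named model_ckpt_steps_<N>.ckpt and the highest step wins (find_last_checkpoint is a hypothetical helper, not the repository's function):

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


# Hypothetical helper: an illustration only, NOT the repository's implementation.
def find_last_checkpoint(work_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step count, or None if there is none."""
    pattern = re.compile(r"model_ckpt_steps_(\d+)\.ckpt$")
    candidates = []
    for path in work_dir.glob("model_ckpt_steps_*.ckpt"):
        match = pattern.search(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Pick the entry with the largest step number.
    return max(candidates, key=lambda c: c[0])[1]


# Simulate an experiment folder holding a fresh checkpoint and a copied pretrained one.
with tempfile.TemporaryDirectory() as tmp:
    exp_dir = Path(tmp)
    (exp_dir / "model_ckpt_steps_2000.ckpt").touch()
    (exp_dir / "model_ckpt_steps_360000.ckpt").touch()
    last = find_last_checkpoint(exp_dir)
    print(last.name)
```

Under that assumption, the copied pretrained file would be treated as a training run to resume, not just as initial weights.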

The debugger shows the following information:

[13] > /home/med/.local/lib/python3.10/site-packages/torch/optim/optimizer.py(390)load_state_dict()
-> raise ValueError("loaded state dict contains a parameter group "
(Pdb++) bt
[0]   /usr/lib/python3.10/pdb.py(1726)main()
-> pdb._runscript(mainpyfile)
[1]   /usr/lib/python3.10/pdb.py(1586)_runscript()
-> self.run(statement)
[2]   /usr/lib/python3.10/bdb.py(597)run()
-> exec(cmd, globals, locals)
[3]   <string>(1)<module>()
[4]   /home/med/projects/svs/DiffSinger/run.py(15)<module>()
-> run_task()
[5]   /home/med/projects/svs/DiffSinger/run.py(11)run_task()
-> task_cls.start()
[6]   /home/med/projects/svs/DiffSinger/basics/base_task.py(242)start()
-> trainer.fit(task)
[7]   /home/med/projects/svs/DiffSinger/utils/pl_utils.py(497)fit()
-> self.run_pretrain_routine(model)
[8]   /home/med/projects/svs/DiffSinger/utils/pl_utils.py(549)run_pretrain_routine()
-> self.restore_weights(model)
[9]   /home/med/projects/svs/DiffSinger/utils/pl_utils.py(625)restore_weights()
-> self.restore_state_if_checkpoint_exists(model)
[10]   /home/med/projects/svs/DiffSinger/utils/pl_utils.py(663)restore_state_if_checkpoint_exists()
-> self.restore(last_ckpt_path, self.on_gpu)
[11]   /home/med/projects/svs/DiffSinger/utils/pl_utils.py(680)restore()
-> self.restore_training_state(checkpoint)
[12]   /home/med/projects/svs/DiffSinger/utils/pl_utils.py(712)restore_training_state()
-> optimizer.load_state_dict(opt_state)
[13] > /home/med/.local/lib/python3.10/site-packages/torch/optim/optimizer.py(390)load_state_dict()
-> raise ValueError("loaded state dict contains a parameter group "
(Pdb++) ll
 371         def load_state_dict(self, state_dict):
 372             r"""Loads the optimizer state.
 373
 374             Args:
 375                 state_dict (dict): optimizer state. Should be an object returned
 376                     from a call to :meth:`state_dict`.
 377             """
 378             # deepcopy, to be consistent with module API
 379             state_dict = deepcopy(state_dict)
 380             # Validate the state_dict
 381             groups = self.param_groups
 382             saved_groups = state_dict['param_groups']
 383
 384             if len(groups) != len(saved_groups):
 385                 raise ValueError("loaded state dict has a different number of "
 386                                  "parameter groups")
 387             param_lens = (len(g['params']) for g in groups)
 388             saved_lens = (len(g['params']) for g in saved_groups)
 389             if any(p_len != s_len for p_len, s_len in zip(param_lens, saved_lens)):
 390  ->             raise ValueError("loaded state dict contains a parameter group "
 391                                  "that doesn't match the size of optimizer's group")
 392
 393             # Update the state
 394             id_map = {old_id: p for old_id, p in
 395                       zip(chain.from_iterable((g['params'] for g in saved_groups)),
 396                           chain.from_iterable((g['params'] for g in groups)))}

It seems to be caused by a difference in "params". Here's the information I got from pdb:

(Pdb++) len(groups), len(saved_groups)
(1, 1)
(Pdb++) type(groups[0]["params"]), type(saved_groups[0]["params"])
(<class 'list'>, <class 'list'>)
(Pdb++) len(groups[0]["params"]), len(saved_groups[0]["params"])
(217, 219)
(Pdb++) type(groups[0]["params"][0]), type(saved_groups[0]["params"][0])
(<class 'torch.nn.parameter.Parameter'>, <class 'int'>)
(Pdb++) groups[0]["params"][0], saved_groups[0]["params"][0]
(Parameter containing:
tensor([[[ 0.0634],
         [ 0.0418],
         [-0.0560],
         ...,
         [-0.0589],
         [ 0.0567],
         [-0.0958]],

        [[ 0.0058],
         [ 0.2732],
         [-0.0860],
         ...,
         [ 0.0154],
         [ 0.0657],
         [-0.0192]],

        [[-0.0377],
         [-0.0606],
         [ 0.1654],
         ...,
         [-0.0720],
         [-0.0208],
         [ 0.1448]],

        ...,

        [[ 0.0625],
         [-0.0472],
         [ 0.0012],
         ...,
         [ 0.0243],
         [ 0.1030],
         [-0.0037]],

        [[-0.0535],
         [ 0.0042],
         [ 0.0161],
         ...,
         [-0.0007],
         [ 0.0017],
         [ 0.0334]],

        [[ 0.0055],
         [-0.0117],
         [ 0.0081],
         ...,
         [ 0.0122],
         [ 0.0016],
         [-0.0213]]], device='cuda:0', requires_grad=True), 0)
(Pdb++) saved_groups[0]["params"]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218]

It seems that the saved params are very different from what just got initialized in the new AdamW instance.
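
For context, integer entries in saved_groups[0]["params"] are expected: an optimizer's state_dict() stores parameter indices, not tensors, so the type difference is not the problem. The check that fails is the length comparison on line 389 of the listing (217 live parameters against 219 saved indices). A pure-Python sketch of that check, with plain lists standing in for the real groups (check_param_groups and the placeholder strings are hypothetical):

```python
def check_param_groups(groups, saved_groups):
    """Mirror the validation at the top of load_state_dict in the listing above."""
    if len(groups) != len(saved_groups):
        raise ValueError("loaded state dict has a different number of parameter groups")
    param_lens = (len(g["params"]) for g in groups)
    saved_lens = (len(g["params"]) for g in saved_groups)
    if any(p_len != s_len for p_len, s_len in zip(param_lens, saved_lens)):
        raise ValueError("loaded state dict contains a parameter group "
                         "that doesn't match the size of optimizer's group")


# Stand-ins: the live optimizer holds 217 parameter objects, the checkpoint 219 indices.
live = [{"params": [f"param_{i}" for i in range(217)]}]
saved = [{"params": list(range(219))}]

try:
    check_param_groups(live, saved)
    error = None
except ValueError as exc:
    error = str(exc)
    print(error)

# Element types play no role in the check: equal counts pass even with ints on one side.
check_param_groups(live, [{"params": list(range(217))}])
```

The two extra saved entries suggest the checkpoint came from a model with two more parameter tensors than the one built from the current config.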

Attempt on Docker container

I would like to fully rule out issues caused by an incorrect torch, CUDA, or even Python version. I don't want to break the environment on my host machine, as I have other projects in development. Thus, I created a Dockerfile to test this:

FROM nvidia/cuda:11.1.1-runtime-ubuntu20.04

RUN apt-get update

# install python3.8
RUN apt-get install python3.8 python3-pip -y

# install pytorch
RUN pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111

# copy everything in the current repo to /app
COPY . /app

# install requirements
RUN pip3 install -r /app/requirements.txt

# set working directory to /app
WORKDIR /app

# install librosa dependencies
RUN apt-get install libsndfile1 -y

# start in interactive mode
CMD ["bash"]

Build and run it:

DiffSinger on refactor [x!?] via v3.10.6
> docker build .
[+] Building 82.3s (13/13) FINISHED
 => [internal] load build definition from Docke  0.0s
 => => transferring dockerfile: 692B             0.0s
 => [internal] load .dockerignore                0.0s
 => => transferring context: 51B                 0.0s
 => [internal] load metadata for docker.io/nvid  0.1s
 => [1/8] FROM docker.io/nvidia/cuda:11.1.1-run  0.0s
 => [internal] load build context                0.7s
 => => transferring context: 4.46MB              0.6s
 => CACHED [2/8] RUN apt-get update              0.0s
 => CACHED [3/8] RUN apt-get install python3.8   0.0s
 => CACHED [4/8] RUN pip3 install torch==1.8.2   0.0s
 => [5/8] COPY . /app                           16.3s
 => [6/8] RUN pip3 install -r /app/requirement  46.8s
 => [7/8] WORKDIR /app                           0.0s
 => [8/8] RUN apt-get install libsndfile1 -y     2.0s
 => exporting to image                          16.3s
 => => exporting layers                         16.3s
 => => writing image sha256:203f33d32c6ecf3b567  0.0s

DiffSinger on refactor [x!?] via v3.10.6 took 5h19m50s
> docker run --rm -it --runtime=nvidia --gpus all 203f33

==========
== CUDA ==
==========

CUDA Version 11.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

root@5dfa969f9c09:/app# nvidia-smi
Sat Jul  8 01:26:07 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 Ti     On  | 00000000:2D:00.0  On |                  N/A |
|  0%   37C    P8              22W / 290W |   2801MiB /  8192MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        35      G   /Xwayland                                 N/A      |
|    0   N/A  N/A     27381      C   /python3.10                               N/A      |
+---------------------------------------------------------------------------------------+
root@5dfa969f9c09:/app# CUDA_VISIBLE_DEVICES=0 python3 run.py --config data/name/config.yaml --exp_name 0703_name_ds1000
| Hparams chains:  ['configs/basics/base.yaml', 'configs/basics/fs2.yaml', 'configs/acoustic/nomidi.yaml', 'data/name/config.yaml']
| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, augmentation_args: {},
base_config: [], binarization_args: {'shuffle': True, 'with_align': True, 'with_f0': True, 'with_f0cwt': True, 'with_spk_embed': False, 'with_txt': True, 'with_wav': False}, binarizer_cls: data_gen.acoustic.AcousticBinarizer, binary_data_dir: data/name/binary, check_val_every_n_epoch: 10,
clip_grad_norm: 1, content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2,
cwt_loss: l1, cwt_std_scale: 0.8, datasets: ['opencpop'], debug: False, dec_ffn_kernel_size: 9,
dec_layers: 4, decay_steps: 50000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet,
diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'],
dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_ffn_kernel_size: 9, enc_layers: 4,
encoder_K: 8, encoder_type: fft, endless_ds: True, f0_embed_type: continuous, ffn_act: gelu,
ffn_padding: SAME, fft_size: 2048, fmax: 16000, fmin: 40, g2p_dictionary: checkpoints/0703_name_ds1000/opencpop-extension.txt,
gamma: 0.5, gaussian_start: True, gen_dir_name: , gen_tgt_spk_id: -1, hidden_size: 256,
hop_size: 512, infer: False, keep_bins: 128, lambda_commit: 0.25, lambda_energy: 0.0,
lambda_f0: 0.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, lambda_word_dur: 0.0,
load_ckpt: , log_interval: 100, loud_norm: False, lr: 0.0004, max_beta: 0.02,
max_epochs: 1000, max_eval_sentences: 1, max_eval_tokens: 60000, max_frames: 8000, max_input_tokens: 1550,
max_sentences: 12, max_tokens: 80000, max_updates: 320000, mel_loss: ssim:0.5|l1:0.5, mel_vmax: 1.5,
mel_vmin: -6.0, min_level_db: -120, norm_type: gn, num_ckpt_keep: 5, num_heads: 2,
num_sanity_val_steps: 1, num_spk: 1, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.9,
optimizer_adam_beta2: 0.98, original_g2p_dictionary: dictionaries/opencpop-extension.txt, out_wav_norm: False, permanent_ckpt_interval: 40000, permanent_ckpt_start: 120000,
pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: log,
pitch_type: frame, pndm_speedup: 10, pre_align_args: {'allow_no_txt': False, 'denoise': False, 'forced_align': 'mfa', 'txt_processor': 'en', 'use_sox': False, 'use_tone': True}, pre_align_cls: , predictor_dropout: 0.5,
predictor_grad: 0.0, predictor_hidden: -1, predictor_kernel: 5, predictor_layers: 5, prenet_dropout: 0.5,
prenet_hidden_size: 256, pretrain_fs_ckpt: , processed_data_dir: , profile_infer: False, raw_data_dir: ['data/name/raw'],
ref_norm_layer: bn, rel_pos: True, reset_phone_dict: True, residual_channels: 512, residual_layers: 20,
save_best: False, save_ckpt: True, save_codes: ['configs', 'modules', 'src', 'utils'], save_f0: True, save_gt: False,
schedule_type: linear, seed: 1234, sort_by_len: True, speakers: ['name'], spec_max: [0],
spec_min: [-5], spk_cond_steps: [], stop_token_weight: 5.0, task_cls: src.naive_task.NaiveTask, test_ids: [],
test_input_dir: , test_num: 0, test_prefixes: ['NormalSpeech_0', 'NormalSpeech_18', 'NormalSpeech_27', 'NormalSpeech_66'], test_set_name: test, timesteps: 1000,
train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: False, use_gt_f0: False,
use_key_shift_embed: False, use_midi: False, use_nsf: True, use_pitch_embed: True, use_pos_embed: True,
use_speed_embed: False, use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_uv: False,
use_var_enc: False, val_check_interval: 2000, valid_num: 0, valid_set_name: valid, validate: False,
vocoder: NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/model, warmup_updates: 2000, wav2spec_eps: 1e-6, weight_decay: 0,
win_size: 2048, work_dir: checkpoints/0703_name_ds1000,
| load phoneme set: ['AP', 'E', 'En', 'SP', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'i0', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'ir', 'iu', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn', 'w', 'x', 'y', 'z', 'zh']
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
| Load HifiGAN:  checkpoints/nsf_hifigan/model
Removing weight norm...
| Load HifiGAN:  checkpoints/nsf_hifigan/model
Removing weight norm...
07/08 01:26:28 AM gpu available: True, used: True
| Copied codes to checkpoints/0703_name_ds1000/codes/20230708012628.
| model Arch:  GaussianDiffusion(
  ...
)
| model Trainable Parameters: 66.557M
Traceback (most recent call last):
  File "run.py", line 15, in <module>
    run_task()
  File "run.py", line 11, in run_task
    task_cls.start()
  File "/app/basics/base_task.py", line 242, in start
    trainer.fit(task)
  File "/app/utils/pl_utils.py", line 497, in fit
    self.run_pretrain_routine(model)
  File "/app/utils/pl_utils.py", line 549, in run_pretrain_routine
    self.restore_weights(model)
  File "/app/utils/pl_utils.py", line 625, in restore_weights
    self.restore_state_if_checkpoint_exists(model)
  File "/app/utils/pl_utils.py", line 663, in restore_state_if_checkpoint_exists
    self.restore(last_ckpt_path, self.on_gpu)
  File "/app/utils/pl_utils.py", line 680, in restore
    self.restore_training_state(checkpoint)
  File "/app/utils/pl_utils.py", line 712, in restore_training_state
    optimizer.load_state_dict(opt_state)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 146, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
root@5dfa969f9c09:/app#

My questions are:

  1. Am I using the wrong branch of the code to finetune, or is it simply impossible to do that? If it's the wrong branch, then what is the correct branch that I should use to finetune these pretrained models?
  2. What's the root cause of the optimizer's param mismatch? What should I do to solve that?

Thanks.

yqzhishen commented 1 year ago

What version of Lightning are you using?

med1844 commented 1 year ago

What version of Lightning are you using?

On WSL2:

> pip list | rg lightning
lightning-utilities     0.5.0
pytorch-lightning       1.8.6

On Docker Container:

root@7f696c563553:/app# pip list | grep lightning
pytorch-lightning       0.7.1

yqzhishen commented 1 year ago

Please strictly follow the version defined in requirements.txt. By the way, which checkpoint are you fine tuning? Have you tried PyTorch 1.13, or training from scratch?
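
A quick way to compare installed packages against the pins is sketched below in Python. It only understands lines of the form name==version, and the sample pin list is a made-up stand-in, not the repository's actual requirements.txt; parse_pins and check_pins are hypothetical helpers.

```python
from importlib import metadata


def parse_pins(text):
    """Collect {name: version} from 'name==version' lines, skipping comments and blanks."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed)} for every pin that is missing or different."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


# Illustrative pins only (assumption): replace with the text of the real requirements file.
sample = """
# illustrative pins only
pytorch-lightning==0.7.1
some-package-that-is-not-installed==1.0
"""
pins = parse_pins(sample)
for name, (wanted, installed) in check_pins(pins).items():
    print(f"{name}: wanted {wanted}, installed {installed}")
```

To check a real file, read its text and pass it to parse_pins in place of sample.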

med1844 commented 1 year ago

Please strictly follow the version defined in requirements.txt.

In the refactor branch, it seems that the version defined is exactly 0.7.1 (source). That's why I wonder if I'm using the wrong branch. Should I try out refactor-v2?

By the way, which checkpoint are you fine tuning?

Here's a list of checkpoints I have tried to finetune:

Have you tried PyTorch 1.13?

No, as it's not mentioned in the README or the tutorials (in the refactor branch). I will try it now.

Have you tried training from scratch?

No. I don't have enough data (~12 min). I understand that I could add Opencpop and train from scratch, but I would prefer to try low-cost solutions first before actually renting A100/H100 servers.

yqzhishen commented 1 year ago

Are you using the same hyper-parameters when finetuning from a checkpoint? If not, the state dict and optimizer states may not match and cause the above error.
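
One way to look for such a mismatch is to compare the hyper-parameters saved with the checkpoint against the ones in the new config. A minimal Python sketch, using plain dicts as stand-ins for the two loaded YAML files (the values are illustrative and diff_hparams is a hypothetical helper):

```python
def diff_hparams(old, new):
    """Return {key: (old_value, new_value)} for keys that differ or exist on one side only."""
    missing = object()  # sentinel so a stored None is not confused with an absent key
    changed = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key, missing), new.get(key, missing)
        if a != b:
            changed[key] = (None if a is missing else a, None if b is missing else b)
    return changed


# Illustrative values only; in practice both dicts would be loaded from the YAML configs.
checkpoint_hparams = {"hidden_size": 256, "residual_layers": 20, "use_speed_embed": True}
current_hparams = {"hidden_size": 256, "residual_layers": 20, "use_speed_embed": False}

for key, (old, new) in diff_hparams(checkpoint_hparams, current_hparams).items():
    print(f"{key}: checkpoint={old!r} current={new!r}")
```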

However, finetuning on a small dataset from a checkpoint may not be as useful as you think. Training multi-speaker models together with other large datasets is still the recommended solution.

med1844 commented 1 year ago

Are you using the same hyper-parameters when finetuning from a checkpoint? If not, the state dict and optimizer states may not match and cause the above error.

That's very helpful. Here's what difftool shows me:

@@ -174,7 +184,7 @@ use_midi: false
 use_nsf: true
 use_pitch_embed: true
 use_pos_embed: true
-use_speed_embed: false
+use_speed_embed: true
 use_spk_embed: false
 use_spk_id: false
 use_split_spk_id: false

After changing use_speed_embed to true, the model could be loaded correctly:

07/08 05:06:24 AM model and trainer restored from checkpoint: checkpoints/0703_name_ds1000/model_ckpt_steps_360000.ckpt
Validation sanity check:   0%|                                  | 0/1 [00:00<?, ?batch/s]

Unfortunately, there's no "speed" information in my dataset, so it crashed. But since it loads, the optimizer issue has been solved. Thank you again for pointing this out.

However, finetuning on a small dataset from a checkpoint may not be as useful as you think. Training multi-speaker models together with other large datasets is still the recommended solution.

I wonder if there's some kind of key difference between DiffSinger and SVC systems, as most SVC systems seem to work well when it comes to finetuning.

Nevertheless, that's helpful advice, and I appreciate it a lot.

yqzhishen commented 1 year ago

use_speed_embed is an option related to time stretching augmentation, and if you didn't turn that on, do not change it to True.

Fine-tuning is always a workaround for cases where the original training data cannot be accessed. SVC systems use pre-trained models trained on large corpora because they do not need labeling, they do not have dictionaries, and for ease of use.

Another point is that DiffSinger is far more flexible in model architecture than most SVC systems (their flexibility is mostly related to tricks at inference time).

In addition, fine-tuning may cause leakage in timbre or styles from the original checkpoint. SVS users are far more sensitive to these aspects than SVC users, because SVS systems do not have timbre leakage at all without fine-tuning. Fine-tuning may be very suitable for very large, generic models (like LLMs), but speech models are relatively small and specific.

med1844 commented 1 year ago

In addition, fine-tuning may cause leakage in timbre or styles from the original checkpoint.

That's the main reason why I want to try out SVS systems. SVC systems cannot capture the speaker styles, which is crucial when similarity matters. Thanks for pointing this out.

I will try to train from scratch then. Thank you again for the detailed insights!