prophesier / diff-svc

Singing Voice Conversion via diffusion model
GNU Affero General Public License v3.0

Python Notebook fails on training #37

Open jamierpond opened 1 year ago

jamierpond commented 1 year ago

Hi! Just trying to run a simple demo. I'm currently following all the demos/examples/other people's tutorials and have everything set up the same as they do, but I keep getting errors. Would you please point me in the right direction?

/content/diff-svc
| Hparams chains:  ['/content/diff-svc/training/config_nsf.yaml']
| Hparams: 
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, binarization_args: {'shuffle': False, 'with_align': True, 'with_f0': True, 'with_hubert': True, 'with_spk_embed': False, 'with_wav': False}, 
binarizer_cls: preprocessing.SVCpre.SVCBinarizer, binary_data_dir: data/binary/neer, check_val_every_n_epoch: 10, choose_test_manually: False, clip_grad_norm: 1, 
config_path: training/config_nsf.yaml, content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, 
cwt_loss: l1, cwt_std_scale: 0.8, datasets: ['opencpop'], debug: False, dec_ffn_kernel_size: 9, 
dec_layers: 4, decay_steps: 20000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet, 
diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], 
dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 5, enc_ffn_kernel_size: 9, enc_layers: 4, 
encoder_K: 8, encoder_type: fft, endless_ds: True, f0_bin: 256, f0_max: 1100.0, 
f0_min: 40.0, ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000, 
fmin: 40, fs2_ckpt: , gaussian_start: True, gen_dir_name: , gen_tgt_spk_id: -1, 
hidden_size: 256, hop_size: 512, hubert_gpu: True, hubert_path: checkpoints/hubert/hubert_soft.pt, infer: False, 
keep_bins: 128, lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 1.0, lambda_ph_dur: 0.3, 
lambda_sent_dur: 1.0, lambda_uv: 1.0, lambda_word_dur: 1.0, load_ckpt: /content/diff-svc/pretrain/nehito.ckpt, log_interval: 100, 
loud_norm: False, lr: 0.0008, max_beta: 0.02, max_epochs: 3000, max_eval_sentences: 1, 
max_eval_tokens: 60000, max_frames: 42000, max_input_tokens: 60000, max_sentences: 12, max_tokens: 128000, 
max_updates: 1000000, mel_loss: ssim:0.5|l1:0.5, mel_vmax: 1.5, mel_vmin: -6.0, min_level_db: -120, 
no_fs2: True, norm_type: gn, num_ckpt_keep: 10, num_heads: 2, num_sanity_val_steps: 1, 
num_spk: 1, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, 
out_wav_norm: False, pe_ckpt: checkpoints/0102_xiaoma_pe/model_ckpt_steps_60000.ckpt, pe_enable: False, perform_enhance: True, pitch_ar: False, 
pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], pitch_extractor: parselmouth, pitch_loss: l2, pitch_norm: log, pitch_type: frame, 
pndm_speedup: 10, pre_align_args: {'allow_no_txt': False, 'denoise': False, 'forced_align': 'mfa', 'txt_processor': 'zh_g2pM', 'use_sox': True, 'use_tone': False}, pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.1, 
predictor_hidden: -1, predictor_kernel: 5, predictor_layers: 5, prenet_dropout: 0.5, prenet_hidden_size: 256, 
pretrain_fs_ckpt: , processed_data_dir: xxx, profile_infer: False, raw_data_dir: data/raw/neer, ref_norm_layer: bn, 
rel_pos: True, reset_phone_dict: True, residual_channels: 384, residual_layers: 20, save_best: False, 
save_ckpt: True, save_codes: ['configs', 'modules', 'src', 'utils'], save_f0: True, save_gt: False, schedule_type: linear, 
seed: 1234, sort_by_len: True, speaker_id: neer, spec_max: [-0.07976219058036804, 0.3064012825489044, 0.45079874992370605, 0.48896849155426025, 0.38102585077285767, 0.5545408129692078, 0.6556591391563416, 0.5011460781097412, 0.7585625052452087, 0.7933887243270874, 0.7276718020439148, 0.6568117141723633, 0.8160334825515747, 0.7098748087882996, 0.7070586681365967, 0.9631615281105042, 0.8693066835403442, 0.8992214202880859, 0.8334618210792542, 0.9382892847061157, 0.761588454246521, 1.0139938592910767, 0.8147022128105164, 0.8377708196640015, 0.8404781818389893, 0.5279245376586914, 0.7715780735015869, 0.5754967331886292, 0.19373822212219238, 0.11457031220197678, -0.048836078494787216, 0.2835775315761566, 0.1506994366645813, -0.016768964007496834, 0.07266628742218018, -0.05616551637649536, -0.010572524741292, 0.1133032739162445, 0.16342110931873322, 0.035064052790403366, 0.3116454482078552, 0.16785651445388794, 0.1354154646396637, 0.36229264736175537, 0.372775673866272, -0.10152062773704529, 0.22035335004329681, 0.183604434132576, 0.04665748029947281, 0.23221279680728912, 0.21843412518501282, 0.049887944012880325, -0.05100967362523079, -0.0010432127164676785, -0.06516791135072708, 0.07901491224765778, -0.18570756912231445, -0.14707334339618683, -0.11538795381784439, -0.1341129094362259, -0.15978987514972687, -0.18778416514396667, -0.2038293480873108, -0.25516536831855774, -0.24493663012981415, -0.15004149079322815, -0.016140246763825417, -0.07177135348320007, -0.3963303565979004, -0.3779948353767395, -0.25783461332321167, -0.16094177961349487, -0.23505426943302155, -0.3541640043258667, -0.34247317910194397, -0.3881177306175232, -0.4593522846698761, -0.5756832957267761, -0.35765165090560913, -0.542741060256958, -0.4082295298576355, -0.4770561158657074, -0.17004281282424927, -0.27877169847488403, -0.15326324105262756, -0.4180527925491333, -0.27339401841163635, -0.23254677653312683, -0.29365968704223633, -0.33521631360054016, -0.3491170406341553, 
-0.18533602356910706, -0.29260891675949097, -0.44137561321258545, -0.6128101944923401, -0.731763482093811, -0.6580878496170044, -0.11427026987075806, -0.3944733738899231, -0.6505616903305054, -0.6488122344017029, -0.7484522461891174, -0.7040322422981262, -0.6145080924034119, -0.531133770942688, -0.5737754702568054, -0.6910640597343445, -0.6721180081367493, -0.8550227284431458, -0.7104114294052124, -0.6984644532203674, -0.8648133277893066, -1.0164130926132202, -1.0275567770004272, -1.1420173645019531, -1.068782925605774, -1.244425654411316, -1.302030086517334, -1.5661638975143433, -1.639020562171936, -1.697121500968933, -1.9838589429855347, -2.2957139015197754, -2.2596089839935303, -2.119849443435669, -2.2869279384613037, -2.358459711074829, -2.3582921028137207], spec_min: [-4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, 
-4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102, -4.999994277954102], 
spk_cond_steps: [], stop_token_weight: 5.0, task_cls: training.task.SVC_task.SVCTask, test_ids: [], test_input_dir: , 
test_num: 0, test_prefixes: ['test'], test_set_name: test, timesteps: 1000, train_set_name: train, 
use_crepe: True, use_denoise: False, use_energy_embed: False, use_gt_dur: False, use_gt_f0: False, 
use_midi: False, use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, 
use_spk_id: False, use_split_spk_id: False, use_uv: False, use_var_enc: False, use_vec: False, 
val_check_interval: 1000, valid_num: 0, valid_set_name: valid, validate: False, vocoder: network.vocoders.nsf_hifigan.NsfHifiGAN, 
vocoder_ckpt: checkpoints/nsf_hifigan/model, warmup_updates: 2000, wav2spec_eps: 1e-6, weight_decay: 0, win_size: 2048, 
work_dir: checkpoints/neer, 
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
| Load HifiGAN:  checkpoints/nsf_hifigan/model
Removing weight norm...
12/14 10:52:47 PM gpu available: True, used: True
Traceback (most recent call last):
  File "run.py", line 15, in <module>
    run_task()
  File "run.py", line 11, in run_task
    task_cls.start()
  File "/content/diff-svc/training/task/base_task.py", line 234, in start
    trainer.fit(task)
  File "/content/diff-svc/utils/pl_utils.py", line 487, in fit
    model.model = model.build_model()
  File "/content/diff-svc/training/task/fs2.py", line 75, in build_model
    self.load_ckpt(hparams['load_ckpt'], strict=True)
  File "/content/diff-svc/training/task/base_task.py", line 84, in load_ckpt
    utils.load_ckpt(self.__getattr__(current_model_name), ckpt_base_dir, current_model_name, force, strict)
  File "/content/diff-svc/utils/__init__.py", line 202, in load_ckpt
    cur_model.load_state_dict(state_dict, strict=strict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GaussianDiffusion:
    Unexpected key(s) in state_dict: "fs2.encoder.layers.0.op.layer_norm1.weight", "fs2.encoder.layers.0.op.layer_norm1.bias", "fs2.encoder.layers.0.op.self_attn.in_proj_weight", "fs2.encoder.layers.0.op.self_attn.out_proj.weight", "fs2.encoder.layers.0.op.layer_norm2.weight", "fs2.encoder.layers.0.op.layer_norm2.bias", "fs2.encoder.layers.0.op.ffn.ffn_1.weight", "fs2.encoder.layers.0.op.ffn.ffn_1.bias", "fs2.encoder.layers.0.op.ffn.ffn_2.weight", "fs2.encoder.layers.0.op.ffn.ffn_2.bias", "fs2.encoder.layers.1.op.layer_norm1.weight", "fs2.encoder.layers.1.op.layer_norm1.bias", "fs2.encoder.layers.1.op.self_attn.in_proj_weight", "fs2.encoder.layers.1.op.self_attn.out_proj.weight", "fs2.encoder.layers.1.op.layer_norm2.weight", "fs2.encoder.layers.1.op.layer_norm2.bias", "fs2.encoder.layers.1.op.ffn.ffn_1.weight", "fs2.encoder.layers.1.op.ffn.ffn_1.bias", "fs2.encoder.layers.1.op.ffn.ffn_2.weight", "fs2.encoder.layers.1.op.ffn.ffn_2.bias", "fs2.encoder.layers.2.op.layer_norm1.weight", "fs2.encoder.layers.2.op.layer_norm1.bias", "fs2.encoder.layers.2.op.self_attn.in_proj_weight", "fs2.encoder.layers.2.op.self_attn.out_proj.weight", "fs2.encoder.layers.2.op.layer_norm2.weight", "fs2.encoder.layers.2.op.layer_norm2.bias", "fs2.encoder.layers.2.op.ffn.ffn_1.weight", "fs2.encoder.layers.2.op.ffn.ffn_1.bias", "fs2.encoder.layers.2.op.ffn.ffn_2.weight", "fs2.encoder.layers.2.op.ffn.ffn_2.bias", "fs2.encoder.layers.3.op.layer_norm1.weight", "fs2.encoder.layers.3.op.layer_norm1.bias", "fs2.encoder.layers.3.op.self_attn.in_proj_weight", "fs2.encoder.layers.3.op.self_attn.out_proj.weight", "fs2.encoder.layers.3.op.layer_norm2.weight", "fs2.encoder.layers.3.op.layer_norm2.bias", "fs2.encoder.layers.3.op.ffn.ffn_1.weight", "fs2.encoder.layers.3.op.ffn.ffn_1.bias", "fs2.encoder.layers.3.op.ffn.ffn_2.weight", "fs2.encoder.layers.3.op.ffn.ffn_2.bias", "fs2.encoder.layer_norm.weight", "fs2.encoder.layer_norm.bias", "fs2.decoder.pos_embed_alpha", 
"fs2.decoder.embed_positions._float_tensor", "fs2.decoder.layers.0.op.layer_norm1.weight", "fs2.decoder.layers.0.op.layer_norm1.bias", "fs2.decoder.layers.0.op.self_attn.in_proj_weight", "fs2.decoder.layers.0.op.self_attn.out_proj.weight", "fs2.decoder.layers.0.op.layer_norm2.weight", "fs2.decoder.layers.0.op.layer_norm2.bias", "fs2.decoder.layers.0.op.ffn.ffn_1.weight", "fs2.decoder.layers.0.op.ffn.ffn_1.bias", "fs2.decoder.layers.0.op.ffn.ffn_2.weight", "fs2.decoder.layers.0.op.ffn.ffn_2.bias", "fs2.decoder.layers.1.op.layer_norm1.weight", "fs2.decoder.layers.1.op.layer_norm1.bias", "fs2.decoder.layers.1.op.self_attn.in_proj_weight", "fs2.decoder.layers.1.op.self_attn.out_proj.weight", "fs2.decoder.layers.1.op.layer_norm2.weight", "fs2.decoder.layers.1.op.layer_norm2.bias", "fs2.decoder.layers.1.op.ffn.ffn_1.weight", "fs2.decoder.layers.1.op.ffn.ffn_1.bias", "fs2.decoder.layers.1.op.ffn.ffn_2.weight", "fs2.decoder.layers.1.op.ffn.ffn_2.bias", "fs2.decoder.layers.2.op.layer_norm1.weight", "fs2.decoder.layers.2.op.layer_norm1.bias", "fs2.decoder.layers.2.op.self_attn.in_proj_weight", "fs2.decoder.layers.2.op.self_attn.out_proj.weight", "fs2.decoder.layers.2.op.layer_norm2.weight", "fs2.decoder.layers.2.op.layer_norm2.bias", "fs2.decoder.layers.2.op.ffn.ffn_1.weight", "fs2.decoder.layers.2.op.ffn.ffn_1.bias", "fs2.decoder.layers.2.op.ffn.ffn_2.weight", "fs2.decoder.layers.2.op.ffn.ffn_2.bias", "fs2.decoder.layers.3.op.layer_norm1.weight", "fs2.decoder.layers.3.op.layer_norm1.bias", "fs2.decoder.layers.3.op.self_attn.in_proj_weight", "fs2.decoder.layers.3.op.self_attn.out_proj.weight", "fs2.decoder.layers.3.op.layer_norm2.weight", "fs2.decoder.layers.3.op.layer_norm2.bias", "fs2.decoder.layers.3.op.ffn.ffn_1.weight", "fs2.decoder.layers.3.op.ffn.ffn_1.bias", "fs2.decoder.layers.3.op.ffn.ffn_2.weight", "fs2.decoder.layers.3.op.ffn.ffn_2.bias", "fs2.decoder.layer_norm.weight", "fs2.decoder.layer_norm.bias". 
    size mismatch for spec_min: copying a param with shape torch.Size([1, 1, 80]) from checkpoint, the shape in current model is torch.Size([1, 1, 128]).
    size mismatch for spec_max: copying a param with shape torch.Size([1, 1, 80]) from checkpoint, the shape in current model is torch.Size([1, 1, 128]).
    size mismatch for denoise_fn.input_projection.weight: copying a param with shape torch.Size([256, 80, 1]) from checkpoint, the shape in current model is torch.Size([384, 128, 1]).
    size mismatch for denoise_fn.input_projection.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
    size mismatch for denoise_fn.mlp.0.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([1536, 384]).
    size mismatch for denoise_fn.mlp.0.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
    size mismatch for denoise_fn.mlp.2.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([384, 1536]).
    size mismatch for denoise_fn.mlp.2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
    size mismatch for denoise_fn.residual_layers.0.dilated_conv.weight: copying a param with shape torch.Size([512, 256, 3]) from checkpoint, the shape in current model is torch.Size([768, 384, 3]).
    size mismatch for denoise_fn.residual_layers.0.dilated_conv.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).

... blah blah blah more of the same
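For anyone reading along: the traceback above boils down to two problems when `load_ckpt` calls `load_state_dict(strict=True)`. The checkpoint contains `fs2.*` keys the current model does not define, and the tensors that do match by name were saved from an 80-mel-bin / 256-channel model, while the `config_nsf.yaml` in the log builds a 128-bin (`audio_num_mel_bins: 128`) / 384-channel (`residual_channels: 384`) model. The snippet below is a hypothetical, stdlib-only sketch (not part of diff-svc) of the comparison that strict loading performs, using shapes copied from the log:

```python
def check_state_dict(model_shapes, ckpt_shapes):
    """Compare a checkpoint's key/shape map against a model's, the way
    strict state_dict loading does. Returns (unexpected, missing, mismatched)."""
    unexpected = [k for k in ckpt_shapes if k not in model_shapes]
    missing = [k for k in model_shapes if k not in ckpt_shapes]
    mismatched = [
        (k, ckpt_shapes[k], model_shapes[k])
        for k in ckpt_shapes
        if k in model_shapes and ckpt_shapes[k] != model_shapes[k]
    ]
    return unexpected, missing, mismatched


# Shapes taken from the traceback: the current model (from config_nsf.yaml)
# expects 128 mel bins and 384 residual channels...
model = {
    "spec_min": (1, 1, 128),
    "denoise_fn.input_projection.weight": (384, 128, 1),
}
# ...while the pretrain checkpoint was exported from an 80-bin / 256-channel
# model and carries extra fs2.* encoder/decoder weights.
ckpt = {
    "spec_min": (1, 1, 80),
    "denoise_fn.input_projection.weight": (256, 80, 1),
    "fs2.encoder.layer_norm.weight": (256,),
}

unexpected, missing, mismatched = check_state_dict(model, ckpt)
print("Unexpected keys:", unexpected)
print("Size mismatches:", mismatched)
```

Which suggests the pretrain checkpoint (`pretrain/nehito.ckpt` here) was produced under a different config than the one being trained; strict loading can only succeed when the checkpoint and the config agree on those dimensions.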
prophesier commented 1 year ago

If you are using Colab for training, it is best to ask the author of the Colab notebook, because I am not sure what modifications it has made to the source code. Alternatively, you can ask in the Discord channel linked on the homepage, where most Colab authors should be present.