prophesier / diff-svc

Singing Voice Conversion via diffusion model
GNU Affero General Public License v3.0

Running for the first time and I got this error #310

Open jxggx opened 1 year ago

jxggx commented 1 year ago

D:\AI\diff-svc>python preprocessing/binarize.py --config training/config_nsf.yaml
| Hparams chains: ['training/config_nsf.yaml']
| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, binarization_args: {'shuffle': False, 'with_align': True, 'with_f0': True, 'with_hubert': True, 'with_spk_embed': False, 'with_wav': False}, binarizer_cls: preprocessing.SVCpre.SVCBinarizer, binary_data_dir: data/binary/nseebmytalk, check_val_every_n_epoch: 10, choose_test_manually: False, clip_grad_norm: 1, config_path: training/config_nsf.yaml, content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1, cwt_std_scale: 0.8, datasets: ['opencpop'], debug: False, dec_ffn_kernel_size: 9, dec_layers: 4, decay_steps: 40000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet, diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 5, enc_ffn_kernel_size: 9, enc_layers: 4, encoder_K: 8, encoder_type: fft, endless_ds: False, f0_bin: 256, f0_max: 1100.0, f0_min: 40.0, ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000, fmin: 40, fs2_ckpt: , gaussian_start: True, gen_dir_name: , gen_tgt_spk_id: -1, hidden_size: 256, hop_size: 512, hubert_gpu: True, hubert_path: checkpoints/hubert/hubert_soft.pt, infer: False, keep_bins: 128, lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 1.0, lambda_ph_dur: 0.3, lambda_sent_dur: 1.0, lambda_uv: 1.0, lambda_word_dur: 1.0, load_ckpt: D:\AI\diff-svc\checkpoints\nsf_hifigan, log_interval: 100, loud_norm: False, lr: 0.0008, max_beta: 0.02, max_epochs: 3000, max_eval_sentences: 1, max_eval_tokens: 60000, max_frames: 42000, max_input_tokens: 60000, max_sentences: 88, max_tokens: 128000, max_updates: 1000000, mel_loss: ssim:0.5|l1:0.5, mel_vmax: 1.5, mel_vmin: -6.0, min_level_db: -120, no_fs2: True, norm_type: gn,
num_ckpt_keep: 10, num_heads: 2, num_sanity_val_steps: 1, num_spk: 1, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False, pe_ckpt: checkpoints/0102_xiaoma_pe/model_ckpt_steps_60000.ckpt, pe_enable: False, perform_enhance: True, pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], pitch_extractor: parselmouth, pitch_loss: l2, pitch_norm: log, pitch_type: frame, pndm_speedup: 10, pre_align_args: {'allow_no_txt': False, 'denoise': False, 'forced_align': 'mfa', 'txt_processor': 'zh_g2pM', 'use_sox': True, 'use_tone': False}, pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.1, predictor_hidden: -1, predictor_kernel: 5, predictor_layers: 5, prenet_dropout: 0.5, prenet_hidden_size: 256, pretrain_fs_ckpt: , processed_data_dir: xxx, profile_infer: False, raw_data_dir: data/raw/nseebmytalk, ref_norm_layer: bn, rel_pos: True, reset_phone_dict: True, residual_channels: 384, residual_layers: 20, save_best: False, save_ckpt: True, save_codes: ['configs', 'modules', 'src', 'utils'], save_f0: True, save_gt: False, schedule_type: linear, seed: 1234, sort_by_len: True, speaker_id: nseebmytalk, spec_max: [0.0], spec_min: [-5.0], spk_cond_steps: [], stop_token_weight: 5.0, task_cls: training.task.SVC_task.SVCTask, test_ids: [], test_input_dir: , test_num: 0, test_prefixes: ['test'], test_set_name: test, timesteps: 1000, train_set_name: train, use_crepe: True, use_denoise: False, use_energy_embed: False, use_gt_dur: False, use_gt_f0: False, use_midi: False, use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_uv: False, use_var_enc: False, use_vec: False, val_check_interval: 2000, valid_num: 0, valid_set_name: valid, validate: False, vocoder: network.vocoders.nsf_hifigan.NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/model, warmup_updates: 2000, wav2spec_eps: 1e-6,
weight_decay: 0, win_size: 2048, work_dir: ,
| Binarizer: <class 'preprocessing.SVCpre.SVCBinarizer'>
spkers: {'nseebmytalk'}
| spk_map: {'nseebmytalk': 0}
0%| | 0/5 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "D:\AI\diff-svc\preprocessing\binarize.py", line 20, in <module>
    binarize()
  File "D:\AI\diff-svc\preprocessing\binarize.py", line 15, in binarize
    binarizer_cls().process()
  File "D:\AI\diff-svc\preprocessing\base_binarizer.py", line 135, in process
    self.process_data_split('valid')
  File "D:\AI\diff-svc\preprocessing\base_binarizer.py", line 156, in process_data_split
    item = self.process_item(*a)
  File "D:\AI\diff-svc\preprocessing\base_binarizer.py", line 194, in process_item
    return File2Batch.temporary_dict2processed_input(item_name, meta_data, self.phone_encoder, binarization_args)
  File "D:\AI\diff-svc\preprocessing\process_pipeline.py", line 112, in temporary_dict2processed_input
    wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(temp_dict['wav_fn'])
  File "D:\AI\diff-svc\network\vocoders\nsf_hifigan.py", line 89, in wav2spec
    mel_torch = stft.get_mel(wav_torch.unsqueeze(0).to(device)).squeeze(0).T
  File "D:\AI\diff-svc\modules\nsf_hifigan\nvSTFT.py", line 100, in get_mel
    spec = torch.matmul(self.melbasis[str(fmax)+''+str(y.device)], spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x1025 and 1x1025)

I am new to this and don't know where I went wrong or how to fix it.
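For anyone reading the traceback, here is a minimal sketch of what the final `RuntimeError` is saying. It does not diagnose the cause. The numbers 128 and 2048 come from `audio_num_mel_bins` and `fft_size` in the log, and a one-sided FFT of size 2048 has 2048 / 2 + 1 = 1025 frequency bins. The helper `matmul_result_shape`, the shape tuples, and the frame count of 200 are hypothetical, written only for illustration and not taken from diff-svc. The sketch assumes the mel filterbank is 128x1025, so the spectrogram it multiplies would need 1025 rows (frequency bins), whereas the one in the error has only 1 row.

```python
def matmul_result_shape(a_shape, b_shape):
    """Return the shape of a 2-D matrix product, or raise if the inner sizes differ.

    Hypothetical helper for illustration; it only mimics the 2-D shape rule.
    """
    (m, k1), (k2, n) = a_shape, b_shape
    if k1 != k2:
        raise ValueError(f"shapes cannot be multiplied ({m}x{k1} and {k2}x{n})")
    return (m, n)


N_MELS = 128             # audio_num_mel_bins in the log
N_FREQ = 2048 // 2 + 1   # fft_size 2048 in the log gives 1025 one-sided bins

mel_basis = (N_MELS, N_FREQ)   # assumed filterbank shape: 128x1025
expected_spec = (N_FREQ, 200)  # 1025 bins by an arbitrary 200 frames
observed_spec = (1, N_FREQ)    # the second shape reported in the error

print(matmul_result_shape(mel_basis, expected_spec))
try:
    matmul_result_shape(mel_basis, observed_spec)
except ValueError as err:
    print(err)
```

If this reading is right, the thing to check would be the shape of the tensor passed to `get_mel` for the file being processed, for example by printing `wav_torch.shape` just before the call in `wav2spec`.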