Open ryanlo713 opened 1 year ago
(python3.8) D:\application\diff-svc>python preprocessing/binarize.py --config training/config_nsf.yaml | Hparams chains: ['training/config_nsf.yaml'] | Hparams: K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, binarization_args: {'shuffle': False, 'with_align': True, 'with_f0': True, 'with_hubert': True, 'with_spk_embed': False, 'with_wav': False}, binarizer_cls: preprocessing.SVCpre.SVCBinarizer, binary_data_dir: data/binary/Ili_Union, check_val_every_n_epoch: 10, choose_test_manually: False, clip_grad_norm: 1, config_path: training/config.yaml, content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1, cwt_std_scale: 0.8, datasets: ['opencpop'], debug: False, dec_ffn_kernel_size: 9, dec_layers: 4, decay_steps: 30000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet, diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 5, enc_ffn_kernel_size: 9, enc_layers: 4, encoder_K: 8, encoder_type: fft, endless_ds: False, f0_bin: 256, f0_max: 1100.0, f0_min: 40.0, ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000, fmin: 40, fs2_ckpt: , gaussian_start: True, gen_dir_name: , gen_tgt_spk_id: -1, hidden_size: 256, hop_size: 512, hubert_gpu: True, hubert_path: checkpoints/hubert/hubert_soft.pt, infer: False, keep_bins: 128, lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 1.0, lambda_ph_dur: 0.3, lambda_sent_dur: 1.0, lambda_uv: 1.0, lambda_word_dur: 1.0, load_ckpt: , log_interval: 100, loud_norm: False, lr: 0.0008, max_beta: 0.02, max_epochs: 3000, max_eval_sentences: 1, max_eval_tokens: 60000, max_frames: 42000, max_input_tokens: 60000, max_sentences: 88, max_tokens: 128000, max_updates: 1000000, mel_loss: ssim:0.5|l1:0.5, mel_vmax: 1.5, mel_vmin: -6.0, min_level_db: -120, norm_type: gn, num_ckpt_keep: 10, num_heads: 2, 
num_sanity_val_steps: 1, num_spk: 1, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False, pe_ckpt: checkpoints/0102_xiaoma_pe/model_ckpt_steps_60000.ckpt, pe_enable: False, perform_enhance: True, pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], pitch_extractor: parselmouth, pitch_loss: l2, pitch_norm: log, pitch_type: frame, pndm_speedup: 10, pre_align_args: {'allow_no_txt': False, 'denoise': False, 'forced_align': 'mfa', 'txt_processor': 'zh_g2pM', 'use_sox': True, 'use_tone': False}, pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.1, predictor_hidden: -1, predictor_kernel: 5, predictor_layers: 5, prenet_dropout: 0.5, prenet_hidden_size: 256, pretrain_fs_ckpt: , processed_data_dir: xxx, profile_infer: False, raw_data_dir: data/raw/Ili_Union, ref_norm_layer: bn, rel_pos: True, reset_phone_dict: True, residual_channels: 384, residual_layers: 20, save_best: False, save_ckpt: True, save_codes: ['configs', 'modules', 'src', 'utils'], save_f0: True, save_gt: False, schedule_type: linear, seed: 1234, sort_by_len: True, speaker_id: Ili_Union, spec_max: [0.0], spec_min: [-5.0], spk_cond_steps: [], stop_token_weight: 5.0, task_cls: training.task.SVC_task.SVCTask, test_ids: [], test_input_dir: , test_num: 0, test_prefixes: ['test'], test_set_name: test, timesteps: 1000, train_set_name: train, use_crepe: True, use_denoise: False, use_energy_embed: False, use_gt_dur: False, use_gt_f0: False, use_midi: False, use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_uv: False, use_var_enc: False, use_vec: False, val_check_interval: 2000, valid_num: 0, valid_set_name: valid, validate: False, vocoder: network.vocoders.nsf_hifigan.NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/g_00105000, warmup_updates: 2000, wav2spec_eps: 1e-6, weight_decay: 0, win_size: 2048, 
work_dir: ,
| Binarizer: <class 'preprocessing.SVCpre.SVCBinarizer'>
spkers: {'Ili_Union'}
| spk_map: {'Ili_Union': 0}
0%| | 0/5 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "preprocessing/binarize.py", line 20, in <module>
    binarize()
  File "preprocessing/binarize.py", line 15, in binarize
    binarizer_cls().process()
  File "D:\application\diff-svc\preprocessing\base_binarizer.py", line 135, in process
    self.process_data_split('valid')
  File "D:\application\diff-svc\preprocessing\base_binarizer.py", line 156, in process_data_split
    item = self.process_item(*a)
  File "D:\application\diff-svc\preprocessing\base_binarizer.py", line 194, in process_item
    return File2Batch.temporary_dict2processed_input(item_name, meta_data, self.phone_encoder, binarization_args)
  File "D:\application\diff-svc\preprocessing\process_pipeline.py", line 112, in temporary_dict2processed_input
    wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(temp_dict['wav_fn'])
  File "D:\application\diff-svc\network\vocoders\nsf_hifigan.py", line 89, in wav2spec
    mel_torch = stft.get_mel(wav_torch.unsqueeze(0).to(device)).squeeze(0).T
  File "D:\application\diff-svc\modules\nsf_hifigan\nvSTFT.py", line 95, in get_mel
    spec = torch.stft(y, n_fft, hop_length=hop_length, win_length=win_size, window=self.hann_window[str(y.device)],
  File "D:\application\diff-svc\python3.8\lib\site-packages\torch\functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.
I started getting this error today and fixed it on my fork:
https://github.com/hugomjp28/diff-svc/commit/e9bdd231ad2d0636863e4b561676959cc2bc450a
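For anyone who cannot pull that commit, here is a minimal sketch of one common way to satisfy the `return_complex` requirement named in the error. It is not taken from the linked commit or from `nvSTFT.py`. The function name `magnitude_spectrogram` is hypothetical, the defaults mirror `fft_size`, `hop_size` and `win_size` from the config dump above, and `center=False` and the `1e-9` epsilon are assumptions. It also assumes the code after the `torch.stft` call expects the older real-valued layout with a trailing dimension of size 2, which `torch.view_as_real` restores.

```python
import torch


def magnitude_spectrogram(y, n_fft=2048, hop_length=512, win_size=2048):
    """Hypothetical helper: magnitude STFT of a real signal shaped (batch, time)."""
    window = torch.hann_window(win_size, device=y.device)
    # Passing return_complex explicitly avoids the RuntimeError for real inputs.
    spec = torch.stft(
        y,
        n_fft,
        hop_length=hop_length,
        win_length=win_size,
        window=window,
        center=False,  # assumption: the caller has already padded the signal
        return_complex=True,
    )
    # Convert the complex tensor to the old real layout: (..., freq, frames, 2).
    spec = torch.view_as_real(spec)
    # Magnitude from the real and imaginary parts; the epsilon keeps sqrt stable.
    return torch.sqrt(spec.pow(2).sum(-1) + 1e-9)


# One second of noise at 44.1 kHz stands in for a real waveform.
y = torch.randn(1, 44100)
mag = magnitude_spectrogram(y)
print(mag.shape)
```

If the surrounding code only needs magnitudes, calling `.abs()` on the complex result would be an equivalent shortcut apart from the epsilon.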