prophesier / diff-svc

Singing Voice Conversion via diffusion model
GNU Affero General Public License v3.0

RuntimeError: stft requires the return_complex parameter #74

Open ryanlo713 opened 1 year ago

ryanlo713 commented 1 year ago
(python3.8) D:\application\diff-svc>python preprocessing/binarize.py --config training/config_nsf.yaml
| Hparams chains:  ['training/config_nsf.yaml']
| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, binarization_args: {'shuffle': False, 'with_align': True, 'with_f0': True, 'with_hubert': True, 'with_spk_embed': False, 'with_wav': False},
binarizer_cls: preprocessing.SVCpre.SVCBinarizer, binary_data_dir: data/binary/Ili_Union, check_val_every_n_epoch: 10, choose_test_manually: False, clip_grad_norm: 1,
config_path: training/config.yaml, content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2,
cwt_loss: l1, cwt_std_scale: 0.8, datasets: ['opencpop'], debug: False, dec_ffn_kernel_size: 9,
dec_layers: 4, decay_steps: 30000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet,
diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'],
dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 5, enc_ffn_kernel_size: 9, enc_layers: 4,
encoder_K: 8, encoder_type: fft, endless_ds: False, f0_bin: 256, f0_max: 1100.0,
f0_min: 40.0, ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000,
fmin: 40, fs2_ckpt: , gaussian_start: True, gen_dir_name: , gen_tgt_spk_id: -1,
hidden_size: 256, hop_size: 512, hubert_gpu: True, hubert_path: checkpoints/hubert/hubert_soft.pt, infer: False,
keep_bins: 128, lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 1.0, lambda_ph_dur: 0.3,
lambda_sent_dur: 1.0, lambda_uv: 1.0, lambda_word_dur: 1.0, load_ckpt: , log_interval: 100,
loud_norm: False, lr: 0.0008, max_beta: 0.02, max_epochs: 3000, max_eval_sentences: 1,
max_eval_tokens: 60000, max_frames: 42000, max_input_tokens: 60000, max_sentences: 88, max_tokens: 128000,
max_updates: 1000000, mel_loss: ssim:0.5|l1:0.5, mel_vmax: 1.5, mel_vmin: -6.0, min_level_db: -120,
norm_type: gn, num_ckpt_keep: 10, num_heads: 2, num_sanity_val_steps: 1, num_spk: 1,
num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False,
pe_ckpt: checkpoints/0102_xiaoma_pe/model_ckpt_steps_60000.ckpt, pe_enable: False, perform_enhance: True, pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'],
pitch_extractor: parselmouth, pitch_loss: l2, pitch_norm: log, pitch_type: frame, pndm_speedup: 10,
pre_align_args: {'allow_no_txt': False, 'denoise': False, 'forced_align': 'mfa', 'txt_processor': 'zh_g2pM', 'use_sox': True, 'use_tone': False}, pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.1, predictor_hidden: -1,
predictor_kernel: 5, predictor_layers: 5, prenet_dropout: 0.5, prenet_hidden_size: 256, pretrain_fs_ckpt: ,
processed_data_dir: xxx, profile_infer: False, raw_data_dir: data/raw/Ili_Union, ref_norm_layer: bn, rel_pos: True,
reset_phone_dict: True, residual_channels: 384, residual_layers: 20, save_best: False, save_ckpt: True,
save_codes: ['configs', 'modules', 'src', 'utils'], save_f0: True, save_gt: False, schedule_type: linear, seed: 1234,
sort_by_len: True, speaker_id: Ili_Union, spec_max: [0.0], spec_min: [-5.0], spk_cond_steps: [],
stop_token_weight: 5.0, task_cls: training.task.SVC_task.SVCTask, test_ids: [], test_input_dir: , test_num: 0,
test_prefixes: ['test'], test_set_name: test, timesteps: 1000, train_set_name: train, use_crepe: True,
use_denoise: False, use_energy_embed: False, use_gt_dur: False, use_gt_f0: False, use_midi: False,
use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False,
use_split_spk_id: False, use_uv: False, use_var_enc: False, use_vec: False, val_check_interval: 2000,
valid_num: 0, valid_set_name: valid, validate: False, vocoder: network.vocoders.nsf_hifigan.NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/g_00105000,
warmup_updates: 2000, wav2spec_eps: 1e-6, weight_decay: 0, win_size: 2048, work_dir: ,

| Binarizer:  <class 'preprocessing.SVCpre.SVCBinarizer'>
spkers:  {'Ili_Union'}
| spk_map:  {'Ili_Union': 0}
  0%|                                                                                            | 0/5 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "preprocessing/binarize.py", line 20, in <module>
    binarize()
  File "preprocessing/binarize.py", line 15, in binarize
    binarizer_cls().process()
  File "D:\application\diff-svc\preprocessing\base_binarizer.py", line 135, in process
    self.process_data_split('valid')
  File "D:\application\diff-svc\preprocessing\base_binarizer.py", line 156, in process_data_split
    item = self.process_item(*a)
  File "D:\application\diff-svc\preprocessing\base_binarizer.py", line 194, in process_item
    return File2Batch.temporary_dict2processed_input(item_name, meta_data, self.phone_encoder, binarization_args)
  File "D:\application\diff-svc\preprocessing\process_pipeline.py", line 112, in temporary_dict2processed_input
    wav, mel = VOCODERS[hparams['vocoder'].split('.')[-1]].wav2spec(temp_dict['wav_fn'])
  File "D:\application\diff-svc\network\vocoders\nsf_hifigan.py", line 89, in wav2spec
    mel_torch = stft.get_mel(wav_torch.unsqueeze(0).to(device)).squeeze(0).T
  File "D:\application\diff-svc\modules\nsf_hifigan\nvSTFT.py", line 95, in get_mel
    spec = torch.stft(y, n_fft, hop_length=hop_length, win_length=win_size, window=self.hann_window[str(y.device)],
  File "D:\application\diff-svc\python3.8\lib\site-packages\torch\functional.py", line 641, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: stft requires the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release.

hugomjp28 commented 1 year ago

I started getting this error today and fixed it on my fork:

https://github.com/hugomjp28/diff-svc/commit/e9bdd231ad2d0636863e4b561676959cc2bc450a
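
For anyone who cannot open the commit, here is a minimal sketch of one common way to resolve this error. It is not a copy of the linked change. Newer PyTorch releases require `return_complex` to be passed explicitly when `torch.stft` receives a real-valued input. Passing `return_complex=True` and then calling `torch.view_as_real` gives back the older real/imaginary layout in the last dimension, so magnitude code written for the legacy output should keep working. The function name `magnitude_spectrogram`, its default arguments, and the `eps` value are assumptions for illustration; they are not taken from `nvSTFT.py`.

```python
import torch


def magnitude_spectrogram(y, n_fft=2048, hop_length=512, win_size=2048, eps=1e-9):
    """Magnitude STFT of a real signal `y` with shape (batch, samples).

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    window = torch.hann_window(win_size, device=y.device)
    # Real inputs need an explicit return_complex argument, otherwise
    # torch.stft raises the RuntimeError shown in the traceback.
    spec = torch.stft(
        y,
        n_fft,
        hop_length=hop_length,
        win_length=win_size,
        window=window,
        center=True,
        pad_mode="reflect",
        normalized=False,
        onesided=True,
        return_complex=True,
    )
    # Convert the complex tensor to a real tensor whose last axis holds
    # (real, imaginary), matching the legacy torch.stft output layout.
    spec = torch.view_as_real(spec)
    # Magnitude: square root of real^2 + imag^2, with eps for stability.
    return torch.sqrt(spec.pow(2).sum(-1) + eps)


if __name__ == "__main__":
    wav = torch.randn(1, 44100)  # one second of noise at 44.1 kHz
    mag = magnitude_spectrogram(wav)
    print(mag.shape)
```

In the repository, the equivalent change would presumably go on the `torch.stft` call in `modules/nsf_hifigan/nvSTFT.py` (line 95 in the traceback), keeping its existing arguments and adding `return_complex=True` plus the `torch.view_as_real` conversion before the magnitude is computed.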