prophesier / diff-svc

Singing Voice Conversion via diffusion model
GNU Affero General Public License v3.0

Error with input and output sizes #17

Open dillfrescott opened 1 year ago

dillfrescott commented 1 year ago

Yesterday I was running inference fine, but now it just throws this error:

load chunks from temp
#=====segment start, 7.569s======
jump empty segment
#=====segment start, 27.321s======
load temp crepe f0
executing 'get_pitch' costed 0.045s
hubert (on cuda) time used 0.7498652935028076
sample time step: 100%|██████████| 50/50 [00:03<00:00, 13.22it/s]
executing 'diff_infer' costed 3.807s
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-ae1ee0a3d8df> in <module>
      6 wav_gen='michael2.wav'
      7 f0_tst, f0_pred, audio = run_clip(svc_model,file_path=wav_fn, key=key, acc=pndm_speedup, use_crepe=True, use_pe=True, thre=0.05,
----> 8                                         use_gt_mel=False, add_noise_step=500,project_name=project_name,out_path=wav_gen)

9 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in interpolate(input, size, scale_factor, mode, align_corners, recompute_scale_factor, antialias)
   3906 
   3907     if input.dim() == 3 and mode == "nearest":
-> 3908         return torch._C._nn.upsample_nearest1d(input, output_size, scale_factors)
   3909     if input.dim() == 4 and mode == "nearest":
   3910         return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)

RuntimeError: Input and output sizes should be greater than 0, but got input (W: 0) and output (W: 0)
dillfrescott commented 1 year ago

I think it might have something to do with training on 44.1 kHz audio.
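
For what it's worth, the traceback bottoms out in the nearest-neighbour upsampling of the f0 contour inside the NSF-HiFiGAN vocoder, and that call fails in exactly this way when the f0 tensor it receives has zero length. A minimal sketch that reproduces the same RuntimeError (the scale factor below is illustrative, not necessarily the repo's actual hop size):

```python
import torch
import torch.nn.functional as F

# Hypothetical reproduction: an empty f0 contour (width 0) passed through the
# same kind of nearest-neighbour upsampling that modules/hifigan/hifigan.py
# applies via self.f0_upsamp(f0[:, None]).
f0 = torch.zeros(1, 1, 0)  # shape (batch, channels, W=0): no f0 frames at all
F.interpolate(f0, scale_factor=512, mode="nearest")
# RuntimeError: Input and output sizes should be greater than 0,
# but got input (W: 0) and output (W: 0)
```

If that is the failure mode here, the real question is why the predicted f0 for this segment ends up with zero frames in the first place.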

prophesier commented 1 year ago

I'm not quite sure what caused this problem, and I haven't seen errors like this in our development group, but I can give you some suggestions. First, you could try a fresh Python 3.8 environment to see if it works. Second, 44.1 kHz is an experimental sample rate right now, and the 44.1 kHz vocoder has not been released yet; training a model with that setting may cause errors.
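
A quick way to act on the second point is to compare the sample rate of the input audio against the rate the model was configured and trained for. A sketch, assuming soundfile is available and the training config stores the rate under audio_sample_rate (the paths are illustrative):

```python
import soundfile as sf
import yaml

# Illustrative paths -- substitute the actual input wav and the config used for training.
wav_path = "raw/input.wav"
config_path = "training/config.yaml"

info = sf.info(wav_path)
with open(config_path) as f:
    hparams = yaml.safe_load(f)

print("input wav sample rate :", info.samplerate)
print("configured sample rate:", hparams.get("audio_sample_rate"))
# Per the comment above, the 44.1 kHz vocoder has not been released yet, so a
# 44100 in either place would line up with the warning about experimental
# 44.1 kHz training.
```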

dillfrescott commented 1 year ago

Ah, gotcha. I'll have to wait to use 44.1 kHz then :/

dillfrescott commented 1 year ago

@prophesier I'm still getting the exact same error. Are we any closer to figuring out why this is happening?

Mixomo commented 1 year ago

same here!

Mixomo commented 1 year ago

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
     39 use_gt_mel= False
     40 
---> 41 f0_tst, f0_pred, audio = run_clip(svc_model,file_path=wav_fn, key=key, acc=pndm_speedup, use_crepe=use_crepe, use_pe=use_pe, thre=thre,
     42                                   use_gt_mel=use_gt_mel, add_noise_step=add_noise_step,project_name=project_name,out_path=wav_gen)
     43 

9 frames
/content/diff-svc/infer.py in run_clip(svc_model, key, acc, use_pe, use_crepe, thre, use_gt_mel, add_noise_step, project_name, f_name, file_path, out_path, slice_db, **kwargs)
     57                                 np.zeros(length))
     58         else:
---> 59             _f0_tst, _f0_pred, _audio = svc_model.infer(raw_path, key=key, acc=acc, use_pe=use_pe, use_crepe=use_crepe,
     60                                                         thre=thre, use_gt_mel=use_gt_mel, add_noise_step=add_noise_step)
     61             fix_audio = np.zeros(length)

/content/diff-svc/infer_tools/infer_tool.py in infer(self, in_path, key, acc, use_pe, use_crepe, thre, singer, **kwargs)
    165         else:
    166             batch['f0_pred'] = outputs.get('f0_denorm')
--> 167         return self.after_infer(batch, singer, in_path)
    168 
    169     @timeit

/content/diff-svc/infer_tools/infer_tool.py in run(*args, **kwargs)
     60     def run(*args, **kwargs):
     61         t = time.time()
---> 62         res = func(*args, **kwargs)
     63         print('executing \'%s\' costed %.3fs' % (func.__name__, time.time() - t))
     64         return res

/content/diff-svc/infer_tools/infer_tool.py in after_infer(self, prediction, singer, in_path)
    197             np.save(mel_path, mel_pred)
    198             np.save(f0_path, f0_pred)
--> 199         wav_pred = self.vocoder.spec2wav(mel_pred, f0=f0_pred)
    200         return f0_gt, f0_pred, wav_pred
    201 

/content/diff-svc/network/vocoders/hifigan.py in spec2wav(self, mel, **kwargs)
     68             if f0 is not None and hparams.get('use_nsf'):
     69                 f0 = torch.FloatTensor(f0[None, :]).to(device)
---> 70                 y = self.model(c, f0).view(-1)
     71             else:
     72                 y = self.model(c).view(-1)

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/content/diff-svc/modules/hifigan/hifigan.py in forward(self, x, f0)
    145         if f0 is not None:
    146             # harmonic-source signal, noise-source signal, uv flag
--> 147             f0 = self.f0_upsamp(f0[:, None]).transpose(1, 2)
    148             har_source, noi_source, uv = self.m_source(f0)
    149             har_source = har_source.transpose(1, 2)

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/upsampling.py in forward(self, input)
    151 
    152     def forward(self, input: Tensor) -> Tensor:
--> 153         return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners,
    154                              recompute_scale_factor=self.recompute_scale_factor)
    155 

/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py in interpolate(input, size, scale_factor, mode, align_corners, recompute_scale_factor, antialias)
   3906 
   3907     if input.dim() == 3 and mode == "nearest":
-> 3908         return torch._C._nn.upsample_nearest1d(input, output_size, scale_factors)
   3909     if input.dim() == 4 and mode == "nearest":
   3910         return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)

RuntimeError: Input and output sizes should be greater than 0, but got input (W: 0) and output (W: 0)
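
This trace shows the same chain as the first one: after_infer passes f0_pred to spec2wav, which hands it to the NSF-HiFiGAN forward pass, where the width-0 upsampling of f0 fails. A hypothetical one-line debug aid to confirm that, placed just before the failing call in infer_tools/infer_tool.py:

```python
# Hypothetical debug print, inserted immediately before
#     wav_pred = self.vocoder.spec2wav(mel_pred, f0=f0_pred)
# in after_infer, to check whether the vocoder is handed an empty f0 contour:
print("mel_pred shape:", mel_pred.shape, "| f0_pred shape:", f0_pred.shape)
# An f0_pred shape of (0,) would match the "input (W: 0)" in the RuntimeError.
```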

Mixomo commented 1 year ago

(I am training at 24 kHz.)