marcoppasini / MelGAN-VC

MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms
MIT License

Would you please explain the inference process and what GRAD does? #2

Closed (youngsuenXMLY closed 4 years ago)

youngsuenXMLY commented 4 years ago

Hi, I read the code, but I am confused about two points in the inference part:

  1. Only one wav file is fed into the inference process. I would expect two input wav files: one source and one target. Would you please explain this?
  2. GRAD is really hard to understand; please give me some guidance. Looking forward to your reply. Thank you!
marcoppasini commented 4 years ago

Hi!

  1. You must train the model with only one target style domain (voice or music genre), so at inference you don't need to feed a target sample to the model.
  2. The GRAD function is a gradient-based method for turning spectrograms back into waveforms, which I find works better than the traditional Griffin-Lim algorithm. You can read the cited paper for more specific info.
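
To make the idea in point 2 concrete, here is a minimal sketch of gradient-based spectrogram inversion in general. It is not the repository's GRAD function. It starts from noise and adjusts the samples by gradient descent until their power spectrogram matches a target. The frame size, hop, rectangular window, backtracking step rule, and all function names are assumptions chosen to keep the example small and dependency-free. A practical version would more likely use a mel spectrogram and a framework's automatic differentiation.

```python
import cmath
import math
import random

FRAME = 16  # samples per analysis frame (illustrative value)
HOP = 4     # hop between frames (illustrative value)
BINS = FRAME // 2 + 1  # non-redundant DFT bins for a real signal

# Precomputed DFT basis e^{-2*pi*i*k*n/FRAME}.
BASIS = [[cmath.exp(-2j * math.pi * k * n / FRAME) for n in range(FRAME)]
         for k in range(BINS)]


def frame_starts(num_samples):
    return range(0, num_samples - FRAME + 1, HOP)


def stft(signal):
    """Complex STFT with a rectangular window: a list of frames of bins."""
    return [[sum(signal[s + n] * BASIS[k][n] for n in range(FRAME))
             for k in range(BINS)]
            for s in frame_starts(len(signal))]


def power_spectrogram(signal):
    return [[abs(c) ** 2 for c in frame] for frame in stft(signal)]


def loss_and_grad(signal, target):
    """Squared error between power spectrograms and its gradient w.r.t. the samples."""
    spec = stft(signal)
    loss = 0.0
    grad = [0.0] * len(signal)
    for t, s in enumerate(frame_starts(len(signal))):
        for k in range(BINS):
            c = spec[t][k]
            err = abs(c) ** 2 - target[t][k]
            loss += err ** 2
            for n in range(FRAME):
                # d|c|^2/dx_n = 2*Re(conj(c) * basis); chain rule through err^2.
                grad[s + n] += 4.0 * err * (c.conjugate() * BASIS[k][n]).real
    return loss, grad


def invert(target, num_samples, steps=200, seed=0):
    """Gradient descent from noise; a step is kept only if it lowers the loss."""
    rng = random.Random(seed)
    x = [rng.uniform(-0.1, 0.1) for _ in range(num_samples)]
    loss, grad = loss_and_grad(x, target)
    initial_loss = loss
    step = 1e-3
    for _ in range(steps):
        candidate = [xi - step * gi for xi, gi in zip(x, grad)]
        new_loss, new_grad = loss_and_grad(candidate, target)
        if new_loss < loss:
            x, loss, grad = candidate, new_loss, new_grad
            step *= 1.5  # accepted: try a slightly larger step next time
        else:
            step *= 0.5  # rejected: shrink the step and retry
    return x, initial_loss, loss


# Toy target: the power spectrogram of a short sine wave.
reference = [math.sin(2 * math.pi * 3 * n / FRAME) for n in range(64)]
target = power_spectrogram(reference)
estimate, initial_loss, final_loss = invert(target, len(reference))
print(f"spectrogram loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

Because the spectrogram discards phase, the recovered samples need not equal the reference (a sign-flipped signal has the same power spectrogram). The optimization only matches the spectrogram. Griffin-Lim addresses the same problem by alternating between STFT and inverse STFT projections rather than following a loss gradient.
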
CarolinGao commented 4 years ago

Hi! Have you tried the WaveNet vocoder for turning spectrograms back into waveforms?

youngsuenXMLY commented 4 years ago

@CarolinGao Hi, I haven't followed this repo any further since this issue. MelGAN-VC works well for one-to-one VC, but I need any-to-many or any-to-any VC, which is more widely applicable.