padmalcom / Real-Time-Voice-Cloning-German

German model for https://github.com/CorentinJ/Real-Time-Voice-Cloning

Poor vocoder outcome #18

Open gabrielrdw20 opened 2 years ago

gabrielrdw20 commented 2 years ago

Hello, I am fairly new to this topic. I have two problems that I cannot find a solution for. I read the documentation and went through all the similar issues reported here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues but did not find any help there.

Short description: The encoder trained fine, and so did the synthesizer. The only problem is my vocoder. Vocoder training is very slow, and in the toolbox the predicted mel spectrograms do not match the target (although the wav files generated during vocoder training sound fine). Instead of human speech, the toolbox produces something close to pure noise.

Please take a look at the files: https://drive.google.com/drive/folders/1-SKYHRP8zy7vETqtMMJpKv1n7XKidBZL?usp=sharing

Longer description:

  1. I trained all 3 parts (encoder, synthesizer and vocoder), but the last one is quite problematic. I trained them all from scratch on 244 unique Polish speakers. I used the code uploaded to GitHub by @padmalcom and adjusted it to the Polish language. It looks like my vocoder is trained properly (this opinion is based on the wav files generated by the vocoder). However, when I load the models in demo_toolbox.py, the predicted mel spectrogram is not even near the target one. Do you know what could cause this?

  2. So far the vocoder has only done 14k iterations, which might be the issue. This part is going really slowly. Should it be like that? My PC has been running non-stop for 2 days and has reached only 14k iterations. I have an NVIDIA GeForce RTX 3060 Ti and have installed the latest release of CUDA.

Any idea what could have gone wrong? I would be grateful for any suggestions :)

padmalcom commented 2 years ago

Hi @gabrielrdw20, there are a few points you can check:

  1. How good is the outcome of your synthesizer model when you use the Griffin-Lim vocoder from the toolbox? Is the generated speech okay?
  2. Can you verify that the vocoder training uses your GPU?
  3. Did you change the symbols to the Polish alphabet (which contains some special characters as far as I know) for the synthesizer training?
  4. Did you try to use a pretrained vocoder from the original repository? I found out that using a vocoder trained on English language is absolutely okay for German sentences, as well.
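
For point 3, here is a minimal sketch of how you could check whether a symbol set covers the characters in your Polish transcripts. All names in it (`build_symbols`, `uncovered_characters`, the letter constants) are made up for illustration and are not taken from the repository; the actual symbol list used by the synthesizer may be defined differently, so treat this only as a way to spot missing characters.

```python
# Illustrative only: these constants and helpers are hypothetical and do not
# mirror the repository's actual symbol definitions.
BASE_LETTERS = "abcdefghijklmnopqrstuvwxyz"
POLISH_LETTERS = "ąćęłńóśźż"  # Polish letters with diacritics
PUNCTUATION = "!'(),-.:;? "


def build_symbols(extra_letters=""):
    """Return a set of allowed characters (lower case, upper case, punctuation)."""
    lower = BASE_LETTERS + extra_letters
    return set(lower) | set(lower.upper()) | set(PUNCTUATION)


def uncovered_characters(text, symbols):
    """Return the sorted list of characters in text that are not in symbols."""
    return sorted({ch for ch in text if ch not in symbols})


english_only = build_symbols()
with_polish = build_symbols(POLISH_LETTERS)

sample = "Zażółć gęślą jaźń."
# With the English-only set, the Polish diacritics show up as uncovered.
print(uncovered_characters(sample, english_only))
# With the Polish letters added, nothing in the sample should be left over.
print(uncovered_characters(sample, with_polish))
```

If you run a check like this over all your training transcripts and it reports characters that are missing from the symbol set the synthesizer was trained with, that would be worth fixing before looking further at the vocoder.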