My guess is that this is due to an insufficient amount of training data. In my experience, you generally need at least 1000 wav files to get good results.
What about the length of each wav file? Does it matter? Mine were 8 to 20 seconds each
I usually use 2s-10s.
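A quick way to sanity-check a dataset against those numbers is a sketch like the one below, assuming a flat folder of plain PCM wav files (the `dataset/` path and the 2-10s thresholds are placeholders to adjust):

```python
import pathlib
import wave

# Placeholder dataset location; point this at your own clips.
DATASET_DIR = pathlib.Path("dataset")
MIN_SEC, MAX_SEC = 2.0, 10.0

durations = []
for path in sorted(DATASET_DIR.glob("*.wav")):
    # wave only reads plain PCM wav files; convert anything else first.
    with wave.open(str(path), "rb") as w:
        dur = w.getnframes() / w.getframerate()
    durations.append(dur)
    if not (MIN_SEC <= dur <= MAX_SEC):
        print(f"outside {MIN_SEC:.0f}-{MAX_SEC:.0f}s: {path.name} ({dur:.1f}s)")

print(f"{len(durations)} clips, {sum(durations) / 60:.1f} minutes total")
```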
I ran another training session with ~970 clips and got the same issue. This is what the result sounds like after "Saving model and optimizer state at iteration 247 to ./logs/32k/G_20000.pth":
https://github.com/devNegative-asm/various_files/blob/main/changed_shifted-5.wav
Maybe I'm not using speaker embeddings correctly? I haven't used the Gradio UI, so maybe that handles it better. I'll close this issue if I find a solution, but otherwise it seems to be a similar problem to https://github.com/innnky/so-vits-svc/issues/95.
OK, I think I figured it out. The pitch slider in the UI (or trans in the inference script) needs to bring the input voice well within the vocal range of the target speaker. Not doing that adds the robotic sound:
positive numbers: increase pitch
negative numbers: decrease pitch
I was using 0, but +12 ended up working a lot better for male -> female conversion.
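For anyone else tuning this: the value is in semitones, so a shift of n multiplies F0 by 2^(n/12), and +12 is exactly one octave up. Here's a rough sketch for picking a starting value by comparing the median F0 of an input clip and a target-speaker clip; it assumes librosa is installed (it's not a dependency of this repo), and the file names are placeholders:

```python
import numpy as np
import librosa

def median_f0(path):
    """Median voiced F0 of a clip in Hz, via librosa's pyin tracker."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    return np.nanmedian(f0[voiced])

# Placeholder file names -- substitute your own clips.
src = median_f0("input_voice.wav")
tgt = median_f0("target_speaker_sample.wav")

# Semitone offset that maps the source median onto the target median.
trans = round(12 * np.log2(tgt / src))
print(f"source ~{src:.0f} Hz, target ~{tgt:.0f} Hz -> try trans = {trans:+d}")
```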
I'll post an overview of what I've done and what I got
INFO:32k:Saving model and optimizer state at iteration 2001 to ./logs/32k/G_18000.pth
Result: with the untrained model (G_0.pth), the output is a mess of static. With the trained model, the output sounds a little like the training data, but it still has a low-frequency noise that makes it sound like {[correct sound], [silence]} on repeat at (guessing) ~20 Hz.
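To put a number on that modulation rate instead of guessing by ear, a sketch like this can estimate it from the output wav (assumes numpy/scipy; the file name is a placeholder):

```python
import numpy as np
from scipy.io import wavfile

# Placeholder: the wav produced by inference.
sr, y = wavfile.read("changed_shifted-5.wav")
y = y.astype(np.float64)
if y.ndim > 1:
    y = y.mean(axis=1)  # mix stereo down to mono

# Amplitude envelope: RMS over 5 ms hops -> envelope sampled at ~200 Hz.
hop = sr // 200
env_sr = sr / hop
frames = len(y) // hop
env = np.sqrt(np.mean(y[: frames * hop].reshape(frames, hop) ** 2, axis=1))

# Spectrum of the envelope; its strongest peak is the modulation rate.
spec = np.abs(np.fft.rfft(env - env.mean()))
freqs = np.fft.rfftfreq(len(env), d=1.0 / env_sr)
peak = freqs[np.argmax(spec[1:]) + 1]
print(f"dominant amplitude-modulation rate: {peak:.1f} Hz")
```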
My guess: one or more of the following may be true. My question is mainly to verify my understanding of these points and to learn why generation was not successful.
pip freeze output below.

This is the first VC repo I actually got to work, so kudos to the developers! I'm just struggling to get it to sound right.