innnky / so-vits-svc

A singing voice conversion model based on VITS and SoftVC
GNU Affero General Public License v3.0

Quality in pure voice? #41

Closed devNegative-asm closed 1 year ago

devNegative-asm commented 1 year ago

I'll post an overview of what I've done and what I got:

  1. download each model listed in the readme
  2. split a 26-minute audio track of a single speaker, with on-again-off-again speaking and silence, into 8s-20s segments (no music); see the rough splitting sketch after this list
  3. encode and train using the scripts listed in the English readme, for 2000 epochs across 2 training runs, ending with: INFO:32k:Saving model and optimizer state at iteration 2001 to ./logs/32k/G_18000.pth
  4. run inference on my own pure speech (no music)
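
For reference, here's roughly how I did the splitting in step 2 (just a sketch of my own approach, not anything from this repo; the input file name, output folder, clip length, and output naming are placeholders):

import librosa
import soundfile as sf

SRC = "speaker_26min.wav"       # hypothetical input recording
OUT = "dataset_raw/speaker0"    # hypothetical output folder (must already exist)
CLIP_SEC = 10.0                 # clip length in seconds (I actually used 8-20 s chunks)

y, sr = librosa.load(SRC, sr=None)
samples = int(CLIP_SEC * sr)
# naive fixed-length slicing -- note this keeps whatever silence falls inside a chunk
for n, start in enumerate(range(0, len(y) - samples, samples)):
    sf.write(f"{OUT}/clip_{n:04d}.wav", y[start:start + samples], sr)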

Result: with the untrained model (G_0.pth), the output is a mess of static. With the trained model, the output sounds a little like the training data, but it has a low-frequency artifact that makes it sound like {[correct sound], [silence]} repeating at (guessing) ~20 Hz.

My guess: one or more of the following may be the cause. My question is mainly to verify my understanding of these points and to learn why generation was not successful:

  1. Including silent segments in the training data is bad? (See the silence-check sketch after the pip freeze below.)
  2. Is something wrong with my sound input to inference (too quiet? too low pitch?)
  3. Is more than 26 minutes of audio necessary?
  4. Should I have included music in my training data or inference clip?
  5. Is 2000 epochs too many, or not enough?
  6. Am I supposed to do something with D_18000.pth? I don't know what D and G are supposed to be.
  7. Is something wrong with my hubert-soft.pt?
  8. I'm running Python 3.10.6, so one or more of the dependencies might be messed up. I'll put the pip freeze output below.
absl-py==1.4.0
appdirs==1.4.4
audioread==3.0.0
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.0.1
cmake==3.25.2
contourpy==1.0.7
cycler==0.11.0
Cython==0.29.33
decorator==5.1.1
filelock==3.9.0
fonttools==4.38.0
google-auth==2.16.0
google-auth-oauthlib==0.4.6
googleads==3.8.0
grpcio==1.51.1
httplib2==0.21.0
idna==3.4
imageio==2.25.0
joblib==1.2.0
kiwisolver==1.4.4
librosa==0.9.2
llvmlite==0.39.1
Markdown==3.4.1
MarkupSafe==2.1.2
matplotlib==3.6.3
mpmath==1.2.1
networkx==3.0
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.0
pandas==1.5.3
Pillow==9.4.0
pooch==1.6.0
praat-parselmouth==0.4.3
protobuf==3.20.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pyparsing==3.0.9
PySocks==1.7.1
python-dateutil==2.8.2
pytorch-triton @ https://download.pytorch.org/whl/nightly/pytorch_triton-2.0.0%2B0d7e753227-cp310-cp310-linux_x86_64.whl
pytz==2022.7.1
PyWavelets==1.4.1
pyworld==0.3.2
PyYAML==6.0
requests==2.28.2
requests-oauthlib==1.3.1
resampy==0.4.2
rsa==4.9
scikit-image==0.19.3
scikit-learn==1.2.1
scikit-maad==1.3.12
scipy==1.10.0
six==1.16.0
soundfile==0.11.0
stopit==1.1.1
suds-jurko==0.6
sympy==1.11.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tifffile==2023.1.23.1
torch @ https://download.pytorch.org/whl/nightly/cu118/torch-2.0.0.dev20230201%2Bcu118-cp310-cp310-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/nightly/cu118/torchaudio-2.0.0.dev20230201%2Bcu118-cp310-cp310-linux_x86_64.whl
torchvision @ https://download.pytorch.org/whl/nightly/cu118/torchvision-0.15.0.dev20230201%2Bcu118-cp310-cp310-linux_x86_64.whl
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.14
Werkzeug==2.2.2
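
Regarding guess 1 above, here's the kind of quick check I can run over the prepared clips to flag ones that are mostly silence (again just a sketch; the folder path and the -40 dBFS cutoff are placeholders of mine):

import glob
import numpy as np
import librosa

for path in sorted(glob.glob("dataset_raw/speaker0/*.wav")):   # hypothetical dataset folder
    y, _ = librosa.load(path, sr=None)
    rms_db = 20 * np.log10(np.sqrt(np.mean(y ** 2)) + 1e-9)    # overall level in dBFS
    if rms_db < -40:                                           # arbitrary "mostly silent" cutoff
        print(f"{path}: {rms_db:.1f} dBFS -- probably mostly silence")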

This is the first VC repo I've actually gotten to work, so kudos to the developers! I'm just struggling to get it to sound right.

innnky commented 1 year ago

My guess is that this is due to an insufficient amount of training data. In my experience, you generally need at least 1000 wav files for training to get good results.

devNegative-asm commented 1 year ago

What about the length of each wav file? Does it matter? Mine were 8 to 20 seconds each.

innnky commented 1 year ago

I usually use 2s-10s clips.

devNegative-asm commented 1 year ago

I ran another training session with ~970 clips and got the same issue. This is what the result sounds like after "Saving model and optimizer state at iteration 247 to ./logs/32k/G_20000.pth": https://github.com/devNegative-asm/various_files/blob/main/changed_shifted-5.wav

devNegative-asm commented 1 year ago

Maybe I'm not using speaker embeddings correctly? I haven't used the gradio UI, so maybe that handles it better. I'll close the issue if I find a solution, but otherwise this seems to be a similar problem to https://github.com/innnky/so-vits-svc/issues/95

devNegative-asm commented 1 year ago

OK, I think I figured it out. The pitch slider in the UI (or trans in the inference script) needs to bring the input voice well within the vocal range of the target speaker. Not doing that adds the robotic sound. Positive numbers increase pitch; negative numbers decrease it.

I was using 0, but 12 ended up working a lot better for male -> female.
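
In case it helps anyone else, here's a rough way to pick a starting value for trans by comparing the input's median pitch against a guess at the target speaker's typical pitch (just a sketch; the input file name and the 200 Hz target figure are placeholders, and librosa's pyin is only one of several ways to estimate F0):

import numpy as np
import librosa

TARGET_F0_HZ = 200.0   # rough median pitch of the target speaker (my guess for a female voice)

y, sr = librosa.load("my_input_speech.wav", sr=None)   # hypothetical inference input
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"), sr=sr)
source_f0 = np.nanmedian(f0[voiced])                   # median F0 over voiced frames

# semitone shift that moves the source median pitch onto the target median pitch
trans = int(round(12 * np.log2(TARGET_F0_HZ / source_f0)))
print(f"source median F0 ~{source_f0:.0f} Hz -> try trans = {trans}")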