myshell-ai / OpenVoice

Instant voice cloning by MIT and MyShell.
https://research.myshell.ai/open-voice
MIT License

The cloned voice is far from the reference speaker #220

Open aicoder2048 opened 6 months ago

aicoder2048 commented 6 months ago

Hi,

I am trying out OpenVoice (V1). It runs without errors, but the cloned voice is far from the reference speaker. Sometimes I give it a male reference speaker mp3 and get back a female voice.

I ran the code from "demo_part1.ipynb" and only changed the reference speaker's mp3.
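For context, the flow I am running is essentially the V1 demo: build the base-speaker TTS, extract the target tone-color embedding from the reference mp3, then convert. A rough sketch from memory (checkpoint paths follow the demo layout; the reference mp3 path is my own file):

    import torch
    # Depending on how the repo is installed, these may be `from api import ...`
    # and `import se_extractor` instead of the packaged `openvoice.*` paths.
    from openvoice import se_extractor
    from openvoice.api import BaseSpeakerTTS, ToneColorConverter

    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    ckpt_base = 'checkpoints/base_speakers/EN'   # V1 base speaker checkpoint dir
    ckpt_converter = 'checkpoints/converter'     # V1 tone color converter dir

    # Base speaker TTS generates the source audio that gets converted
    base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
    base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')

    # Tone color converter transplants the reference speaker's timbre
    tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
    tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

    # source_se: precomputed embedding of the base speaker (ships with the checkpoints)
    source_se = torch.load(f'{ckpt_base}/en_default_se.pth').to(device)

    # target_se: extracted from the reference mp3, which is the only thing I changed
    reference_speaker = 'resources/my_reference_speaker.mp3'   # my own file
    target_se, audio_name = se_extractor.get_se(
        reference_speaker, tone_color_converter, target_dir='processed', vad=True)

    src_path = 'outputs/tmp.wav'
    base_speaker_tts.tts('This is a test sentence.', src_path,
                         speaker='default', language='English', speed=1.0)
    tone_color_converter.convert(
        audio_src_path=src_path, src_se=source_se, tgt_se=target_se,
        output_path='outputs/output_en_default.wav')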

I suspect the torch/embedding version is not compatible. I am using:

    (Speech2Rag) OpenVoice> pip show torch
    Name: torch
    Version: 2.1.2+cu121
    Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
    Home-page: https://pytorch.org/
    Author: PyTorch Team
    Author-email: packages@pytorch.org
    License: BSD-3
    Location: C:\Users\Sean2092\miniconda3\Lib\site-packages
    Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
    Required-by: pytorch-lightning, torchaudio, torchmetrics, torchvision
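For anyone trying to reproduce this, a quick sanity check of what the runtime actually sees (plain PyTorch calls, nothing OpenVoice-specific):

    import torch

    # Confirm the installed build and whether CUDA is actually usable at runtime
    print(torch.__version__)          # e.g. 2.1.2+cu121
    print(torch.version.cuda)         # CUDA version the wheel was built against
    print(torch.cuda.is_available())  # False means torch cannot see the GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))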

Could someone with more experience and success help out? I am sure I have something incorrect, libraries or settings, but I cannot figure out what it might be. Please help.

Thanks a lot, Sean

aicoder2048 commented 6 months ago

Does source_se need to come from audio of the same person as the source audio used for inference to get a closer or better-quality clone?

aicoder2048 commented 6 months ago

I got the following warnings. Could any of them cause the clone similarity to degrade so drastically?

  1. UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  2. UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
  3. UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED


aicoder2048 commented 6 months ago

Does source_se need to come from audio of the same person as the source audio used for inference to get a closer or better-quality clone?

I tried using the same (base-speaker) person's voice/mp3 both for computing source_se (the tone color embedding) and as the source audio for inference, with a third male voice/mp3 as the reference speaker. The resulting cloned audio, which is sometimes female and a bit noisy, is still far from the male reference audio. Very bizarre!
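Concretely, the variation I tried looks roughly like this (continuing from the sketch in my first comment; the mp3 file names are my own):

    # Extract source_se from the same person's recording that is used as the
    # source audio, instead of loading the precomputed en_default_se.pth.
    source_audio = 'resources/base_speaker_sentence.mp3'       # my own file
    reference_speaker = 'resources/third_male_reference.mp3'   # my own file

    source_se, _ = se_extractor.get_se(
        source_audio, tone_color_converter, target_dir='processed', vad=True)
    target_se, _ = se_extractor.get_se(
        reference_speaker, tone_color_converter, target_dir='processed', vad=True)

    tone_color_converter.convert(
        audio_src_path=source_audio, src_se=source_se, tgt_se=target_se,
        output_path='outputs/output_same_person_se.wav')
    # In my runs the output is still far from the male reference, and is
    # sometimes female-sounding, so matching source_se to the source audio
    # did not help.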

So, my conclusion from the experiment is that source_se and the source audio for inference don't have to come from the same person, or at least matching them does not improve clone similarity.

Just a couple of notes to share ... have fun

Sean

francogrex commented 2 months ago

It is true that for V1 the reference audio used for cloning and the generated outputs are not similar. I don't think it clones the voice very well.