myshell-ai / OpenVoice

Instant voice cloning by MIT and MyShell.
https://research.myshell.ai/open-voice
MIT License

Converter doesn't work for Chinese? #287

Open Strive-for-excellence opened 1 month ago

Strive-for-excellence commented 1 month ago

When I run inference with https://huggingface.co/spaces/myshell-ai/OpenVoiceV2, the generated audio does not match the reference audio.

Text: 该努力我也努力过了,该尝试的我都尝试过了! (English: "I have made every effort I should, and I have tried everything I should!")

Reference audio: https://github.com/Strive-for-excellence/show/blob/main/ref.mp3

Base speaker generated audio: https://github.com/Strive-for-excellence/show/blob/main/output_v2_3_ori.wav

Generated audio: https://github.com/Strive-for-excellence/show/blob/main/output.wav

The base speaker generated audio sounds normal, but the audio after conversion does not.

dosubot[bot] commented 1 month ago

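@Strive-for-excellence Hello! I am Dosu, a bot that can help you solve problems, answer your questions, and become a contributor. I will help you while you wait for a human maintainer.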

The inconsistency in the generated audio compared to the reference audio could be due to several factors:

  1. Accent and Emotion: OpenVoice only clones the tone color of the reference speaker and does not clone the accent or emotion. The accent and emotion are controlled by the base speaker TTS model. If the base speaker model does not match the desired accent or emotion, the generated audio will not reflect these characteristics.

  2. Quality of Reference Audio: Ensure that the reference audio is clean, without background noise, and does not contain speech from more than one person. Also, check that the reference audio is not too short and does not contain long blank sections.

  3. Base Speaker Model: For multi-lingual and cross-lingual usage, ensure that you have a base speaker model in the desired language. If you do not have a suitable base speaker model, you can use the OpenAI TTS model as the base speaker.
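The reference-audio checks in point 2 (not too short, no long blank sections) can be roughly automated. The following is a minimal sketch, not part of OpenVoice: the function name `reference_audio_report`, the -40 dBFS silence threshold, and the 20 ms frame size are all assumptions chosen for illustration. It uses only the Python standard library and reads 16-bit PCM WAV, so an MP3 reference would need to be converted to WAV first.

```python
import math
import struct
import wave


def reference_audio_report(path, silence_dbfs=-40.0, frame_ms=20):
    """Summarise duration and silence in a 16-bit PCM WAV file.

    Hypothetical helper: the threshold and frame size are illustrative
    defaults, not values recommended by the OpenVoice authors.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())

    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    mono = samples[::channels]  # keep only the first channel

    frame_len = max(1, rate * frame_ms // 1000)
    # Convert the dBFS threshold to a linear amplitude on the 16-bit scale.
    threshold = 32768.0 * 10 ** (silence_dbfs / 20.0)

    n_frames = silent_frames = run = longest_run = 0
    for start in range(0, len(mono), frame_len):
        frame = mono[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        n_frames += 1
        if rms < threshold:
            silent_frames += 1
            run += 1
            longest_run = max(longest_run, run)
        else:
            run = 0

    return {
        "duration_s": len(mono) / rate,
        "silent_fraction": silent_frames / n_frames if n_frames else 0.0,
        "longest_silence_s": longest_run * frame_len / rate,
    }
```

A very small `duration_s`, or a large `longest_silence_s`, would suggest trimming or re-recording the reference clip before cloning. The check says nothing about background noise or multiple speakers, which still need to be verified by listening.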

To address these issues, ensure that you are using the correct base speaker model for Chinese (zh_base_speaker_tts) and that the style parameter is set to 'default' for Chinese text inputs. The model primarily supports English and Chinese; other languages require a suitable base speaker model.

If the webpage for OpenVoiceV2 on Hugging Face is unresponsive, you might need to wait until it is accessible again to further assess the model's performance or check for updates. You can also refer to the QnA and multi-lingual & cross-lingual examples for additional guidance [1][2].

To continue talking to Dosu, mention @dosu.

Zengyi-Qin commented 1 month ago

Hi, the result you get is normal. The generated voice clones only the tone color, not other features such as intonation and accent.