myshell-ai / OpenVoice

Instant voice cloning by MIT and MyShell.
https://research.myshell.ai/open-voice
MIT License

v2 does not work well with cosyvoice TTS #311

Open xipingL opened 1 week ago

xipingL commented 1 week ago
  1. Generate audio with the CosyVoice Chinese female voice (first video), then convert it with OpenVoice ToneColorConverter using the target_se from the second video. The result (third video) has a severe electrical-sounding artifact.
  2. Generate audio with the CosyVoice Chinese male voice (fourth video), then convert it with OpenVoice ToneColorConverter using the target_se from the fifth video. The result (sixth video) does not take on the target timbre.

https://github.com/user-attachments/assets/1c7a1f7c-6352-46d2-b607-0d8988a445f8

https://github.com/user-attachments/assets/1165b13f-67c4-481c-8f07-842a587943e4

https://github.com/user-attachments/assets/693f0fd4-45da-4302-9f3b-7c7a2d42e36d

https://github.com/user-attachments/assets/42a0cc0b-e1ec-4a04-bc7f-33fdb761ec5d

https://github.com/user-attachments/assets/39c4675e-84e5-4c40-b526-80d57820dd9f

https://github.com/user-attachments/assets/a8ecf31e-0f25-4d77-8a9a-3fb425c75279

Thanks.

dosubot[bot] commented 1 week ago

@xipingL Hello there! I'm a bot designed to help you with bugs, questions, and contributions. I'm here to assist you while waiting for a human maintainer.

I found a similar open issue, "Converter doesn't work for Chinese?", which was last updated on August 03, 2024. The suggested solution includes:

  1. Accent and Emotion: OpenVoice only clones the tone color of the reference speaker and does not clone the accent or emotion. The base speaker TTS model controls these aspects.
  2. Quality of Reference Audio: Ensure the reference audio is clean, without background noise, and does not contain speech from more than one person.
  3. Base Speaker Model: For multi-lingual and cross-lingual usage, ensure you have a base speaker model in the desired language. If not, consider using the OpenAI TTS model as the base speaker.
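As a rough illustration of the second point, the sketch below runs a cheap sanity check on a reference WAV file before it is used for tone color extraction. It is not part of OpenVoice: the function name `inspect_reference_wav` and the choice of statistics are assumptions made for this example. It only reports duration, channel count, sample rate, and the share of clipped samples, and it cannot detect background noise or multiple speakers.

```python
import array
import sys
import wave


def inspect_reference_wav(path):
    """Return basic stats for a 16-bit PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)

    # WAV stores samples little-endian; swap bytes on big-endian hosts.
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()

    # Count samples that sit at the extremes of the 16-bit range.
    clipped = sum(1 for s in samples if s >= 32767 or s <= -32768)
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_s": n_frames / sample_rate,
        "clipped_ratio": clipped / len(samples) if samples else 0.0,
    }
```

A very short clip or a high clipped ratio suggests the reference is worth re-recording or trimming before extraction. What counts as "too short" or "too clipped" is left to the reader, since the thread gives no thresholds.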

Additionally, ensure you are using the correct base speaker model for Chinese (zh_base_speaker_tts) and that the style parameter is set to 'default' for Chinese text inputs [1].

The problems you see when converting CosyVoice output (both the female and male voices) with OpenVoice ToneColorConverter are likely related to the fact that OpenVoice clones only the tone color, not the accent or emotion. OpenVoice v2 relies on base speaker models from MeloTTS, and CosyVoice output may not align well with them [2][3][4].
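One thing that may be worth trying when the source audio does not come from a MeloTTS base speaker is to extract the source embedding from the CosyVoice output itself rather than loading a base speaker embedding. The following is a minimal sketch of that idea, not a confirmed fix for this issue. The helper name `convert_with_extracted_embeddings`, the file names, and the checkpoint path are assumptions for illustration; the OpenVoice calls (`ToneColorConverter`, `load_ckpt`, `convert`, `se_extractor.get_se`) follow their use in the OpenVoice demo notebooks.

```python
def convert_with_extracted_embeddings(converter, extract_se,
                                      source_wav, reference_wav, output_wav):
    """Convert source_wav toward the timbre of reference_wav.

    Both embeddings are extracted from audio, so no base speaker
    embedding file is needed (hypothetical helper, not an OpenVoice API).
    """
    source_se = extract_se(source_wav, converter)     # e.g. the CosyVoice output
    target_se = extract_se(reference_wav, converter)  # the voice to imitate
    converter.convert(
        audio_src_path=source_wav,
        src_se=source_se,
        tgt_se=target_se,
        output_path=output_wav,
    )
    return output_wav


def run_with_openvoice():
    # Imports are local so the helper above can be read and tested
    # without OpenVoice or PyTorch installed.
    import torch
    from openvoice import se_extractor
    from openvoice.api import ToneColorConverter

    ckpt = "checkpoints_v2/converter"  # assumed checkpoint location; adjust as needed
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    converter = ToneColorConverter(f"{ckpt}/config.json", device=device)
    converter.load_ckpt(f"{ckpt}/checkpoint.pth")

    def extract_se(path, conv):
        # get_se returns (embedding, audio_name); only the embedding is used here.
        se, _ = se_extractor.get_se(path, conv, vad=True)
        return se

    # File names below are placeholders.
    return convert_with_extracted_embeddings(
        converter, extract_se,
        "cosyvoice_output.wav", "reference.wav", "converted.wav",
    )
```

Calling `run_with_openvoice()` requires the OpenVoice package and the v2 checkpoints to be available locally. Passing the converter and extractor in as arguments keeps the conversion step easy to swap or test in isolation.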
