langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Setting the voice of text to speech does not work #7003

Closed cycwll closed 1 week ago

cycwll commented 1 month ago

Dify version

0.6.16

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I use the cosyvoice model via xinference, and Dify uses it as the text-to-speech model. No matter how I configure the voice for Text to Speech, the voice value passed from Dify to xinference is always empty:

{'intput': '记得联系技术支持寻求进一步的帮助。', 'voice': '', 'response_format': 'mp3', 'speed': 1.0, 'stream': False}

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 1 month ago

Hey @cycwll! I'm here to help you with any bugs, questions, or contributions you have for the Dify repository. Let's work together to solve any issues you're encountering.

To resolve the issue where the voice parameter is always empty when passed from Dify to xinference in your self-hosted Dify version 0.6.16 running on Docker, you need to ensure that the voice parameter is correctly handled and passed through the relevant functions and API calls.

From the provided code, it appears that the XinferenceModelExtraParameter class and related methods do not include a voice parameter. You will need to modify the code to include this parameter where necessary.

Here is an example of how you can modify the XinferenceModelExtraParameter class and the related methods to include the voice parameter:

  1. Update the XinferenceModelExtraParameter class to include the voice parameter:
class XinferenceModelExtraParameter:
    model_format: str
    model_handle_type: str
    model_ability: list[str]
    max_tokens: int = 512
    context_length: int = 2048
    support_function_call: bool = False
    support_vision: bool = False
    voice: str = ""  # Add the voice parameter

    def __init__(self, model_format: str, model_handle_type: str, model_ability: list[str],
                 support_function_call: bool, support_vision: bool, max_tokens: int, context_length: int, voice: str) -> None:
        self.model_format = model_format
        self.model_handle_type = model_handle_type
        self.model_ability = model_ability
        self.support_function_call = support_function_call
        self.support_vision = support_vision
        self.max_tokens = max_tokens
        self.context_length = context_length
        self.voice = voice  # Initialize the voice parameter
  2. Update the _get_xinference_extra_parameter method to extract the voice parameter from the response:
@staticmethod
def _get_xinference_extra_parameter(server_url: str, model_uid: str) -> XinferenceModelExtraParameter:
    if not model_uid or not model_uid.strip() or not server_url or not server_url.strip():
        raise RuntimeError('server_url and model_uid must not be empty')

    url = str(URL(server_url) / 'v1' / 'models' / model_uid)

    session = Session()
    session.mount('http://', HTTPAdapter(max_retries=3))
    session.mount('https://', HTTPAdapter(max_retries=3))

    try:
        response = session.get(url, timeout=10)
    except (MissingSchema, ConnectionError, Timeout) as e:
        raise RuntimeError(f'get xinference model extra parameter failed, url: {url}, error: {e}')
    if response.status_code != 200:
        raise RuntimeError(f'get xinference model extra parameter failed, status code: {response.status_code}, response: {response.text}')

    response_json = response.json()
    model_format = response_json.get('model_format', 'ggmlv3')
    model_ability = response_json.get('model_ability', [])
    voice = response_json.get('voice', '')  # Extract the voice parameter

    if response_json.get('model_type') == 'embedding':
        model_handle_type = 'embedding'
    elif model_format == 'ggmlv3' and 'chatglm' in response_json['model_name']:
        model_handle_type = 'chatglm'
    elif 'generate' in model_ability:
        model_handle_type = 'generate'
    elif 'chat' in model_ability:
        model_handle_type = 'chat'
    else:
        raise NotImplementedError(f'xinference model ability {model_ability} is not supported')

    support_function_call = 'tools' in model_ability
    support_vision = 'vision' in model_ability
    max_tokens = response_json.get('max_tokens', 512)
    context_length = response_json.get('context_length', 2048)

    return XinferenceModelExtraParameter(
        model_format=model_format,
        model_handle_type=model_handle_type,
        model_ability=model_ability,
        support_function_call=support_function_call,
        support_vision=support_vision,
        max_tokens=max_tokens,
        context_length=context_length,
        voice=voice  # Pass the voice parameter
    )
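To check quickly whether this fix can work at all, you can query the same endpoint the helper parses and inspect the raw payload yourself. A minimal sketch, assuming xinference is reachable at the host IP and port shown below (both are placeholders to replace):

import json
from requests import Session

server_url = "http://192.168.1.10:9997"   # placeholder: your host IP and xinference port
model_uid = "my-cosyvoice-uid"            # placeholder: your model UID

# Same request _get_xinference_extra_parameter makes internally
response = Session().get(f"{server_url}/v1/models/{model_uid}", timeout=10)
response.raise_for_status()
payload = response.json()
print(json.dumps(payload, indent=2, ensure_ascii=False))

# If the payload has no 'voice' key, response_json.get('voice', '') above will
# always return '', which matches the empty value in your xinference log.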

Additionally, ensure that your Dify Docker deployment is correctly configured to access the Xinference endpoint. According to the Dify documentation, you need to set the server URL to the machine's IP address instead of localhost:

When deploying Dify with Docker, pay attention to the network configuration and make sure the Dify container can reach the Xinference endpoint. The Dify container cannot access localhost; use the host machine's IP address instead.
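A quick way to verify this from inside the container is a plain HTTP request against the host IP (a hypothetical check; the IP and port below are placeholders):

# Hypothetical connectivity check, run inside the docker-api-1 container,
# e.g. via `docker exec -it docker-api-1 python`. Replace the IP and port
# with your host machine's address and xinference's port.
import requests

try:
    r = requests.get("http://192.168.1.10:9997/v1/models", timeout=5)
    print(r.status_code)  # 200 means the container can reach xinference
except requests.RequestException as e:
    print(f"cannot reach xinference: {e}")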

By making these changes, you ensure that the voice parameter is included in the XinferenceModelExtraParameter class and is correctly extracted from the response and passed through the relevant methods. Additionally, ensure that your network settings allow Dify to access the Xinference endpoint [1][2][3].

To continue talking to Dosu, mention @dosu.

ic-xu commented 1 month ago
[screenshot]

Does Xinference not have a TTS model?

cycwll commented 1 month ago

My Dify runs via Docker Compose. The network from Dify to xinference is not the problem: the interface can be called and text is converted to speech, but the voice parameter is not passed to xinference.

  1. I entered the docker-api-1 and docker-worker-1 containers and modified the file /app/api/core/model_runtime/model_providers/xinference/xinference_helper.py according to your guidance.
  2. I then restarted the docker-api-1 and docker-worker-1 containers with: docker restart docker-api-1
  3. After restarting, I confirmed that the files inside the containers still contained my modifications.

Looking at the xinference log, the voice parameter is still empty:

2024-08-07 01:02:20,162 xinference.core.model 479 DEBUG Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7fdad4bc0a40>,), kwargs: {'input': '请联系技术支持团队获取帮助。', 'voice': '', 'response_format': 'mp3', 'speed': 1.0, 'stream': False}

cycwll commented 4 weeks ago

@dosu thanks for your help!

I upgraded to v0.7.0, but the situation got worse. The "voice" parameter now becomes "default", causing this xinference error:

xinference.core.model 479 DEBUG Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7fdad4bc0a40>,), kwargs: {'input': '请联系技术支持团队获取帮助。', 'voice': 'default', 'response_format': 'mp3', 'speed': 1.0, 'stream': False}

"AssertionError: [address=0.0.0.0:43313, pid=467] Invalid voice default, CosyVoice available speakers: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']"

dosubot[bot] commented 4 weeks ago

To ensure that the correct "voice" parameter is passed to xinference in your self-hosted Dify setup (version 0.7.0, running on Docker), you need to make sure that the voice parameter is correctly set and validated before invoking the text-to-speech model. Here is the relevant part of the code that ensures the correct voice parameter is used:

def _tts_invoke_streaming(self, model: str, credentials: dict, content_text: str, voice: str) -> any:
    """
    _tts_invoke_streaming text2speech model

    :param model: model name
    :param credentials: model credentials
    :param content_text: text content to be translated
    :param voice: model timbre
    :return: text translated to audio file
    """
    if credentials['server_url'].endswith('/'):
        credentials['server_url'] = credentials['server_url'][:-1]

    try:
        handle = RESTfulAudioModelHandle(credentials['model_uid'], credentials['server_url'], auth_headers={})
        model_support_voice = [x.get("value") for x in self.get_tts_model_voices(model=model, credentials=credentials)]
        if not voice or voice not in model_support_voice:
            voice = self._get_model_default_voice(model, credentials)
        word_limit = self._get_model_word_limit(model, credentials)
        if len(content_text) > word_limit:
            sentences = self._split_text_into_sentences(content_text, max_length=word_limit)
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=min(3, len(sentences)))
            futures = [executor.submit(
                handle.speech,
                input=sentences[i],
                voice=voice,
                response_format="mp3",
                speed=1.0,
                stream=False
            ) for i in range(len(sentences))]

            for index, future in enumerate(futures):
                response = future.result()
                for i in range(0, len(response), 1024):
                    yield response[i:i + 1024]
        else:
            response = handle.speech(
                input=content_text.strip(),
                voice=voice,
                response_format="mp3",
                speed=1.0,
                stream=False
            )

            for i in range(0, len(response), 1024):
                yield response[i:i + 1024]
    except Exception as ex:
        raise InvokeBadRequestError(str(ex))

To avoid the error where the "voice" parameter is set to "default" instead of the specified value, ensure that:

  1. The voice parameter is correctly passed to the _tts_invoke_streaming method.
  2. The voice parameter is included in the list of supported voices for the model.

If the voice parameter is not specified or is not in the list of supported voices, the code defaults to the model's default voice:

if not voice or voice not in model_support_voice:
    voice = self._get_model_default_voice(model, credentials)
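If you need to see which branch is taken at runtime, you could temporarily instrument the fallback inside _tts_invoke_streaming, for example (an illustrative addition, not part of the shipped code):

import logging

logger = logging.getLogger(__name__)

# Temporary instrumentation: log the incoming voice and the supported list
# before the fallback overwrites it, so the container log shows whether the
# value arrived empty or merely failed the membership check.
if not voice or voice not in model_support_voice:
    logger.warning("TTS voice %r not in supported voices %r; falling back to default",
                   voice, model_support_voice)
    voice = self._get_model_default_voice(model, credentials)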

Make sure that the voice parameter you are passing is valid and supported by the model. You can check the supported voices using the get_tts_model_voices method:

def get_tts_model_voices(self, model: str, credentials: dict, language: Optional[str] = None) -> list:
    audio_model_name = credentials.get('audio_model_name', '__default')
    for key, voices in self.model_voices.items():
        if key in audio_model_name:
            if language and language in voices:
                return voices[language]
            elif 'all' in voices:
                return voices['all']

    return self.model_voices['__default']['all']
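For example, you could call it directly to inspect the accepted values (hypothetical usage; model_instance and credentials stand in for your configured xinference TTS model object and its credential dict):

# Hypothetical usage: list the voice values Dify will accept for the model.
# 'model_instance' and 'credentials' are placeholders for your configured
# xinference TTS model object and its credential dict.
voices = model_instance.get_tts_model_voices(model="cosyvoice", credentials=credentials)
print([v.get("value") for v in voices])
# The voice selected in the UI must match one of these values, e.g. '中文男'.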

By ensuring the voice parameter is valid and supported, you can avoid the issue of it defaulting to "default" [1].

To continue talking to Dosu, mention @dosu.

cycwll commented 4 weeks ago

@dosu After adding the feature in my application, I did not set a voice, and the voice used defaulted to "voice": "default". Then I set the voice to "Chinese" - "Chinese male", but the problem goes back to the beginning and "voice" is empty:

{'intput': '记得联系技术支持寻求进一步的帮助。', 'voice': '', 'response_format': 'mp3', 'speed': 1.0, 'stream': False}

Additionally, please help me confirm whether the following modifications are a valid way to verify the fix:

  1. I entered the docker-api-1 and docker-worker-1 containers and modified the file /app/api/core/model_runtime/model_providers/xinference/xinference_helper.py according to your guidance.
  2. I then restarted the docker-api-1 and docker-worker-1 containers with: docker restart docker-api-1
  3. After restarting, I confirmed that the files inside the containers still contained my modifications.
ic-xu commented 4 weeks ago

Can you give me a temporary Xinference key to verify?

cycwll commented 4 weeks ago

@ic-xu My xinference is deployed locally

hunterhome commented 1 week ago

With text-to-speech auto-play enabled, the xinference console shows:

Invalid voice default, CosyVoice available speakers: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']

crazywoola commented 1 week ago

Should be fixed in #7958