langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

tts cannot switch timbre #8655

Open zscxc opened 2 days ago

zscxc commented 2 days ago

Dify version

0.8.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I selected the male voice, but clicking play produces a female voice instead of a male one. Whichever voice I switch to, the output is still a Chinese female voice. Is there anywhere this can be adjusted? The model I used is CosyVoice-300M-Instruct.

✔️ Expected Behavior

I want to be able to change the voice when calling TTS through the API.

❌ Actual Behavior

No matter which timbre is selected, the output voice does not change.

dosubot[bot] commented 2 days ago

Hello, @zscxc! I'm here to help you with any bugs, questions, or contributions you might have. Ask me anything! 😊

The issue you're experiencing with the timbre switching for the CosyVoice-300M-Instruct model might be related to how the payload is constructed and sent to the API. The logic for handling different voice types is implemented in the _build_tts_payload method. This method constructs the payload based on the model_type parameter, which can be one of the following: PresetVoice, CloneVoice, CloneVoice_CrossLingual, or InstructVoice.

Here is the relevant code snippet:

def _build_tts_payload(
    self,
    model_type: str,
    content_text: str,
    model_role: str,
    prompt_text: str,
    prompt_audio: str,
    instruct_text: str,
):
    if model_type == TTSModelType.PresetVoice.value and model_role:
        return {"tts_text": content_text, "role": model_role}
    if model_type == TTSModelType.CloneVoice.value and prompt_text and prompt_audio:
        return {"tts_text": content_text, "prompt_text": prompt_text, "prompt_audio": prompt_audio}
    if model_type == TTSModelType.CloneVoice_CrossLingual.value and prompt_audio:
        lang_tag = self._detect_lang_code(content_text)
        return {"tts_text": f"{content_text}", "prompt_audio": prompt_audio, "lang_tag": lang_tag}
    if model_type == TTSModelType.InstructVoice.value and instruct_text and model_role:
        return {"tts_text": content_text, "role": model_role, "instruct_text": instruct_text}

    raise RuntimeError(f"Invalid params for {model_type}")

This method is called within the _invoke method to generate the appropriate payload for the text-to-speech request [1].

To troubleshoot the issue, you can check the following:

  1. Ensure that the model_type parameter is correctly set to the desired voice type.
  2. Verify that the other parameters (model_role, prompt_text, prompt_audio, instruct_text) are correctly provided based on the selected model_type.
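
The two checks above can be sketched as a small standalone helper. This is only an illustration, not part of Dify: `REQUIRED_PARAMS` and `missing_params` are hypothetical names, the required fields are read off the conditions in the quoted `_build_tts_payload`, and the string keys are assumed to match the `TTSModelType` values, which should be confirmed against your Dify version.

```python
from __future__ import annotations

# Required non-empty parameters per voice type, taken from the conditions in
# the quoted _build_tts_payload. ASSUMPTION: these keys equal the
# TTSModelType enum values; verify them in your installation.
REQUIRED_PARAMS = {
    "PresetVoice": ("model_role",),
    "CloneVoice": ("prompt_text", "prompt_audio"),
    "CloneVoice_CrossLingual": ("prompt_audio",),
    "InstructVoice": ("instruct_text", "model_role"),
}


def missing_params(model_type: str, **params: str) -> list[str]:
    """Return the required parameters that are absent or empty (hypothetical helper)."""
    if model_type not in REQUIRED_PARAMS:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    # An empty string counts as missing, mirroring the truthiness checks
    # in the quoted method.
    return [name for name in REQUIRED_PARAMS[model_type] if not params.get(name)]
```

For example, `missing_params("InstructVoice", instruct_text="Speak calmly")` returns `["model_role"]`, which corresponds to the case where the quoted method would fall through to its `RuntimeError`.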

If the parameters are correctly set and the issue persists, there might be a bug in the implementation or a misconfiguration in the model settings. You may need to review the configuration files and ensure that the model is correctly initialized and configured to handle the specified voice types.

To continue talking to Dosu, mention @dosu.

zscxc commented 2 days ago

Do I need to implement this timbre logic in Dify?

dosubot[bot] commented 2 days ago

The timbre switching logic is already implemented in Dify within the _build_tts_payload method. This method constructs the payload based on the model_type parameter and other related parameters like model_role, prompt_text, prompt_audio, and instruct_text [1].

To ensure the timbre switching works correctly, you should:

  1. Verify that the model_type parameter is set to the correct voice type (e.g., PresetVoice, CloneVoice, CloneVoice_CrossLingual, or InstructVoice).
  2. Ensure that the other parameters (model_role, prompt_text, prompt_audio, instruct_text) are correctly provided based on the selected model_type.

If these parameters are correctly set and the issue persists, it might be due to a bug or misconfiguration. You should review the configuration files and ensure the model is correctly initialized and configured to handle the specified voice types. The configuration file sagemaker_tts.yaml specifies the available voice types and their corresponding parameters [2].

If everything seems correct and the issue still exists, you might need to debug the payload construction and API request handling to identify any discrepancies.
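
One way to debug the payload construction is to build the payload for two different voices and compare the results. The sketch below is an assumption-laden illustration rather than Dify code: `build_instruct_payload` is a standalone copy of the `InstructVoice` branch of `_build_tts_payload`, `changed_fields` is a hypothetical helper, and the role names `voice_a` and `voice_b` are placeholders for whatever voices your model exposes.

```python
def build_instruct_payload(content_text: str, model_role: str, instruct_text: str) -> dict:
    """Standalone copy of the InstructVoice branch, for debugging only."""
    if not (instruct_text and model_role):
        raise RuntimeError("Invalid params for InstructVoice")
    return {"tts_text": content_text, "role": model_role, "instruct_text": instruct_text}


def changed_fields(a: dict, b: dict) -> list:
    """Return the sorted keys whose values differ between two payloads (hypothetical helper)."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    # Placeholder role names; substitute the voices you selected in the UI.
    first = build_instruct_payload("Hello", "voice_a", "Speak calmly")
    second = build_instruct_payload("Hello", "voice_b", "Speak calmly")
    print(changed_fields(first, second))
```

If the only differing field is `role` and the audio still sounds the same, the selected voice is reaching the payload, so the cause is more likely downstream of payload construction (for example in the deployed model endpoint) than in this method. If `role` does not differ in the payload Dify actually sends, the selected voice is probably being lost before this method is called.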
