TTS cannot switch timbre

zscxc commented 2 days ago

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Select the male voice and click play: the audio is a female voice, not a male one. After switching voices, it is still a Chinese female voice. Is there anywhere this can be adjusted? The model I used was CosyVoice-300M-Instruct.

✔️ Expected Behavior

I want to be able to change the voice when calling through the API.

❌ Actual Behavior

No matter how the timbre is switched, the output voice does not change.

dosubot[bot] commented 2 days ago

Hello, @zscxc! I'm here to help you with any bugs, questions, or contributions you might have. Ask me anything! 😊

The issue you're experiencing with the timbre switching for the CosyVoice-300M-Instruct model might be related to how the payload is constructed and sent to the API. The logic for handling different voice types is implemented in the _build_tts_payload method. This method constructs the payload based on the model_type parameter, which can be one of the following: PresetVoice, CloneVoice, CloneVoice_CrossLingual, or InstructVoice.

Here is the relevant code snippet:

def _build_tts_payload(
        self,
        model_type: str,
        content_text: str,
        model_role: str,
        prompt_text: str,
        prompt_audio: str,
        instruct_text: str,
    ):
        if model_type == TTSModelType.PresetVoice.value and model_role:
            return {"tts_text": content_text, "role": model_role}
        if model_type == TTSModelType.CloneVoice.value and prompt_text and prompt_audio:
            return {"tts_text": content_text, "prompt_text": prompt_text, "prompt_audio": prompt_audio}
        if model_type == TTSModelType.CloneVoice_CrossLingual.value and prompt_audio:
            lang_tag = self._detect_lang_code(content_text)
            return {"tts_text": f"{content_text}", "prompt_audio": prompt_audio, "lang_tag": lang_tag}
        if model_type == TTSModelType.InstructVoice.value and instruct_text and model_role:
            return {"tts_text": content_text, "role": model_role, "instruct_text": instruct_text}

        raise RuntimeError(f"Invalid params for {model_type}")

This method is called within the _invoke method to generate the appropriate payload for the text-to-speech request ^[1].
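
As a rough illustration of the two role-based branches in the snippet above, the following standalone sketch shows how the selected role ends up in the payload. It is not Dify's code: the enum values are assumed to equal the member names, `build_role_payload` is a hypothetical helper that mirrors only the PresetVoice and InstructVoice branches, and the role names `male_voice` and `female_voice` are placeholders rather than real CosyVoice speaker names.

```python
from enum import Enum


class TTSModelType(Enum):
    # Assumption: each value equals its member name.
    PresetVoice = "PresetVoice"
    CloneVoice = "CloneVoice"
    CloneVoice_CrossLingual = "CloneVoice_CrossLingual"
    InstructVoice = "InstructVoice"


def build_role_payload(model_type: str, content_text: str,
                       model_role: str = "", instruct_text: str = "") -> dict:
    """Hypothetical helper mirroring only the two role-based branches."""
    if model_type == TTSModelType.PresetVoice.value and model_role:
        return {"tts_text": content_text, "role": model_role}
    if model_type == TTSModelType.InstructVoice.value and instruct_text and model_role:
        return {"tts_text": content_text, "role": model_role,
                "instruct_text": instruct_text}
    # Same fallback as the quoted method: missing fields are an error.
    raise RuntimeError(f"Invalid params for {model_type}")


# Placeholder role names; substitute the speakers your deployment exposes.
male = build_role_payload("InstructVoice", "Hello",
                          model_role="male_voice", instruct_text="Speak calmly.")
female = build_role_payload("InstructVoice", "Hello",
                            model_role="female_voice", instruct_text="Speak calmly.")
print(male)
print(female)
```

If the two printed payloads carry different `role` values but the audio still sounds the same, the payload construction is probably not where the voice selection is lost.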

To troubleshoot the issue, you can check the following:

- Ensure that the model_type parameter is correctly set to the desired voice type.
- Verify that the other parameters (model_role, prompt_text, prompt_audio, instruct_text) are correctly provided based on the selected model_type.
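
The two checks above can be sketched as a small pre-flight helper. This is only an illustration: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the table is derived from the conditions in the quoted `_build_tts_payload` snippet, not from Dify's configuration.

```python
# Required non-empty fields per voice type, read off the quoted conditions.
REQUIRED_FIELDS = {
    "PresetVoice": ("model_role",),
    "CloneVoice": ("prompt_text", "prompt_audio"),
    "CloneVoice_CrossLingual": ("prompt_audio",),
    "InstructVoice": ("instruct_text", "model_role"),
}


def missing_fields(model_type: str, **params: str) -> list:
    """Return the required fields that are absent or empty for model_type."""
    if model_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return [name for name in REQUIRED_FIELDS[model_type] if not params.get(name)]


# An InstructVoice request that forgot the role would be rejected upstream.
print(missing_fields("InstructVoice", instruct_text="Speak calmly."))
```

An empty list means the request would pass the corresponding branch; any listed name is a field that would trigger the `Invalid params` error.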

If the parameters are correctly set and the issue persists, there might be a bug in the implementation or a misconfiguration in the model settings. You may need to review the configuration files and ensure that the model is correctly initialized and configured to handle the specified voice types.

To continue talking to Dosu, mention @dosu.

zscxc commented 2 days ago

Should I implement this timbre logic in Dify?

dosubot[bot] commented 2 days ago

The timbre switching logic is already implemented in Dify within the _build_tts_payload method. This method constructs the payload based on the model_type parameter and other related parameters like model_role, prompt_text, prompt_audio, and instruct_text ^[1].

To ensure the timbre switching works correctly, you should:

- Verify that the model_type parameter is set to the correct voice type (e.g., PresetVoice, CloneVoice, CloneVoice_CrossLingual, or InstructVoice).
- Ensure that the other parameters (model_role, prompt_text, prompt_audio, instruct_text) are correctly provided based on the selected model_type.

If these parameters are correctly set and the issue persists, it might be due to a bug or misconfiguration. You should review the configuration files and ensure the model is correctly initialized and configured to handle the specified voice types. The configuration file sagemaker_tts.yaml specifies the available voice types and their corresponding parameters ^[2].

If everything seems correct and the issue still exists, you might need to debug the payload construction and API request handling to identify any discrepancies.
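
One way to look for such discrepancies is to capture the payloads produced for two different voice selections and compare them. The sketch below assumes you have already logged the two payload dicts yourself; `payload_diff` is a hypothetical helper and the payload contents are placeholder values, not output captured from Dify.

```python
import json


def payload_diff(first: dict, second: dict) -> dict:
    """Map each key whose value differs to a (first, second) pair."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k))
            for k in keys if first.get(k) != second.get(k)}


# Placeholder payloads standing in for two logged requests.
male_payload = {"tts_text": "Hello", "role": "male_voice"}
female_payload = {"tts_text": "Hello", "role": "female_voice"}

diff = payload_diff(male_payload, female_payload)
print(json.dumps(diff, ensure_ascii=False))
```

An empty diff would suggest the selected voice never reaches the payload, which points at the caller; a diff that shows different roles would point at the model server or its configuration instead.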

