lenML / ChatTTS-Forge

🍦 ChatTTS-Forge is a project developed around the TTS generation model, implementing an API Server and a Gradio-based WebUI.
https://huggingface.co/spaces/lenML/ChatTTS-Forge
GNU Affero General Public License v3.0

Missing `prompt1`, `prompt2`, and `prefix` Parameters in `google_text_synthesize` Function of `google_api.py` #62

Closed IrisSally closed 2 months ago

IrisSally commented 2 months ago

Issue Description

Hello,

I have noticed a potential issue in the `google_text_synthesize` function within the `google_api.py` module of the ChatTTS-Forge project. Specifically, the `ChatTTSConfig` object is being instantiated without the `prompt1`, `prompt2`, and `prefix` parameters. The current code snippet is as follows:

tts_config = ChatTTSConfig(
    style=params.get("style", ""),
    temperature=voice.temperature,
    top_k=voice.topK,
    top_p=voice.topP,
)

However, it should include the `prompt1`, `prompt2`, and `prefix` parameters to ensure that the TTS generation can accurately reflect the desired emotional tone and style. The corrected code should look like this:

tts_config = ChatTTSConfig(
    style=params.get("style", ""),
    temperature=voice.temperature,
    top_k=voice.topK,
    top_p=voice.topP,
    prompt1=params.get("prompt1", ""),
    prompt2=params.get("prompt2", ""),
    prefix=params.get("prefix", ""),
)

Without these parameters, downstream components will not be able to access `prompt1`, `prompt2`, and `prefix`, which may result in the generated speech lacking the intended emotional variation corresponding to different styles.

Could you please confirm whether this omission was intentional or an oversight? If it was an oversight, could you update the code to include these parameters?

Thank you for your attention to this matter.

zhzLuke96 commented 2 months ago

Hey, thanks for bringing this up!

You're right, we intentionally left out those parameters. There are two main reasons for this:

  1. `prompt1`, `prompt2`, and `prefix` are actually unofficial prompt engineering techniques. To be honest, they often lead to a decrease in generation quality. We figured most people probably wouldn't use these much in API calls.

  2. In Forge, we've implemented a pretty comprehensive style system. If you want to control the generation style, that system is more reliable: just write an appropriate style and specify it in the API call. It usually works better and is simpler.

That being said, adding support for more parameters isn't really a big deal. We do plan to make the API support as many features as possible in future updates, including the parameters you mentioned.
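To make the recommendation above concrete, here is a minimal sketch of building a request for Forge's Google-style synthesis endpoint. The outer field names (`input`, `voice`, `audioConfig`) follow the Google Cloud TTS request schema; the `style` field, the speaker name, and the endpoint path are assumptions for illustration, not the confirmed Forge API.

```python
# Sketch: request payload for a Google-TTS-style endpoint, e.g.
# POST /v1/text:synthesize (path and custom fields are assumptions).
import json


def build_synthesize_payload(text: str, style: str = "", speed: int = 3) -> dict:
    """Build a hypothetical request body with a Forge style attached."""
    return {
        "input": {"text": text},
        "voice": {
            "languageCode": "zh-CN",
            "name": "female2",  # speaker name: assumed example value
            "style": style,     # Forge style name: assumed field
        },
        "audioConfig": {
            "audioEncoding": "MP3",
            "speakingRate": speed,
        },
    }


payload = build_synthesize_payload("你好", style="chat")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The point is that a single `style` string selects the whole generation style, rather than the caller hand-tuning `prompt1`/`prompt2`/`prefix` per request.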

IrisSally commented 2 months ago

Even if I passed the style param in the Google API, it did not work. I tried it on the playground, and all the styles produced the same voice.

zhzLuke96 commented 2 months ago

> Even if I passed the style param in the Google API, it did not work. I tried it on the playground, and all the styles produced the same voice.

"All the styles were the same voice" might just be a perception issue, but the control over speed and intonation should be effective. For instance, if you set the prefix to [speed_5], you should noticeably hear an increase in speed (if there's truly no change, it might be a bug?).

The weak style control is actually expected; further improvements in control capability depend on ChatTTS officially releasing more components.

Firstly, the current style (i.e., the prompt1 and prompt2 slots) can be seen as a mechanism similar to the system prompt in ChatGPT. However, its control is weak because ChatTTS has not been specifically fine-tuned on these parameters (at least not in the currently released version). Improvements here will likely depend on the official team releasing new models, or on providing LoRA fine-tuning recipes that let the community address it.

Secondly, stronger style control relies on the ChatTTS encoder weights, which are also not open-sourced at the moment. As a result, we can only use "text as a system prompt" instead of "audio as a system prompt." This, too, depends on further open-sourcing by the official team.