langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The tts message returned by speech synthesis is incomplete #7287

Closed 401557122 closed 1 month ago

401557122 commented 1 month ago


Dify version

0.6.16

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I deployed ChatTTS using Xinference and enabled autoplay in Dify. On the agent's debugging page, short sentences autoplay correctly. However, when the generated text is long, the returned TTS message occasionally does not contain the audio for all of the text content; it may cover only the first two sentences.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 1 month ago

The issue you're experiencing with the text-to-speech (TTS) functionality in your self-hosted Dify 0.6.16 is likely due to the way the TTS model handles long texts. The XinferenceText2SpeechModel class in tts.py has a _tts_invoke_streaming method that processes the text for TTS. This method enforces a word limit and splits the text into sentences when the text exceeds that limit.

To resolve this issue, you can adjust the word limit or modify the way the text is split and processed. Here is the relevant part of the code:

class XinferenceText2SpeechModel(TTSModel):

    def _tts_invoke_streaming(self, model: str, credentials: dict, content_text: str, voice: str) -> any:
        if credentials['server_url'].endswith('/'):
            credentials['server_url'] = credentials['server_url'][:-1]

        try:
            handle = RESTfulAudioModelHandle(credentials['model_uid'], credentials['server_url'], auth_headers={})
            model_support_voice = [x.get("value") for x in self.get_tts_model_voices(model=model, credentials=credentials)]
            if not voice or voice not in model_support_voice:
                voice = self._get_model_default_voice(model, credentials)
            word_limit = self._get_model_word_limit(model, credentials)
            # Text longer than the model's word limit is split into sentences
            # and synthesized concurrently, one request per sentence.
            if len(content_text) > word_limit:
                sentences = self._split_text_into_sentences(content_text, max_length=word_limit)
                executor = concurrent.futures.ThreadPoolExecutor(max_workers=min(3, len(sentences)))
                futures = [executor.submit(
                    handle.speech,
                    input=sentences[i],
                    voice=voice,
                    response_format="mp3",
                    speed=1.0,
                    stream=False
                ) for i in range(len(sentences))]

                # Yield each sentence's audio in order, in 1024-byte chunks.
                for index, future in enumerate(futures):
                    response = future.result()
                    for i in range(0, len(response), 1024):
                        yield response[i:i + 1024]
            else:
                response = handle.speech(
                    input=content_text.strip(),
                    voice=voice,
                    response_format="mp3",
                    speed=1.0,
                    stream=False
                )

                for i in range(0, len(response), 1024):
                    yield response[i:i + 1024]
        except Exception as ex:
            raise InvokeBadRequestError(str(ex))

To accommodate longer texts, you can increase the word limit or improve the text splitting logic. For example, you can set a higher word limit:

def _get_model_word_limit(self, model: str, credentials: dict) -> int:
    return 5000  # Increased word limit

Additionally, ensure that the text splitting logic in _split_text_into_sentences is efficient and handles longer texts appropriately. This change should allow the TTS functionality to process and return more complete audio messages for longer texts [1][2].
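For illustration, here is a minimal sketch of a length-aware sentence splitter. The function name, the punctuation set, and the splitting rules are assumptions for this example and will not match Dify's _split_text_into_sentences exactly, so treat it only as a starting point for tuning:

import re

def split_text_into_sentences(content_text: str, max_length: int) -> list[str]:
    # Split on sentence-ending punctuation (Latin and CJK), keeping the delimiter.
    sentences = re.split(r'(?<=[.!?。！？])\s*', content_text.strip())
    chunks: list[str] = []
    current = ''
    for sentence in sentences:
        if not sentence:
            continue
        # Start a new chunk once adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_length:
            chunks.append(current)
            current = sentence
        else:
            current = current + sentence if current else sentence
    if current:
        chunks.append(current)
    return chunks

Splitting on sentence boundaries rather than cutting at a hard character offset avoids handing the TTS backend fragments that end mid-word, which some models render poorly.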

To continue talking to Dosu, mention @dosu.

401557122 commented 1 month ago

It doesn't seem like this is the problem, because the length of my text is not greater than the limit. When I checked the backend, I found that the text output by the model is divided into many batches, TTS is called for each batch, and each call returns a result. However, the frontend only plays the audio of the first batch before tts_message_end arrives. I suspect this may be caused by the multiple-batch processing, or by a timeout that forcibly terminates the TTS transmission. In addition, when I do not enable autoplay and instead use the text-to-audio interface, the entire text can be played in full. @dosu

dosubot[bot] commented 1 month ago

The issue you're describing seems to be related to how the audio messages are being handled and yielded in the _wrapper_process_stream_response method. To ensure that the entire text is processed and played correctly in autoplay mode, you need to make sure that the audio messages are continuously checked and yielded until the status is "finish".

Here is the relevant part of the code that handles the audio messages:

def _wrapper_process_stream_response(self, trace_manager: Optional[TraceQueueManager] = None) -> Generator[StreamResponse, None, None]:
    tenant_id = self._application_generate_entity.app_config.tenant_id
    task_id = self._application_generate_entity.task_id
    publisher = None
    text_to_speech_dict = self._app_config.app_model_config_dict.get('text_to_speech')
    if text_to_speech_dict and text_to_speech_dict.get('autoPlay') == 'enabled' and text_to_speech_dict.get('enabled'):
        publisher = AppGeneratorTTSPublisher(tenant_id, text_to_speech_dict.get('voice', None))
    for response in self._process_stream_response(publisher=publisher, trace_manager=trace_manager):
        while True:
            audio_response = self._listenAudioMsg(publisher, task_id)
            if audio_response:
                yield audio_response
            else:
                break
        yield response

    start_listener_time = time.time()
    # timeout
    while (time.time() - start_listener_time) < TTS_AUTO_PLAY_TIMEOUT:
        if publisher is None:
            break
        audio = publisher.checkAndGetAudio()
        if audio is None:
            # release cpu
            # sleep 20 ms (40 ms => 1280-byte audio file, 20 ms => 640-byte audio file)
            time.sleep(TTS_AUTO_PLAY_YIELD_CPU_TIME)
            continue
        if audio.status == "finish":
            break
        else:
            start_listener_time = time.time()
            yield MessageAudioStreamResponse(audio=audio.audio, task_id=task_id)
    yield MessageAudioEndStreamResponse(audio='', task_id=task_id)

To ensure the entire text is processed and played correctly, you should verify the following:

  1. Continuous Audio Chunk Handling: Ensure that the loop continues to check and yield audio chunks until the status is "finish".
  2. Timeout Handling: Make sure the timeout is sufficient to process all audio chunks.
  3. Error Handling: Ensure that any exceptions are logged and handled properly without prematurely breaking the loop (see the sketch after this list).
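As a rough, self-contained sketch of points 2 and 3, the loop below refreshes its deadline whenever a chunk arrives and logs polling failures instead of letting them end the stream. The publisher argument and the two constants are stand-ins for Dify's AppGeneratorTTSPublisher and the TTS_AUTO_PLAY_TIMEOUT / TTS_AUTO_PLAY_YIELD_CPU_TIME values, so adapt the names to the actual code:

import logging
import time

logger = logging.getLogger(__name__)

# Stand-ins for Dify's TTS_AUTO_PLAY_TIMEOUT / TTS_AUTO_PLAY_YIELD_CPU_TIME constants.
LISTENER_TIMEOUT_SECONDS = 30.0
POLL_INTERVAL_SECONDS = 0.02

def drain_audio(publisher, task_id):
    """Yield audio chunks until the publisher reports 'finish' or the deadline passes."""
    deadline = time.time() + LISTENER_TIMEOUT_SECONDS
    while time.time() < deadline:
        try:
            audio = publisher.checkAndGetAudio()
        except Exception:
            # Log and keep listening instead of aborting the whole generator.
            logger.exception('TTS listener failed while polling for audio (task %s)', task_id)
            time.sleep(POLL_INTERVAL_SECONDS)
            continue
        if audio is None:
            time.sleep(POLL_INTERVAL_SECONDS)  # release the CPU between polls
            continue
        if audio.status == 'finish':
            break
        # A chunk arrived, so push the deadline out again before yielding it.
        deadline = time.time() + LISTENER_TIMEOUT_SECONDS
        yield audio.audio

The existing loop already refreshes its timer when a chunk arrives; the main addition here is that polling errors are logged rather than allowed to terminate the generator before the remaining batches are played.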

Additionally, the frontend implementation in audio.ts should be reviewed to ensure it correctly handles the streaming audio data. Here are some key points from the AudioPlayer class in audio.ts:

  1. MediaSource Initialization: Ensure that the MediaSource is correctly initialized and compatible with the browser.
  2. Audio Data Handling: The loadAudio method reads the audio stream and appends the data to the sourceBuffer. Ensure that this method correctly handles the streaming data and appends it to the buffer.
  3. Buffer Management: The receiveAudioData method manages the sourceBuffer and cacheBuffers. Ensure that this method correctly appends the audio data to the buffer and handles the end of the stream properly.

Here is a snippet from audio.ts that handles the audio streaming:

private async loadAudio() {
  try {
    const audioResponse: any = await textToAudioStream(this.url, this.isPublic, { content_type: 'audio/mpeg' }, {
      message_id: this.msgId,
      streaming: true,
      voice: this.voice,
      text: this.msgContent,
    });

    if (audioResponse.status !== 200) {
      this.isLoadData = false;
      if (this.callback)
        this.callback('error');
    }

    const reader = audioResponse.body.getReader();
    while (true) {
      const { value, done } = await reader.read();

      if (done) {
        this.receiveAudioData(value);
        break;
      }

      this.receiveAudioData(value);
    }
  } catch (error) {
    this.isLoadData = false;
    this.callback && this.callback('error');
  }
}

private receiveAudioData(unit8Array: Uint8Array) {
  if (!unit8Array) {
    this.finishStream();
    return;
  }
  const audioData = this.byteArrayToArrayBuffer(unit8Array);
  if (!audioData.byteLength) {
    if (this.mediaSource?.readyState === 'open')
      this.finishStream();
    return;
  }

  if (this.sourceBuffer?.updating) {
    this.cacheBuffers.push(audioData);
  } else {
    if (this.cacheBuffers.length && !this.sourceBuffer?.updating) {
      this.cacheBuffers.push(audioData);
      const cacheBuffer = this.cacheBuffers.shift()!;
      this.sourceBuffer?.appendBuffer(cacheBuffer);
    } else {
      this.sourceBuffer?.appendBuffer(audioData);
    }
  }
}

Ensure that the loadAudio and receiveAudioData methods are correctly handling the streaming data and appending it to the sourceBuffer. This will help in processing and playing the entire text correctly in autoplay mode [1][2][3][4].

To continue talking to Dosu, mention @dosu.