Closed 401557122 closed 1 month ago
The issue you're experiencing with text-to-speech (TTS) in your self-hosted Dify 0.6.16 is likely due to the way the TTS model handles long texts. The `XinferenceText2SpeechModel` class in `tts.py` has a `_tts_invoke_streaming` method that processes the text for TTS. This method enforces a word limit and splits the text into sentences when it exceeds that limit.

To resolve this issue, you can adjust the word limit or modify the way the text is split and processed. Here is the relevant part of the code:
```python
class XinferenceText2SpeechModel(TTSModel):
    def _tts_invoke_streaming(self, model: str, credentials: dict, content_text: str, voice: str) -> any:
        if credentials['server_url'].endswith('/'):
            credentials['server_url'] = credentials['server_url'][:-1]

        try:
            handle = RESTfulAudioModelHandle(credentials['model_uid'], credentials['server_url'], auth_headers={})

            model_support_voice = [x.get("value") for x in
                                   self.get_tts_model_voices(model=model, credentials=credentials)]
            if not voice or voice not in model_support_voice:
                voice = self._get_model_default_voice(model, credentials)
            word_limit = self._get_model_word_limit(model, credentials)
            if len(content_text) > word_limit:
                sentences = self._split_text_into_sentences(content_text, max_length=word_limit)
                executor = concurrent.futures.ThreadPoolExecutor(max_workers=min(3, len(sentences)))
                futures = [executor.submit(
                    handle.speech,
                    input=sentences[i],
                    voice=voice,
                    response_format="mp3",
                    speed=1.0,
                    stream=False
                ) for i in range(len(sentences))]

                for index, future in enumerate(futures):
                    response = future.result()
                    for i in range(0, len(response), 1024):
                        yield response[i:i + 1024]
            else:
                response = handle.speech(
                    input=content_text.strip(),
                    voice=voice,
                    response_format="mp3",
                    speed=1.0,
                    stream=False
                )

                for i in range(0, len(response), 1024):
                    yield response[i:i + 1024]
        except Exception as ex:
            raise InvokeBadRequestError(str(ex))
```
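The 1024-byte chunking used in that method can be illustrated in isolation. The `chunk_bytes` helper below is a hypothetical standalone sketch, not part of Dify's code:

```python
def chunk_bytes(data: bytes, size: int = 1024):
    """Yield successive fixed-size chunks of a bytes payload,
    mirroring how the audio response is streamed above."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

# A 2500-byte payload is split into chunks of 1024, 1024, and 452 bytes;
# concatenating the chunks reproduces the original payload.
chunks = list(chunk_bytes(b"\x00" * 2500))
```

Because the final chunk is simply whatever remains, no padding is added and the concatenated stream is byte-identical to the source audio.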
To accommodate longer texts, you can increase the word limit or improve the text splitting logic. For example, you can set a higher word limit:
```python
def _get_model_word_limit(self, model: str, credentials: dict) -> int:
    return 5000  # Increased word limit
```
Additionally, ensure that the text-splitting logic in `_split_text_into_sentences` is efficient and handles longer texts appropriately. This change should allow the TTS functionality to process and return complete audio for longer texts [1][2].
To continue talking to Dosu, mention @dosu.
It doesn't seem like this is the problem, because my text is not longer than the limit. When I checked the backend, I found that the text output by the model is divided into many batches, TTS is called for each batch, and each call returns a result. However, the frontend only receives `tts_message_end` after the audio encoding of the first batch. I suspect this may be due to the multiple-batch processing, or to a timeout that forcibly terminates the TTS transmission. Also, when I do not enable autoplay and use the text-to-audio interface instead, the entire text plays in full. @dosu
The issue you're describing seems to be related to how the audio messages are handled and yielded in the `_wrapper_process_stream_response` method. To ensure that the entire text is processed and played correctly in autoplay mode, you need to make sure that the audio messages are continuously checked and yielded until the status is "finish".
Here is the relevant part of the code that handles the audio messages:
```python
def _wrapper_process_stream_response(self, trace_manager: Optional[TraceQueueManager] = None) \
        -> Generator[StreamResponse, None, None]:
    tenant_id = self._application_generate_entity.app_config.tenant_id
    task_id = self._application_generate_entity.task_id
    publisher = None
    text_to_speech_dict = self._app_config.app_model_config_dict.get('text_to_speech')
    if text_to_speech_dict and text_to_speech_dict.get('autoPlay') == 'enabled' and text_to_speech_dict.get('enabled'):
        publisher = AppGeneratorTTSPublisher(tenant_id, text_to_speech_dict.get('voice', None))
    for response in self._process_stream_response(publisher=publisher, trace_manager=trace_manager):
        while True:
            audio_response = self._listenAudioMsg(publisher, task_id)
            if audio_response:
                yield audio_response
            else:
                break
        yield response

    start_listener_time = time.time()
    # timeout
    while (time.time() - start_listener_time) < TTS_AUTO_PLAY_TIMEOUT:
        if publisher is None:
            break
        audio = publisher.checkAndGetAudio()
        if audio is None:
            # release cpu
            # sleep 20 ms (40 ms => 1280-byte audio file, 20 ms => 640-byte audio file)
            time.sleep(TTS_AUTO_PLAY_YIELD_CPU_TIME)
            continue
        if audio.status == "finish":
            break
        else:
            start_listener_time = time.time()
            yield MessageAudioStreamResponse(audio=audio.audio, task_id=task_id)
    yield MessageAudioEndStreamResponse(audio='', task_id=task_id)
```
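Note how the loop resets `start_listener_time` each time an audio chunk arrives, so the stream should only end when no audio appears for a full `TTS_AUTO_PLAY_TIMEOUT` window, or when a "finish" status is seen. That behavior can be simulated with a toy publisher; everything below is a hypothetical stand-in, not Dify code:

```python
import time

TTS_AUTO_PLAY_TIMEOUT = 0.5          # seconds; Dify's real value is larger
TTS_AUTO_PLAY_YIELD_CPU_TIME = 0.01  # polling sleep between checks

class FakePublisher:
    """Toy stand-in for AppGeneratorTTSPublisher: hands out queued
    audio chunks in order, then a 'finish' marker, then nothing."""
    def __init__(self, chunks):
        self.items = [('audio', c) for c in chunks] + [('finish', b'')]

    def check_and_get_audio(self):
        return self.items.pop(0) if self.items else None

def drain_audio(publisher):
    """Mirror of the listener loop above: collect chunks until a
    'finish' status, or until no audio arrives within the timeout."""
    received = []
    start = time.time()
    while (time.time() - start) < TTS_AUTO_PLAY_TIMEOUT:
        item = publisher.check_and_get_audio()
        if item is None:
            time.sleep(TTS_AUTO_PLAY_YIELD_CPU_TIME)
            continue
        status, audio = item
        if status == 'finish':
            break
        start = time.time()  # reset: audio arrived, keep listening
        received.append(audio)
    return received
```

If the backend produces batches faster than the timeout window, all of them are collected; truncation would only occur if a gap between batches exceeds the timeout, which is one place worth instrumenting when debugging this report.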
To ensure the entire text is processed and played correctly, verify that this loop keeps receiving and yielding every batch, and only exits on a "finish" status or a genuine timeout.
Additionally, the frontend implementation in `audio.ts` should be reviewed to ensure it correctly handles the streaming audio data. Here are some key points from the `AudioPlayer` class in `audio.ts`:

- Ensure that `MediaSource` is correctly initialized and compatible with the browser.
- The `loadAudio` method reads the audio stream and appends the data to the `sourceBuffer`. Ensure that this method correctly handles the streaming data and appends it to the buffer.
- The `receiveAudioData` method manages the `sourceBuffer` and `cacheBuffers`. Ensure that this method correctly appends the audio data to the buffer and handles the end of the stream properly.

Here is a snippet from `audio.ts` that handles the audio streaming:
```typescript
private async loadAudio() {
  try {
    const audioResponse: any = await textToAudioStream(this.url, this.isPublic, { content_type: 'audio/mpeg' }, {
      message_id: this.msgId,
      streaming: true,
      voice: this.voice,
      text: this.msgContent,
    });

    if (audioResponse.status !== 200) {
      this.isLoadData = false;
      if (this.callback)
        this.callback('error');
    }

    const reader = audioResponse.body.getReader();
    while (true) {
      const { value, done } = await reader.read();
      if (done) {
        this.receiveAudioData(value);
        break;
      }
      this.receiveAudioData(value);
    }
  }
  catch (error) {
    this.isLoadData = false;
    this.callback && this.callback('error');
  }
}

private receiveAudioData(unit8Array: Uint8Array) {
  if (!unit8Array) {
    this.finishStream();
    return;
  }
  const audioData = this.byteArrayToArrayBuffer(unit8Array);
  if (!audioData.byteLength) {
    if (this.mediaSource?.readyState === 'open')
      this.finishStream();
    return;
  }

  if (this.sourceBuffer?.updating) {
    this.cacheBuffers.push(audioData);
  }
  else {
    if (this.cacheBuffers.length && !this.sourceBuffer?.updating) {
      this.cacheBuffers.push(audioData);
      const cacheBuffer = this.cacheBuffers.shift()!;
      this.sourceBuffer?.appendBuffer(cacheBuffer);
    }
    else {
      this.sourceBuffer?.appendBuffer(audioData);
    }
  }
}
```
Ensure that the `loadAudio` and `receiveAudioData` methods correctly handle the streaming data and append it to the `sourceBuffer`. This will help in processing and playing the entire text correctly in autoplay mode [1][2][3][4].
Self Checks
Dify version
0.6.16
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I deployed ChatTTS using Xinference and enabled autoplay in Dify. On the agent's debugging page, short sentences autoplay correctly. However, when the generated text is long, the returned TTS message occasionally does not contain the audio for all of the text content, but only for the first two sentences.
✔️ Expected Behavior
No response
❌ Actual Behavior
No response