huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0
3.5k stars · 365 forks

Latency Optimization for Speech-to-Speech Pipeline #107

Open yatharthk2 opened 1 month ago

yatharthk2 commented 1 month ago

Hi,

I am currently running the speech-to-speech pipeline on an AWS EC2 instance (Ubuntu 20.04) with an Nvidia A10g GPU. The pipeline works well, but I am experiencing around 1 second of latency, and I am particularly interested in improving the latency of the entire speech-to-speech pipeline, especially the Text to Speech (TTS) part.

Current Setup:
- EC2 instance: NVIDIA A10G GPU, 24 GB GPU RAM
- OS: Ubuntu 20.04
- GPU driver: NVIDIA-SMI 470.141.03, CUDA 12.2
- Pipeline: standard setup from your repo
- STT model: Whisper large-v2
- TTS model: Parler-TTS (default)

Problem: I’m currently facing around 1 second of latency for the entire pipeline from speech input to speech output. While the STT part works fairly well, the TTS step seems to contribute most to the latency. I would greatly appreciate any suggestions or guidance on reducing the overall latency, particularly for TTS.
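To confirm which stage actually dominates, it can help to time each stage separately before tuning anything. Below is a minimal per-stage timing sketch; the `run_stt` / `run_llm` / `run_tts` functions are hypothetical stand-ins for the pipeline's real handlers, not the repo's API:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Accumulate wall-clock time spent in each pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Hypothetical stand-ins for the real STT/LLM/TTS handlers.
def run_stt(audio): time.sleep(0.01); return "transcript"
def run_llm(text): time.sleep(0.01); return "reply"
def run_tts(text): time.sleep(0.01); return b"audio"

with timed("stt"):
    text = run_stt(b"...")
with timed("llm"):
    reply = run_llm(text)
with timed("tts"):
    audio = run_tts(reply)

# Print stages sorted by time spent, slowest first.
for stage, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {t * 1000:.1f} ms")
```

Wrapping the existing handlers this way makes it easy to see whether TTS really accounts for most of the ~1 s, or whether the delay is spread across stages.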

Thanks!

sandorkonya commented 1 month ago

Take a look at this. The proposed method speeds up the TTS part. It is also mentioned there that 500 ms of silence is appended after the last chunk, which means there is a fixed 500 ms delay before the LLM --> TTS steps can even begin.
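That fixed tail padding matters because it sits in front of everything downstream. A quick latency-budget sketch (all numbers here are illustrative examples, not measurements from this pipeline) shows how a 500 ms VAD tail alone can account for half of a ~1 s end-to-end delay:

```python
# Illustrative latency budget in milliseconds (example values only).
# The VAD tail padding is a fixed cost paid before STT/LLM/TTS start.
vad_tail_ms = 500         # silence appended after the last speech chunk
stt_ms = 150              # example: STT time for the utterance
llm_first_token_ms = 120  # example: LLM time to first token
tts_first_audio_ms = 230  # example: TTS time to first audio chunk

total_to_first_audio = (
    vad_tail_ms + stt_ms + llm_first_token_ms + tts_first_audio_ms
)
print(total_to_first_audio, "ms")
```

With these example numbers the budget sums to 1000 ms, which is why shrinking the VAD tail (if your use case tolerates earlier end-of-speech detection) can be as impactful as optimizing the models themselves.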

yatharthk2 commented 1 month ago

Thank you for sharing the link and your discussions with the author. I understand the role of the Whisper streamer in accelerating the text-to-speech handoff. I also recognize the 500 ms latency in Parler-TTS, but I don't believe I am achieving that latency. Is there any way I can optimize the Parler-TTS setup to reach the 500 ms target?

yatharthk2 commented 1 month ago

I am trying to make this pipeline really fast. I tried integrating StyleTTS; it seems streaming is not compatible with StyleTTS as of now. How would you approach the latency optimization?
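If a TTS model has no token-level streaming, a common workaround is sentence-level chunking: split the LLM's streamed output into sentences and synthesize each one as soon as it is complete, so audio for the first sentence plays while the rest of the reply is still being generated. A minimal sketch, where `synthesize` is a hypothetical stand-in for a non-streaming TTS call (e.g. StyleTTS):

```python
import re
from typing import Iterator, List

def sentences(token_stream: Iterator[str]) -> Iterator[str]:
    # Accumulate streamed LLM tokens and yield each sentence as soon
    # as terminal punctuation appears, instead of waiting for the
    # full reply.
    buf = ""
    for tok in token_stream:
        buf += tok
        while True:
            m = re.search(r"[.!?](\s|$)", buf)
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()

# Hypothetical stand-in for a non-streaming TTS call.
def synthesize(sentence: str) -> bytes:
    return sentence.encode()

def speak(token_stream: Iterator[str]) -> List[bytes]:
    chunks = []
    for sent in sentences(token_stream):
        # Each sentence can be synthesized (and played) while the LLM
        # is still generating the remainder of the reply.
        chunks.append(synthesize(sent))
    return chunks
```

This does not reduce total synthesis time, but it can cut time-to-first-audio to roughly the cost of synthesizing one sentence, which is often what matters for perceived latency.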