huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0
3.09k stars, 326 forks

Voice Breaks and Latency in Text-to-Speech Conversion #108

Open Devloper-RG opened 1 week ago

Devloper-RG commented 1 week ago

I'm experiencing issues with breaks in the generated voice output, seemingly caused by latency in the text-to-speech (TTS) conversion process. The audio output has occasional breaks, which disrupt the flow of speech.

Steps I've tried:

  1. Decreasing block size: This helped reduce some latency in delivering TTS audio output, but the issue persists.
  2. Adjusting play_steps_s: I've decreased this parameter to minimize latency. However, setting play_steps_s below 0.5 causes errors, so I’ve kept it at 0.5 for now.

Any suggestions on how to further reduce the latency and improve the smoothness of the audio output would be greatly appreciated.

eustlb commented 1 week ago

Hey @Devloper-RG, on what device are you running the pipeline?

Devloper-RG commented 1 week ago

@eustlb I'm running the server on a Google Cloud Platform (GCP) VM with 2 NVIDIA T4 GPUs, and the client is on my local machine.

eustlb commented 1 week ago

I've never tried this setup, but there are two possible causes of choppy audio:

  1. The connection between the server and the client is not fast enough (see this related issue) → we are working on switching from TCP to UDP to have faster audio packet transfer.
  2. A T4 might not be fast enough to generate 43 tokens (play_steps_s of 0.5) and run DAC decoding in less than 0.5 seconds (the duration of the audio chunk) → we are working on enabling more performant torch compile modes (i.e. the ones that capture CUDA graphs: reduce-overhead and max-autotune) with Parler-TTS + streaming, which could make it work on a T4. In the meantime, you can try increasing play_steps_s → you'll increase latency, but you'll also reduce the number of DAC decoding steps.
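To make the second possibility concrete, here is a rough sketch of the real-time budget. It assumes a token rate of about 86 tokens per second of audio, which is inferred only from the "43 tokens per 0.5 s" figure above, and it models each chunk as a per-token generation cost plus a fixed decoding overhead. The function names and all timings are hypothetical placeholders, not measurements from a T4.

```python
# Toy real-time budget for streamed TTS. All timings are hypothetical
# placeholders, not measurements from any particular GPU.

FRAME_RATE_HZ = 86.0  # assumption: inferred from "43 tokens" per 0.5 s chunk


def tokens_per_chunk(play_steps_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of audio tokens that must be generated for one chunk."""
    return int(frame_rate_hz * play_steps_s)


def real_time_factor(play_steps_s: float, per_token_s: float, decode_overhead_s: float) -> float:
    """Compute time per chunk divided by chunk duration.

    A value above 1.0 means the chunk takes longer to produce than to play,
    so the listener hears gaps.
    """
    n_tokens = tokens_per_chunk(play_steps_s)
    compute_s = n_tokens * per_token_s + decode_overhead_s
    return compute_s / play_steps_s


if __name__ == "__main__":
    for play_steps_s in (0.5, 1.0, 2.0):
        rtf = real_time_factor(play_steps_s, per_token_s=0.010, decode_overhead_s=0.10)
        status = "keeps up" if rtf < 1.0 else "falls behind"
        n = tokens_per_chunk(play_steps_s)
        print(f"play_steps_s={play_steps_s}: {n} tokens, RTF={rtf:.2f} ({status})")
```

With these made-up numbers, a 0.5 s chunk falls behind while a 1.0 s chunk keeps up, because the fixed decoding cost is paid half as often per second of audio. The price is that the first chunk takes longer to arrive.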

To narrow it down, you can switch from Parler-TTS to MeloTTS by setting the --tts melo flag. You'll lose text-to-speech generation streaming, which will increase latency, but this also rules out possibility 2 as the cause of the choppy audio:

  1. If you still experience choppy audio, reason 1 given above is responsible.
  2. If not, then Parler-TTS is responsible. Increase play_steps_s until you no longer experience choppy audio.
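Whichever cause is at play, the symptom on the client is the same: a chunk arrives after the previous one has finished playing. A minimal sketch for counting such gaps from chunk arrival timestamps, assuming you log those timestamps yourself on the client (the function name and the logging are hypothetical, not part of listen_and_play.py):

```python
from typing import Iterable


def count_underruns(arrival_times_s: Iterable[float], chunk_s: float) -> int:
    """Count how often playback runs dry before the next chunk arrives.

    arrival_times_s: time (in seconds) at which each audio chunk reached the client.
    chunk_s: playback duration of one chunk.
    """
    underruns = 0
    playback_end = None  # time at which all audio buffered so far has been played
    for t in arrival_times_s:
        if playback_end is None:
            # First chunk: playback starts as soon as it arrives.
            playback_end = t + chunk_s
        elif t > playback_end:
            # The buffer was empty before this chunk arrived: audible gap.
            underruns += 1
            playback_end = t + chunk_s
        else:
            # Chunk arrived in time and is queued behind the buffered audio.
            playback_end += chunk_s
    return underruns


if __name__ == "__main__":
    on_time = [0.0, 0.4, 0.8, 1.2]  # one chunk every 0.4 s
    late = [0.0, 0.6, 1.2, 1.8]     # one chunk every 0.6 s
    print(count_underruns(on_time, chunk_s=0.5))
    print(count_underruns(late, chunk_s=0.5))
```

Running the same measurement with --tts melo and with Parler-TTS gives a number to compare instead of relying on listening alone.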

Also, can you give me the command you're running?

andimarafioti commented 1 week ago

Also, beware that I don't think the code uses multiple GPUs yet, so 2 T4s is the same as 1.

Devloper-RG commented 1 week ago

@eustlb I'll implement the solutions you suggested and will update you if they work or if I find the underlying issue. As requested, here are the terminal commands I used:

Client side: python listen_and_play.py --host <IP address>

Server side: python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0