huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0

Audio comes out choppy #24

Open AznamirWoW opened 2 months ago

AznamirWoW commented 2 months ago

Is there any way to improve the audio output?

USER: Hello?
ASSISTANT: Hello there!
2024-08-20 19:17:38,156 - main - INFO - Time to first audio: 1.694
ASSISTANT: How can I assist you today?

Audio: H-ow-ca-nai-a-ssi-st-yo-u-to-day?

(Using the ZLUDA emulator with an RX 6700 XT GPU; I had to downgrade torch to 2.3.0.)

andimarafioti commented 2 months ago

That happens if your GPU is not fast enough for the Parler generation. Sadly, we tested this with quite a beefy setup and optimized a bit for that. On the mps support branch, I included support for MeloTTS. You need to install it manually, but there are instructions on the branch; then you can pass '--tts melo' to the generation script and use it instead. It's quite a bit smaller, so the audio should not be choppy anymore. Let me know if you try it and how it works!

rs545837 commented 2 months ago

Yup, it is choppy, even when I used an H100 on RunPod. I was able to reduce it a bit by increasing the value of play_steps_s to 1.0; that should help.
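One possible explanation for why a larger play_steps_s helps is that any fixed per-chunk cost gets amortized over more audio. The toy model below is only a sketch of that idea, not the project's streaming code: the function name, the real-time factor (`rtf`), and the per-chunk overhead are all made-up assumptions.

```python
def count_gaps(total_audio_s, play_steps_s, rtf, overhead_s):
    """Count playback gaps in a toy streaming model (all numbers hypothetical).

    Each chunk holds play_steps_s seconds of audio and is assumed to take
    play_steps_s * rtf + overhead_s seconds to generate and deliver.
    """
    n_chunks = round(total_audio_s / play_steps_s)
    gen_time = play_steps_s * rtf + overhead_s
    ready_at = 0.0   # wall-clock time at which the current chunk is available
    playhead = 0.0   # wall-clock time at which the speaker runs out of audio
    gaps = 0
    for i in range(n_chunks):
        ready_at += gen_time
        # After the first chunk, a gap occurs when the next chunk is late.
        if i > 0 and ready_at > playhead + 1e-9:
            gaps += 1
        start = max(ready_at, playhead)
        playhead = start + play_steps_s
    return gaps


if __name__ == "__main__":
    for steps in (0.3, 1.0):
        print(steps, count_gaps(6.0, steps, rtf=0.8, overhead_s=0.1))
```

With these assumed numbers, 0.3 s chunks arrive late every time, while 1.0 s chunks stay ahead of playback. The cost is a longer wait before the first audio.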

andimarafioti commented 2 months ago

I forgot to mention that can also help. It was also choppy on an H100? That's unexpected! Maybe the connection latency is a bit high there? We tested with a 4090 but it was in our network.

rs545837 commented 2 months ago

Yeah, running within the same network avoids the network latency, and that makes a huge difference.

jasonngap1 commented 2 months ago

> I forgot to mention that can also help. It was also choppy on an H100? That's unexpected! Maybe the connection latency is a bit high there? We tested with a 4090 but it was in our network.

I also tested on an RTX 4090 and the audio came out choppy. I checked the GPU usage and it stays below the card's maximum.

AznamirWoW commented 2 months ago

> Yup, it is choppy, even when I used an H100 on RunPod. I was able to reduce it a bit by increasing the value of play_steps_s to 1.0; that should help.

That did help remove most of the choppiness, though some remains. I have not noticed much load on the GPU; the 6700 XT handles it perfectly fine.

andimarafioti commented 2 months ago

@jasonngap1, that's interesting. Could you share more details of your setup? If I can figure out what might be happening, I could fix it :)

codearranger commented 2 months ago

I think the issue could be because of the use of TCP. UDP/RTP streams would be ideal for audio streams.

I am having the same issues running on a 36-core Xeon with an RTX 4090 and 256 GB of system RAM.

rs545837 commented 2 months ago

Yes, exactly: using TCP does cause latency issues, and UDP is a better option; using WebRTC might be a solution. Also, the lag is specifically in the transmission of the TTS-generated audio.

jasonngap1 commented 2 months ago

> I think the issue could be because of the use of TCP. UDP/RTP streams would be ideal for audio streams.
>
> I am having the same issues running on a 36-core Xeon with an RTX 4090 and 256 GB of system RAM.

My system specs are similar to yours. The strange thing is that when I run it under Windows, the audio comes out fine. There was a lot of hallucination in the STT transcription, but that's another issue.

andimarafioti commented 2 months ago

For the STT transcription, I found some systems work better than others. For me, it works really well with the laptop, but with my AirPods it creates a lot of hallucinations.

rs545837 commented 2 months ago

Yeah, we might also need a more robust VAD model; sometimes it even misses the first few spoken words. I also tried changing the VAD sensitivity parameters, but it didn't make much of a difference.

andimarafioti commented 2 months ago

I changed the default speech_pad_ms to 500 ms to try to get Whisper to still transcribe the audio even when the VAD fails.
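As an illustration of what padding a detected segment generally means, here is a minimal sketch. It assumes the pad is applied on both sides of the detected span and clamped to the recording; `pad_segment` and its arguments are hypothetical names, not the repository's VAD code.

```python
def pad_segment(start_ms, end_ms, speech_pad_ms, audio_len_ms):
    """Widen a VAD-detected speech span by speech_pad_ms on each side.

    Hypothetical helper: the result is clamped so it never extends
    before the start or past the end of the recording.
    """
    padded_start = max(0, start_ms - speech_pad_ms)
    padded_end = min(audio_len_ms, end_ms + speech_pad_ms)
    return padded_start, padded_end


if __name__ == "__main__":
    # VAD only detected speech from 300 ms onward; a 500 ms pad recovers
    # the beginning of the recording that it missed.
    print(pad_segment(300, 1200, 500, 5000))
```

A wider pad gives the transcriber more context around a late or early VAD decision, at the price of passing along more non-speech audio.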

rs545837 commented 2 months ago

Btw, here are some of my learnings to share:

Switching to a socket-based server using UDP could be a great way to remove audio choppiness and reduce latency. However, packet handling needs to be robust: I got some noise and choppiness when transferring a large WAV file. I think transferring very small chunks and stitching them together on the client side could be a good way to eliminate the noise.

With UDP, I am now able to receive 5 seconds of audio (about 80 characters) in about 0.6 seconds, which includes the time to send the text, generate the speech, and get the audio back.
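The chunk-and-stitch idea above can be sketched roughly as follows. This is only an illustration: the 4-byte sequence-number header and the function names are assumptions, and there is no actual socket code here.

```python
import struct

# Hypothetical packet layout: 4-byte big-endian sequence number, then audio bytes.
HEADER = struct.Struct("!I")


def split_into_packets(pcm, chunk_bytes):
    """Cut raw audio bytes into small packets, each tagged with its position."""
    return [
        HEADER.pack(seq) + pcm[offset:offset + chunk_bytes]
        for seq, offset in enumerate(range(0, len(pcm), chunk_bytes))
    ]


def stitch(packets):
    """Rebuild the audio on the client, whatever order the packets arrived in."""
    parts = {}
    for packet in packets:
        (seq,) = HEADER.unpack_from(packet)
        parts[seq] = packet[HEADER.size:]   # a duplicate simply overwrites itself
    return b"".join(parts[seq] for seq in sorted(parts))


if __name__ == "__main__":
    audio = bytes(range(256)) * 10
    packets = split_into_packets(audio, chunk_bytes=640)
    packets.reverse()                        # simulate out-of-order delivery
    print(stitch(packets) == audio)
```

Note that a packet that never arrives is silently skipped here, which shortens the audio; a real client would need to decide whether to insert silence or request a resend.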

andimarafioti commented 2 months ago

I didn't look at the socket code too much. Is it TCP by default? Can we change it to UDP?

codearranger commented 2 months ago

Typically with VoIP we use RTP, which runs over UDP, and the payload is usually 20 ms of audio sent every 20 ms. Because it's UDP, packets can arrive out of order, so you need a jitter buffer on the receiving end to reorder them. Each packet carries a sequence number and a timestamp, among a few other things.
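A minimal sketch of the jitter-buffer idea described above, under stated assumptions: 16 kHz 16-bit mono PCM (the thread does not give the audio format), sequence numbers that start at 0 and never wrap (real RTP sequence numbers are 16-bit and do wrap), and a hypothetical `JitterBuffer` class rather than any existing RTP library.

```python
SAMPLE_RATE = 16000       # assumed sample rate, not stated in the thread
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
FRAME_MS = 20             # one packet carries 20 ms of audio
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE


class JitterBuffer:
    """Reorder packets by sequence number before playback (simplified)."""

    def __init__(self, max_wait_frames=3):
        self.pending = {}                    # seq -> payload, not yet played
        self.next_seq = 0                    # sequence number to play next
        self.max_wait_frames = max_wait_frames

    def push(self, seq, payload):
        # Packets older than the playhead arrived too late and are dropped.
        if seq >= self.next_seq:
            self.pending[seq] = payload

    def pop(self):
        """Return the next frame, silence for a lost frame, or None to keep waiting."""
        if self.next_seq in self.pending:
            frame = self.pending.pop(self.next_seq)
        elif len(self.pending) >= self.max_wait_frames:
            # Enough later packets are queued: treat the missing one as lost.
            frame = bytes(FRAME_BYTES)
        else:
            return None
        self.next_seq += 1
        return frame
```

The playback loop would call `pop()` once every 20 ms. A larger `max_wait_frames` tolerates more reordering but adds latency, which is the usual trade-off with jitter buffers.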

eustlb commented 2 months ago

Hey there, I am looking into improving the client-server connection setup. UDP does seem like a more suitable option, yet handling lost and out-of-order packets is not trivial. I have not found a good RTP implementation to build upon for this (I found this one, but the RTP part would need to be extracted from it). Do you have a good pointer here @joecryptotoo ?

codearranger commented 2 months ago

@eustlb maybe just hook it up to FreeSWITCH with a custom FS module and let FS handle the VoIP stuff. Then you would be able to route phone calls to and from the AI agent.

codearranger commented 2 months ago

This module may work.

https://github.com/jambonz/freeswitch-modules/tree/main/mod_audio_fork

codearranger commented 2 months ago

Here we go! https://developer.signalwire.com/freeswitch/FreeSWITCH-Explained/Modules/mod_unimrcp_6586728/