AznamirWoW opened this issue 2 months ago
That happens if your GPU is not fast enough for the Parler generation; sadly, we tested this with quite a beefy setup and optimized a bit for that. On the MPS support branch I added support for MeloTTS. You need to install it manually, but the branch has instructions; then you can pass '--tts melo' to the generation script and use it instead. It's quite a bit smaller, so the audio should not be choppy anymore. Let me know if you try it and how it works!
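(For reference, the switch is just a CLI flag on the pipeline entry point, e.g. something like `python s2s_pipeline.py --tts melo`; the script name here is an assumption, so check the branch README for the exact command and the MeloTTS install steps.)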
Yup, it is choppy even when I used an H100 on RunPod. The way I was able to resolve it somewhat was by increasing the value of play_steps_s to 1.0; it should help.
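For anyone wondering where that maps to in code, here is a minimal sketch of Parler-TTS streaming with play_steps derived from play_steps_s. It loosely follows the Parler-TTS streaming example; the checkpoint name, prompts, and generation kwargs are illustrative, not the exact ones used in this repo.

```python
import torch
from threading import Thread
from parler_tts import ParlerTTSForConditionalGeneration, ParlerTTSStreamer
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
repo = "parler-tts/parler-tts-mini-v1"  # illustrative checkpoint
model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)

# play_steps_s controls how much audio is decoded before a chunk is emitted;
# a larger value means fewer, longer chunks and usually less choppy playback.
play_steps_s = 1.0
frame_rate = model.audio_encoder.config.frame_rate
streamer = ParlerTTSStreamer(model, device=device, play_steps=int(frame_rate * play_steps_s))

description = tokenizer("A female speaker with a clear, friendly voice.", return_tensors="pt").to(device)
prompt = tokenizer("Hello there, how can I assist you today?", return_tensors="pt").to(device)

# Run generation in a background thread and consume audio chunks as they arrive.
thread = Thread(
    target=model.generate,
    kwargs=dict(
        input_ids=description.input_ids,
        prompt_input_ids=prompt.input_ids,
        streamer=streamer,
    ),
)
thread.start()

for audio_chunk in streamer:          # numpy float32 chunks, roughly play_steps_s seconds each
    if audio_chunk.shape[0] == 0:     # the streamer may emit an empty final chunk
        break
    print(f"got {audio_chunk.shape[0]} samples")  # play or send the chunk here
thread.join()
```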
I forgot to mention that; it can also help. It was also choppy on an H100? That's unexpected! Maybe the connection latency is a bit high there? We tested with a 4090, but it was on our own network.
Yeah, running within the same network means there is no network latency, and that makes a huge difference.
I also tested on an RTX 4090 and the audio came out choppy. I checked the GPU usage and it stays under the card's maximum.
Increasing play_steps_s did help remove most of the choppiness, but some still remains. I have not noticed much load on the GPU; the 6700 XT does a perfectly fine job.
@jasonngap1, that's interesting, could you share more details of your setup? If I can figure out what might be happening, I could fix it :)
I think the issue could be the use of TCP. UDP/RTP would be ideal for audio streams.
I am having the same issues running on a 36-core Xeon with an RTX 4090 and 256 GB of system RAM.
Yes, exactly. Using TCP does cause latency issues; UDP is a better option, and WebRTC might be a solution. Plus, the lag is specifically in the transmission of the TTS-generated audio.
My system specs are similar to yours (Xeon plus RTX 4090). The strange thing is that when I run it on Windows the audio comes out fine. There was a lot of hallucination in the STT transcription, but that's another issue.
For the STT transcription, I found some input setups work better than others. For me it works really well with the laptop microphone, but with my AirPods it creates a lot of hallucinations.
Yeah, there might also be a need for a more robust VAD model; sometimes it even misses the first few spoken words. I also tried changing the VAD sensitivity parameters, but it didn't make much of a difference.
I changed the default speech_pad_ms to 500 ms to try to get Whisper to still transcribe the audio even when the VAD fails.
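If it helps, here is a rough sketch of where that padding lives with Silero VAD's streaming iterator. The threshold, window size, and file name are illustrative assumptions, and the utils tuple layout can differ between Silero versions.

```python
import torch

# Load Silero VAD via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

# speech_pad_ms pads each detected speech segment on both sides, so a late VAD
# trigger still leaves Whisper some leading audio to transcribe.
vad_iterator = VADIterator(
    model,
    threshold=0.5,               # lower = more sensitive, more false positives
    sampling_rate=16000,
    min_silence_duration_ms=100,
    speech_pad_ms=500,           # bumped up from the ~30 ms default, as above
)

wav = read_audio("mic_capture.wav", sampling_rate=16000)  # hypothetical recording
window = 512                     # samples per VAD step at 16 kHz
for i in range(0, len(wav) - window, window):
    event = vad_iterator(wav[i:i + window], return_seconds=True)
    if event:
        print(event)             # {'start': ...} or {'end': ...} in seconds
vad_iterator.reset_states()
```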
Btw, here are some of my learnings to share:
I successfully switched to UDP and the results were amazing:
- Less than 10 ms for a 1-second WAV file
- About 10 ms for a 4-5 second WAV file
- About 20 ms for an 11-second WAV file
Switching to a socket-based server using UDP could be a great approach to remove audio choppiness and reduce latency. However, packet transfer handling needs to be robust, as I experienced some noise/choppiness while transferring a large WAV file. I think transferring very small chunks and stitching them up on the client side could be a great alternative to eliminate noise.
Now, with UDP I am able to receive 5 seconds of audio (about 80 characters of text) in about 0.6 seconds, which includes the time to send the text, the time to generate the speech, and the time to get the audio back.
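A minimal sketch of the "small chunks plus sequence numbers" idea over plain UDP sockets (the address, chunk size, and end-of-stream marker are assumptions for illustration; lost chunks or a lost end marker would need real handling):

```python
import socket
import struct

CHUNK = 1024                       # payload bytes per datagram; keep well under the MTU
ADDR = ("127.0.0.1", 9999)         # hypothetical receiver address
END = 0xFFFFFFFF                   # sequence number reserved as an end-of-stream marker

def send_audio(pcm_bytes: bytes, sock: socket.socket) -> None:
    """Send audio as small, sequence-numbered datagrams so the client can stitch them."""
    for seq, offset in enumerate(range(0, len(pcm_bytes), CHUNK)):
        payload = pcm_bytes[offset:offset + CHUNK]
        sock.sendto(struct.pack("!I", seq) + payload, ADDR)   # 4-byte seq + raw audio
    sock.sendto(struct.pack("!I", END), ADDR)                 # tell the client we're done

def receive_audio(sock: socket.socket) -> bytes:
    """Collect datagrams and reassemble them in sequence order."""
    chunks = {}
    while True:
        data, _ = sock.recvfrom(CHUNK + 4)
        seq = struct.unpack("!I", data[:4])[0]
        if seq == END:
            break
        chunks[seq] = data[4:]
    # Stitch in order; any lost sequence numbers simply leave a gap.
    return b"".join(chunks[s] for s in sorted(chunks))

# Usage sketch:
#   sender:   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM); send_audio(pcm, sock)
#   receiver: sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM); sock.bind(ADDR); receive_audio(sock)
```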
I didn't look at the socket code too much. Is it TCP by default? Can we change it to UDP?
Typically with VoIP we use RTP, which runs over UDP, and the payload is usually 20 ms of audio sent every 20 ms. Because it's UDP, packets can arrive out of order, so you need a jitter buffer on the receiving end to re-order them. Each packet has a sequence number and a timestamp, among a few other things.
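To make the jitter-buffer idea concrete, here is a toy reorder buffer keyed on sequence numbers (not production-grade; a real one also uses the RTP timestamps to pace playout and manage delay):

```python
import heapq

class JitterBuffer:
    """Minimal reorder buffer: hold a few packets, release them in sequence order."""

    def __init__(self, depth: int = 5):
        self.depth = depth            # packets to hold back before forcing playout
        self.heap = []                # (sequence_number, payload), a min-heap
        self.next_seq = None          # next sequence number we expect to play

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        """Yield payloads whose turn has come; give up on packets that never arrive."""
        while self.heap and (self.heap[0][0] == self.next_seq or len(self.heap) > self.depth):
            seq, payload = heapq.heappop(self.heap)
            if self.next_seq is not None and seq < self.next_seq:
                continue              # duplicate or too-late packet, drop it
            self.next_seq = seq + 1
            yield payload

# Usage sketch: push() each incoming datagram's (seq, payload),
# then feed whatever pop_ready() yields to the audio player.
```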
Hey there, I am looking into improving the client-server connection setup. Indeed, UDP seems like a more suitable option, yet handling lost and out-of-order packets is not trivial. I am not finding a good RTP implementation to build upon for this (I found this one, but the RTP part would need to be extracted from it). Do you have a good pointer here, @joecryptotoo?
@eustlb maybe just hook it up to FreeSWITCH with a custom FS module. Let FS handle the VoIP stuff. Then you would be able to route phone calls to/from the AI agent.
This module may work.
https://github.com/jambonz/freeswitch-modules/tree/main/mod_audio_fork
Is there any way to improve the audio output?
USER: Hello?
ASSISTANT: Hello there!
2024-08-20 19:17:38,156 - main - INFO - Time to first audio: 1.694
ASSISTANT: How can I assist you today?
Audio: H-ow-ca-nai-a-ssi-st-yo-u-to-day?
(Using the ZLUDA emulator and an RX 6700 XT GPU; I had to downgrade torch to 2.3.0.)