lugia19 / elevenlabslib

Full python wrapper for the elevenlabs API.
MIT License

Langchain stream to ReusableInputStreamerNoPlayback #27

Closed · olegchomp closed this issue 5 months ago

olegchomp commented 6 months ago

Hi! Thank you for amazing lib again :) I'm asking for advice on impoving speed generation, i use ReusableInputStreamerNoPlayback with Langchain stream and found that there is zero speed improvements with my pipeline. Looks like it anyway wait for full message and only then start to pushing audio. I think that i missing some important steps, sorry for silly question.

from elevenlabslib import *  # assumed wildcard import, as in the library's examples
import queue

# `chat` and `voice` are initialized elsewhere:
# chat = ...   # initialize LLM with Langchain
# voice = ...  # elevenlabslib voice object
voice_queue = queue.Queue()  # consumed by the playback side of the pipeline

def stream_text(user_input):
    # Stream the LLM response and yield it word by word, so the input
    # streamer can start generating audio before the full message is done.
    for s in chat.stream(user_input):
        words = s.content.split()
        for word in words:
            print(word)
            yield word + " "  # re-add the whitespace that split() removed

input_streamer_no_playback = ReusableInputStreamerNoPlayback(voice,
                                    generationOptions=GenerationOptions(model='eleven_multilingual_v2',
                                                                        output_format="pcm_24000"))
while True:
    user_input = input("Enter something (type 'exit' to end): ")
    if user_input == 'exit':
        break

    audio_future, transcript_future = input_streamer_no_playback.queue_audio(stream_text(user_input))
    audio_queue = audio_future.result()
    transcript_queue = transcript_future.result()

    # Drain the audio queue; a None item marks the end of the generation.
    audio = audio_queue.get()
    while audio is not None:
        voice_queue.put(audio)
        audio = audio_queue.get()
lugia19 commented 6 months ago

A couple things:

  1. The websockets will always wait for a certain amount of text to be present before generation begins, which is at least 50 characters. You can control it by setting the chunk_schedule in the WebsocketOptions when you create the input streamer. It defaults to 150 to strike a balance between latency and quality.
  2. eleven_multilingual_v2 is the slowest model, and is actually too slow for real-time generation. To compensate, the library buffers some of the output, waiting to begin playback so that it does not stutter.
  3. You're not setting the latencyOptimizationLevel in the GenerationOptions, which does help.

In short, to absolutely minimize latency:

  • Use the eleven_turbo_v2 model (or, if you need multilingual, use multilingual v1 until multilingual turbo comes out)
  • Set latencyOptimizationLevel to 3 (4 doesn't help that much and can make the voice sound weird)
  • Set the WebsocketOptions when creating the input streamer to have a chunk_schedule of just [50]
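
Put together, that looks something like this (off the top of my head, so double-check the exact keyword names against the docs - in particular, the websocketOptions argument name is from memory):

# Sketch of a minimum-latency setup based on the three points above.
input_streamer_no_playback = ReusableInputStreamerNoPlayback(
    voice,
    generationOptions=GenerationOptions(
        model='eleven_turbo_v2',        # fastest model
        output_format="pcm_24000",
        latencyOptimizationLevel=3),    # 4 barely helps and can make the voice sound weird
    websocketOptions=WebsocketOptions(chunk_schedule=[50]))  # begin generating at ~50 characters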

olegchomp commented 6 months ago

A couple things:

  1. The websockets will always wait for a certain amount of text to be present before generation begins, which is at least 50. You can control it by setting the chunk_schedule in the WebsocketOptions when you create the input streamer. It defaults to 150 to strike a balance between latency and quality.
  2. eleven_multilingual_v2 is the slowest model, and is actually too slow for real time generation. To compensate, the library buffers some of the output, waiting to begin playback so that it does not stutter.
  3. You're not setting the latencyOptimizationLevel in the GenerationOptions, which does help.

In short, to absolutely minimize latency:

  • Use the eleven_turbo_v2 model (or, if you need multilingual, use multilingual v1 until multilingual turbo comes out)
  • Set latencyOptimizationLevel to 3 (4 doesn't help that much and can make the voice sound weird)
  • Set the WebsocketOptions when creating the input streamer to have a chunk_schedule of just [50]

Thank you! Unfortunately the language I need is only supported by eleven_multilingual_v2. Before diving into the docs: could Synthesizer be faster in that case, or is it a totally different pipeline?

lugia19 commented 6 months ago

You can try it, but it's unlikely - synthesizer has to wait for the whole text to be present either way, so odds are they'll be equal for short messages, and synthesizer will be worse for longer ones.

Since the language is only available on multi v2, I would recommend making sure the style is set to 0 in the generationOptions (style makes the generation a lot slower) and setting the latencyOptimizationLevel to 3.
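
Concretely, something along these lines (same GenerationOptions fields discussed above; check the docs for the exact defaults):

# Fastest practical settings when you're locked into multilingual v2
generationOptions = GenerationOptions(
    model='eleven_multilingual_v2',
    output_format="pcm_24000",
    style=0,                      # nonzero style makes generation a lot slower
    latencyOptimizationLevel=3)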

Other than that, there isn't much that can be done - from my testing, the above should result in 4-5 seconds of latency before playback begins.