livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents
Apache License 2.0

[Very Important for LiveKit] Request: add a general mechanism to customize VoiceAssistant plugins ASAP #821

Closed · taylorgwei closed this 1 month ago

taylorgwei commented 1 month ago

LiveKit brings great RTC to the world, both open source and cloud. Awesome! But the LiveKit Agents framework has one big problem:

The VoiceAssistant pipeline is hardcoded as VAD + STT + LLM + TTS, which makes it hard to customize and causes a lot of problems whenever someone wants to add an extra plugin to the pipeline or remove one. The following cases fail 100% of the time:

- VAD + STT + LLM only (TTS removed): VoiceAssistant may crash
- VAD plugin removed but the others kept: VoiceAssistant may crash
- VAD + Multimodal only: VoiceAssistant may crash
- extra processing after TTS: VoiceAssistant may crash
- inserting a plugin at the end of the pipeline to customize chat_ctx dynamically: there is no way to do this, and working around it requires a lot of complex code changes

The related hardcoded call is here:

```python
assistant = VoiceAssistant(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
)
```

Result: right now the Agents framework is good for demos but not for production, because every customer has very specific demands, which call for a general and easy way to customize the flow.

Expect: VoiceAssistant should be a general pipeline framework that just manages data flow (text, voice) between plugins and connects them to accomplish a task. It should not depend on the type or purpose of a plugin, and it should not matter how the plugin works internally.
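To make the ask concrete, here is a minimal sketch of the kind of generic pipeline being requested. Everything in it (the `Plugin` protocol, the `Pipeline` class, the frame types) is hypothetical illustration, not LiveKit API: each plugin turns one async stream of frames into another, so arbitrary plugins can be chained, and dropping TTS would just mean leaving it out of the chain.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Protocol

# Hypothetical frame types: the "standard interface" for text and voice data.
@dataclass
class AudioFrame:
    pcm: bytes

@dataclass
class TextFrame:
    text: str

Frame = AudioFrame | TextFrame

class Plugin(Protocol):
    """Every plugin maps one async stream of frames to another."""
    def process(self, frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]: ...

class Pipeline:
    """Connects arbitrary plugins; knows nothing about what each one does."""
    def __init__(self, *plugins: Plugin) -> None:
        self.plugins = plugins

    def run(self, source: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
        stream = source
        for plugin in self.plugins:
            stream = plugin.process(stream)
        return stream

# Example plugin: uppercases text frames, passes everything else through.
class Shout:
    async def process(self, frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
        async for f in frames:
            yield TextFrame(f.text.upper()) if isinstance(f, TextFrame) else f

async def main() -> None:
    async def source() -> AsyncIterator[Frame]:
        yield TextFrame("hello")

    async for frame in Pipeline(Shout()).run(source()):
        print(frame)

asyncio.run(main())
```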

BTW: the latest version seems better than before, splitting VoiceAssistant into a Pipeline concept, but the flow is still hardcoded inside.

davidzhao commented 1 month ago

Hey taylor,

Appreciate the feedback. I'd love to hear about any use cases you've tried to implement and found limiting.

We considered that kind of approach early on and decided against building a GStreamer-like abstraction.

Instead, we offer a few fairly straightforward ways to control the conversation flow.

In summary, we came to the opposite conclusion from yours: to do something extremely well, it's better to focus on that core use case directly rather than on an abstraction layer that could produce it. That is what we've done with VoiceAssistant. It is powering many production applications, and I disagree with the assessment that it can't be customized to specific needs.

With that said, we welcome different approaches to the same problem. We'd definitely like to make it more customizable if there's a use case you're finding difficult to build with the current hooks in place.
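For example (a rough sketch, and the exact hook name varies across releases; in later livekit-agents versions it is `before_llm_cb` on VoicePipelineAgent), a callback can rewrite the chat context right before every LLM call, which covers the dynamic chat_ctx case raised above:

```python
from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

def trim_history(agent: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    # Customize chat_ctx dynamically before each LLM inference:
    # keep the system prompt plus only the last 10 messages.
    if len(chat_ctx.messages) > 11:
        chat_ctx.messages[1:-10] = []

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    before_llm_cb=trim_history,  # hook name may differ in older releases
)
```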

jezell commented 1 month ago

Another option, if you want to turn some things off like TTS or STT, is to just make a provider that doesn't do anything with the data. All the current providers deal with audio frames, but that's not part of the required surface area if you want to send some stuff to null or generate data from some other source.
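Rough sketch of the idea (hypothetical names; the real `tts.TTS` base class has more surface area, so adapt this to the plugin interface in your version): a null TTS that yields a single silent frame instead of speech, which effectively sends audio output to null while keeping the pipeline shape intact.

```python
import asyncio
from typing import AsyncIterator

from livekit import rtc  # AudioFrame comes from the LiveKit Python SDK

SAMPLE_RATE = 24000
NUM_CHANNELS = 1

class NullTTS:
    """Hypothetical no-op provider: satisfies a synthesize-style
    interface but produces 10 ms of silence instead of speech."""

    async def synthesize(self, text: str) -> AsyncIterator[rtc.AudioFrame]:
        # AudioFrame.create returns a zero-filled (i.e. silent) frame.
        yield rtc.AudioFrame.create(SAMPLE_RATE, NUM_CHANNELS, SAMPLE_RATE // 100)

async def demo() -> None:
    async for frame in NullTTS().synthesize("this text is discarded"):
        print(frame.samples_per_channel, "silent samples")

asyncio.run(demo())
```

The same trick works in reverse: a source-style provider can yield frames generated from somewhere other than the microphone.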

taylorgwei commented 1 month ago

@davidzhao I'm not asking for a very high-level abstraction like GStreamer's media processing. Just have every plugin follow a standard interface for input and output (with standard text and voice data formats as well), and allow connecting arbitrary plugins.

Anyway, this is just our friendly feedback. Thanks also to @jezell for providing a workaround.

davidzhao commented 1 month ago

@taylorgwei I appreciate the feedback. Do you have a use case in mind that you are finding difficult to accomplish?