livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents
Apache License 2.0

[Very Important for LiveKit] Request: add a general mechanism to customize VoiceAssistant plugins ASAP #821

Closed · taylorgwei closed this 1 month ago

taylorgwei commented 1 month ago

LiveKit brings great RTC to the world, both open source and cloud. Awesome! But the LiveKit Agents framework has one big problem:

The VoiceAssistant pipeline is hardcoded as VAD + STT + LLM + TTS, which makes it hard to customize and causes a lot of problems whenever someone wants to add an extra plugin to the pipeline or remove one. The following cases fail 100% of the time:

- VAD + STT + LLM only (TTS removed): VoiceAssistant may crash
- VAD plugin removed but the others kept: VoiceAssistant may crash
- VAD + Multimodal only: VoiceAssistant may crash
- extra processing after TTS: VoiceAssistant may crash
- inserting a plugin at the end of the pipeline to customize chat_ctx dynamically: there is no way to do this, and working around it requires a lot of complex code changes

The related hardcoded call is here:

```python
assistant = VoiceAssistant(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
)
```

Result: right now the Agents framework is good for demos but not for production, because every customer has very specific demands, which call for a general and easy way to customize the flow.

Expect: VoiceAssistant should be a general pipeline framework that just manages data flow (text, voice) between plugins and connects them to accomplish a task. It should not depend on the type or purpose of a plugin, and it should not matter how the plugin works internally.
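To make the ask concrete, here is a minimal sketch of the kind of generic pipeline being requested. Everything in it (the `Plugin` protocol, the `Pipeline` class, the frame types) is hypothetical illustration, not LiveKit API: each plugin turns one async stream of frames into another, so arbitrary plugins can be chained, and dropping TTS would just mean leaving it out of the chain.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Protocol

# Hypothetical frame types: the "standard interface" for text and voice data.
@dataclass
class AudioFrame:
    pcm: bytes

@dataclass
class TextFrame:
    text: str

Frame = AudioFrame | TextFrame

class Plugin(Protocol):
    """Every plugin maps one async stream of frames to another."""
    def process(self, frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]: ...

class Pipeline:
    """Connects arbitrary plugins; knows nothing about what each one does."""
    def __init__(self, *plugins: Plugin) -> None:
        self.plugins = plugins

    def run(self, source: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
        stream = source
        for plugin in self.plugins:
            stream = plugin.process(stream)
        return stream

# Example plugin: uppercases text frames, passes everything else through.
class Shout:
    async def process(self, frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
        async for f in frames:
            yield TextFrame(f.text.upper()) if isinstance(f, TextFrame) else f

async def main() -> None:
    async def source() -> AsyncIterator[Frame]:
        yield TextFrame("hello")

    async for frame in Pipeline(Shout()).run(source()):
        print(frame)

asyncio.run(main())
```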

BTW: the latest version seems better than before, splitting VoiceAssistant into a Pipeline concept, but the flow is still hardcoded inside.

davidzhao commented 1 month ago

Hey taylor,

Appreciate the feedback. I'd love to hear about any use cases you've tried to implement and found limiting.

We considered that kind of approach early on and decided against building a GStreamer-like abstraction.

Instead, we offer a few fairly straightforward ways to control the conversation flow.

In summary, we came to the opposite conclusion from yours: to do something extremely well, it's better to focus on that core use case directly rather than on an abstraction layer that could produce it. That is what we've done with VoiceAssistant. It is powering many production applications, and I disagree with the assessment that it can't be customized to specific needs.

With that said, we welcome different approaches to the same problem. We'd definitely like to make it more customizable if there's a use case you're finding difficult to build with the current hooks in place.
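For example (a rough sketch, and the exact hook name varies across releases; in later livekit-agents versions it is `before_llm_cb` on VoicePipelineAgent), a callback can rewrite the chat context right before every LLM call, which covers the dynamic chat_ctx case raised above:

```python
from livekit.agents import llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

def trim_history(agent: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    # Customize chat_ctx dynamically before each LLM inference:
    # keep the system prompt plus only the last 10 messages.
    if len(chat_ctx.messages) > 11:
        chat_ctx.messages[1:-10] = []

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    before_llm_cb=trim_history,  # hook name may differ in older releases
)
```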

jezell commented 1 month ago

Another option, if you want to turn some things off like TTS or STT, is to just make a provider that doesn't do anything with the data. All the current providers deal with audio frames, but that's not part of the required surface area if you want to send some stuff to null or generate data from some other source.
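Rough sketch of the idea (hypothetical names; the real `tts.TTS` base class has more surface area, so adapt this to the plugin interface in your version): a null TTS that yields a single silent frame instead of speech, which effectively sends audio output to null while keeping the pipeline shape intact.

```python
import asyncio
from typing import AsyncIterator

from livekit import rtc  # AudioFrame comes from the LiveKit Python SDK

SAMPLE_RATE = 24000
NUM_CHANNELS = 1

class NullTTS:
    """Hypothetical no-op provider: satisfies a synthesize-style
    interface but produces 10 ms of silence instead of speech."""

    async def synthesize(self, text: str) -> AsyncIterator[rtc.AudioFrame]:
        # AudioFrame.create returns a zero-filled (i.e. silent) frame.
        yield rtc.AudioFrame.create(SAMPLE_RATE, NUM_CHANNELS, SAMPLE_RATE // 100)

async def demo() -> None:
    async for frame in NullTTS().synthesize("this text is discarded"):
        print(frame.samples_per_channel, "silent samples")

asyncio.run(demo())
```

The same trick works in reverse: a source-style provider can yield frames generated from somewhere other than the microphone.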

taylorgwei commented 1 month ago

@davidzhao I'm not asking for a very high-level abstraction like GStreamer's media processing. Just have every plugin follow a standard interface for input and output (with standard text and voice data formats as well), and allow connecting arbitrary plugins.

Anyway, this is just our friendly feedback. Thanks also to @jezell for providing a workaround.

davidzhao commented 1 month ago

@taylorgwei I appreciate the feedback. Do you have a use case in mind that you are finding difficult to accomplish?