Closed taylorgwei closed 1 month ago
Hey taylor,
Appreciate the feedback. I'd love to hear any use cases that you've tried to implement and found there to be limitations.
We considered that approach early on and decided against building a GStreamer-like abstraction.
Instead, we offer a few fairly straightforward ways to control the conversation flow:

- `before_llm_cb`, where the entire conversation context can be modified
- `before_tts_cb`, where you can put guardrails around the output, or modify what's said to the user

In summary, we came to the opposite conclusion from you: to do something extremely well, it's better to focus on that core use case directly, rather than on an abstraction layer that could produce the use case. That is what we've done with VoiceAssistant. It is powering many production applications, and I disagree with the assessment that it is not customizable to specific needs.
With that said, we welcome different approaches to the same problem. We'd definitely like to make it more customizable if there's a use case you're finding difficult to build with the current hooks in place.
Another option if you want to turn some things off like TTS or STT is just to make a provider that doesn't do anything with the data. All the current providers deal with audio frames, but it's not part of the required surface area if you want to send some stuff to null or generate data from some other source.
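As a self-contained sketch of that idea (the class and method names here are illustrative, not the actual livekit-agents plugin interface):

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


# Hypothetical frame type standing in for the framework's audio frame.
@dataclass
class AudioFrame:
    data: bytes
    sample_rate: int = 48000


class NullTTS:
    """A TTS 'provider' that exposes the same surface area as a real one
    but discards the text and yields silence, effectively disabling
    speech output without touching the rest of the pipeline."""

    async def synthesize(self, text: str) -> AsyncIterator[AudioFrame]:
        # Send the text to null; yield one short silent frame so any
        # downstream consumer that expects audio keeps working.
        yield AudioFrame(data=b"\x00" * 960)  # ~10 ms @ 48 kHz mono 16-bit


async def main() -> None:
    tts = NullTTS()
    async for frame in tts.synthesize("this will never be spoken"):
        print(f"silent frame: {len(frame.data)} bytes")


asyncio.run(main())
```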
@davidzhao We're not asking for a highly abstract layer like GStreamer's media processing. Just have every plugin follow a standard input/output interface (including the text and voice data formats), and allow connecting arbitrary plugins.
Anyway, this is just our friendly feedback. Thanks also to @jezell for providing a workaround.
@taylorgwei I appreciate the feedback. Do you have a use case in mind that you are finding difficult to accomplish?
LiveKit brings very good RTC to the world, both open source and cloud. Awesome! But LiveKit Agents has one big problem:
LiveKit's VoiceAssistant pipeline is hardcoded as VAD+STT+LLM+TTS, which makes it pretty hard to customize and causes a lot of problems whenever someone wants to add or remove a plugin. The following cases fail 100% of the time:

- If you just need VAD+STT+LLM (removing TTS), VoiceAssistant may crash.
- If you remove the VAD plugin (but keep the others), VoiceAssistant may crash.
- If you organize it as VAD + Multimodal, VoiceAssistant may crash.
- If you want extra processing after TTS, there is no way to insert a plugin at the end of the pipeline.
- If you want to customize chat_ctx dynamically, it is complex and requires a lot of code changes.
The related hardcoding is here:

```python
assistant = VoiceAssistant(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
)
```
Result: right now the Agents framework is good for demos but not for production, because every customer has very specific demands, which calls for a general and easy way to customize the flow.
Expected: VoiceAssistant should be a general pipeline framework that just manages the data flow (text, voice) between plugins and connects plugins to finish a task. It should NOT depend on the type/purpose of a plugin or on how the plugin works internally; see the sketch below.
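Something in this spirit, for example (purely illustrative, with hypothetical stage names; none of this is livekit-agents API):

```python
import asyncio
from typing import Any, AsyncIterator, Callable

# Every plugin is just an async transformer over a stream, so stages
# can be added, removed, or reordered without the framework caring
# what each one does internally.
Plugin = Callable[[AsyncIterator[Any]], AsyncIterator[Any]]


def pipeline(*stages: Plugin) -> Plugin:
    """Compose plugins left to right into one stream transformer."""
    def run(source: AsyncIterator[Any]) -> AsyncIterator[Any]:
        stream = source
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


# Example stages; each follows the same input/output contract.
async def vad(frames: AsyncIterator[Any]) -> AsyncIterator[Any]:
    async for frame in frames:      # audio -> speech segments
        yield frame


async def stt(segments: AsyncIterator[Any]) -> AsyncIterator[str]:
    async for seg in segments:      # speech -> text
        yield f"transcript({seg})"


async def llm(texts: AsyncIterator[str]) -> AsyncIterator[str]:
    async for text in texts:        # text -> text
        yield f"reply({text})"


async def mic() -> AsyncIterator[str]:
    for frame in ("frame1", "frame2"):  # stand-in audio source
        yield frame


async def main() -> None:
    # VAD+STT+LLM with no TTS: the composition simply omits that stage.
    agent = pipeline(vad, stt, llm)
    async for out in agent(mic()):
        print(out)


asyncio.run(main())
```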
BTW: the latest version seems better than before, splitting VoiceAssistant into a Pipeline concept, but it is still hardcoded inside.