livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents
Apache License 2.0

before_llm_cb is not triggered for text messages from user #783

Open willsmanley opened 1 month ago

willsmanley commented 1 month ago

before_llm_cb is only called when there is an audio message from the user

If there is a text message, before_llm_cb is not called.

It seems like, for consistency, this callback should also be invoked for text messages.

theomonnom commented 1 month ago

Hey, that's because the LLMStream is created manually when handling chat messages: see https://github.com/livekit/agents/blob/fe4471aa147346d4357c542b93917605c6700750/examples/voice-assistant/minimal_assistant.py#L63
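To make the split concrete, here is a toy model (plain Python stand-ins, not the real livekit API; all names here are hypothetical) of why the callback fires only for audio: the voice pipeline routes through before_llm_cb internally, while the text example builds the LLM stream by hand and skips it.

```python
# Record which turns actually went through the callback.
calls = []

def before_llm_cb(assistant, chat_ctx):
    # Stand-in for the user-supplied callback.
    calls.append("before_llm_cb")
    return f"stream({chat_ctx})"

def handle_audio_turn(assistant, chat_ctx):
    # Internal voice pipeline: the callback is invoked before the LLM call.
    return before_llm_cb(assistant, chat_ctx)

def handle_text_turn(assistant, chat_ctx):
    # Manual path from the chat example: the stream is created directly,
    # so the callback never runs.
    return f"stream({chat_ctx})"

handle_audio_turn("assistant", ["hi"])
handle_text_turn("assistant", ["hi"])
print(calls)  # the callback ran once, only for the audio turn
```

The asymmetry in this sketch is exactly the behavior reported above: one code path owns the callback, the other bypasses it.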

willsmanley commented 1 month ago
[screenshot: chat with KITT showing the assistant forgetting a recent text message]

that makes sense, but the result of that decision is that the assistant has no memory of recent text messages, as shown in this example with KITT.

if chat conversational memory is not supported in the same way as voice conversational memory, it seems like the chat option shouldn't be supported at all
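The memory loss described here follows directly from the copy-based handler: the user message lands on a throwaway copy of the context, never on the assistant's real one. A minimal sketch, using a hypothetical plain-Python ChatContext stand-in rather than the real livekit class:

```python
from dataclasses import dataclass, field

@dataclass
class ChatContext:
    """Toy stand-in for the assistant's chat context."""
    messages: list = field(default_factory=list)

    def copy(self) -> "ChatContext":
        # Shallow copy of the message list, as the example handler does.
        return ChatContext(list(self.messages))

    def append(self, role: str, text: str) -> "ChatContext":
        self.messages.append({"role": role, "text": text})
        return self

true_ctx = ChatContext()  # the assistant's "real" history

# The "before" handler pattern: append to a copy, not the original.
ctx_copy = true_ctx.copy()
ctx_copy.append(role="user", text="what's my name?")

print(len(ctx_copy.messages))  # 1 -- the copy saw the message
print(len(true_ctx.messages))  # 0 -- the real history never did
```

On the next voice turn the assistant reads from its true context, so every text message handled this way is invisible to it.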

davidzhao commented 1 month ago

it is a problem that text messages are missing from convo history. I think we should standardize on VoiceAssistant handling Chat automatically once the new Chat protocol makes it in. wdyt @theomonnom @lukasIO @bcherry ?

willsmanley commented 1 month ago

one quick fix is to not copy the chat context and just append the message to the true context. you'd also have to make sure that you manually invoke before_llm_cb on the application side.

# before
async def answer_from_text(txt: str):
    chat_ctx = assistant.chat_ctx.copy()
    chat_ctx.append(role="user", text=txt)
    stream = assistant.llm.chat(chat_ctx=chat_ctx)
    await assistant.say(stream)

# after
async def answer_from_text(txt: str):
    assistant.chat_ctx.append(role="user", text=txt)
    stream = await before_llm_cb(assistant, assistant.chat_ctx)
    await assistant.say(stream)

however, this still does not trigger function calling or interruptions, and it requires a minor abstraction leak, so it would be ideal to have chat mode supported natively in the same way voice is.

it would also be really neat to have the option to disable voice synthesis for use in pure chat mode (stream transcripts aggressively instead of waiting on voice synthesis timings, and save on synthesis usage). I opened a separate issue for this request since it is related but different in scope: https://github.com/livekit/agents/issues/791