davidzhao opened this issue 2 weeks ago
It would be huge if we could even pre-buffer the audio for this message... Thanks guys!
Hey, what do you mean by pre-buffer? Do you want to synthesize the agent speech ahead of time?
@theomonnom Hi Theo, yes exactly. Then we could get to very low perceived latencies, on par with OpenAI Realtime, by playing an initial prebuffered first reaction while we prepare the full answer.
Hume, for example, also offers this feature.
@davidzhao Where do you think the "quick response" before the function call is done should come from: prompting the LLM to say something after `function_calls_collected`, or allowing the function to return a customized text early, before it's done?
Some folks are already prompting the LLM to return the right text. I think others are choosing to use `agent.say` to queue up a custom placeholder response.

Currently, when trying the latter path, the speech doesn't get enqueued right away.
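For reference, the `agent.say` path looks roughly like this. A minimal sketch assuming the current `VoicePipelineAgent` / `AgentCallContext` surface; the tool name and body are made up:

```python
import asyncio

from livekit.agents import llm
from livekit.agents.pipeline import AgentCallContext


class AssistantFnc(llm.FunctionContext):
    @llm.ai_callable(description="Look up the status of an order")
    async def lookup_order(self, order_id: str) -> str:
        # grab the agent handling the current call
        agent = AgentCallContext.get_current().agent

        # queue a placeholder so the user hears something while we work;
        # today this plays only after the LLM's follow-up response
        await agent.say("Give me a second while I look that up...")

        await asyncio.sleep(2)  # stand-in for the slow external lookup
        return f"order {order_id} has shipped"
```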
I see. I can take a look to check whether it's an issue of `agent.say` not working as expected.
The current function call logic looks like the following pseudo-code:

```python
fncs = speech_handle.synthesis_handle.play().join()
chat_ctx = speech_handle.source.chat_ctx

nest = 0
while fncs and nest < max_nested:
    tools_messages = [fnc.execute() for fnc in fncs]
    answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
    speech_handle.synthesis_handle = _synthesize_agent_speech(answer_llm_stream)
    fncs = speech_handle.synthesis_handle.play().join()
    nest += 1
```
So the function call blocks all later speech (including the one from `say`), and all the function calls share the same base `chat_ctx` from `speech_handle.source`. Not sure if it's intended, but the nested function call seemingly cannot be interrupted, as the `interrupted = answer_synthesis.interrupted` flag is not used, though this might rarely happen: https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/pipeline/pipeline_agent.py#L820
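If interruption should propagate, a small tweak to the pseudo-code above (just a sketch, not verified against the real code) would be to check the flag before starting another nested round:

```python
while fncs and nest < max_nested:
    tools_messages = [fnc.execute() for fnc in fncs]
    answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
    answer_synthesis = _synthesize_agent_speech(answer_llm_stream)
    speech_handle.synthesis_handle = answer_synthesis
    fncs = answer_synthesis.play().join()
    if answer_synthesis.interrupted:
        break  # user barged in, stop the nested function call loop
    nest += 1
```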
I have a proposal: we send the function calls to a `fnc_queue` and execute them in a separate task. It records the nest index and the base `chat_ctx` for each function call task, so it should behave the same as the current version, and any user interruption cancels all the queued function calls. The structure might be:
```python
# In _play_speech
fncs = speech_handle.synthesis_handle.play().join()
chat_ctx = speech_handle.source.chat_ctx
fnc_nest_count = speech_handle.fnc_nest_count

if interrupted:
    fncs_q.clear()
elif fnc_nest_count < max_nested:
    fncs_q.put(fncs, chat_ctx.copy(), fnc_nest_count)

# In another task
while True:
    fncs, chat_ctx, nest_count = fncs_q.get()
    tools_messages = [fnc.execute() for fnc in fncs]
    answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
    new_speech_handle = create_speech_handle(
        _synthesize_agent_speech(answer_llm_stream), fnc_nest_count=nest_count + 1
    )
    speech_q.put(new_speech_handle)
```
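For concreteness, the worker side could be a plain asyncio task along these lines; `fncs_q`, `speech_q`, `create_speech_handle`, and the `chat_ctx` arithmetic mirror the pseudo-code above and are not existing APIs:

```python
import asyncio

fncs_q: asyncio.Queue = asyncio.Queue()


async def _fnc_worker(speech_q: asyncio.Queue) -> None:
    while True:
        fncs, chat_ctx, nest_count = await fncs_q.get()
        # execute the collected calls, then synthesize the follow-up answer
        tools_messages = [fnc.execute() for fnc in fncs]
        answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
        new_speech_handle = create_speech_handle(
            _synthesize_agent_speech(answer_llm_stream),
            fnc_nest_count=nest_count + 1,
        )
        speech_q.put_nowait(new_speech_handle)
        fncs_q.task_done()


def _clear_pending_fncs() -> None:
    # called on user interruption: drop every queued function call
    while not fncs_q.empty():
        fncs_q.get_nowait()
        fncs_q.task_done()
```

The worker would be started once with something like `asyncio.create_task(_fnc_worker(speech_q))`.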
@theomonnom what do you think, is there any potential risk in implementing it like that?
During a function call, sometimes it's desirable to let the user know that this is going to take a while. While it is possible to prompt the LLM to return a text response while this is happening, it would offer more control to be able to do `agent.say("....")`.

Currently, when that is done, the speech gets queued after the LLM's response. We'd want the ability to inject a response as the immediate thing that's communicated.