davidzhao opened this issue 2 weeks ago
It would be huge if we could even pre-buffer the audio for this message... Thanks guys!
Hey, what do you mean by pre-buffer? Do you want to synthesize the agent speech ahead of time?
@theomonnom Hi Theo, yes exactly. Then we could get to very low perceived latencies, on par with OpenAI Realtime, by playing an initial prebuffered first reaction while we prepare the full answer.
Hume, for example, also offers this feature.
@davidzhao Where do you think the "quick response" before the function call is done should come from: prompting the LLM to say something after `function_calls_collected`, or allowing the function to return a customized text early, before it's done?
Some folks are already prompting the LLM to return the right text. I think others are choosing to use `agent.say` to queue up a custom placeholder response.

Currently, when trying the latter path, the speech doesn't get enqueued right away.
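For reference, the `agent.say` path looks roughly like this. A minimal sketch assuming the current `VoicePipelineAgent` / `AgentCallContext` surface; the tool name and body are made up:

```python
import asyncio

from livekit.agents import llm
from livekit.agents.pipeline import AgentCallContext


class AssistantFnc(llm.FunctionContext):
    @llm.ai_callable(description="Look up the status of an order")
    async def lookup_order(self, order_id: str) -> str:
        # grab the agent handling the current call
        agent = AgentCallContext.get_current().agent

        # queue a placeholder so the user hears something while we work;
        # today this plays only after the LLM's follow-up response
        await agent.say("Give me a second while I look that up...")

        await asyncio.sleep(2)  # stand-in for the slow external lookup
        return f"order {order_id} has shipped"
```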
I see. I can take a look to check whether it's an issue of `agent.say` not working as expected.
The current function call logic looks like the following pseudo-code:

```python
fncs = speech_handle.synthesis_handle.play().join()
chat_ctx = speech_handle.source.chat_ctx

nest = 0
while fncs and nest < max_nested:
    tools_messages = [fnc.execute() for fnc in fncs]
    answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
    speech_handle.synthesis_handle = _synthesize_agent_speech(answer_llm_stream)
    fncs = speech_handle.synthesis_handle.play().join()
    nest += 1
```
So the function call blocks all later speech (including the one from `say`), and all the function calls share the same base `chat_ctx` from `speech_handle.source`. Not sure if it's intended, but the nested function call seemingly cannot be interrupted, as the `interrupted = answer_synthesis.interrupted` flag is not used, though this might rarely happen: https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/pipeline/pipeline_agent.py#L820
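If interruption should propagate, a small tweak to the pseudo-code above (just a sketch, not verified against the real code) would be to check the flag before starting another nested round:

```python
while fncs and nest < max_nested:
    tools_messages = [fnc.execute() for fnc in fncs]
    answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
    answer_synthesis = _synthesize_agent_speech(answer_llm_stream)
    speech_handle.synthesis_handle = answer_synthesis
    fncs = answer_synthesis.play().join()
    if answer_synthesis.interrupted:
        break  # user barged in, stop the nested function call loop
    nest += 1
```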
I have a proposal: we send the function calls to a `fnc_queue` and execute them in a separate task. It records the nest index and the base `chat_ctx` for each function call task, so it should behave the same as the current version, and any user interruption cancels all the queued function calls. The structure might be:
```python
# In _play_speech
fncs = speech_handle.synthesis_handle.play().join()
chat_ctx = speech_handle.source.chat_ctx
fnc_nest_count = speech_handle.fnc_nest_count

if interrupted:
    fncs_q.clear()
elif fnc_nest_count < max_nested:
    fncs_q.put(fncs, chat_ctx.copy(), fnc_nest_count)

# In another task
while True:
    fncs, chat_ctx, nest_count = fncs_q.get()
    tools_messages = [fnc.execute() for fnc in fncs]
    answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
    new_speech_handle = create_speech_handle(
        _synthesize_agent_speech(answer_llm_stream), fnc_nest_count=nest_count + 1
    )
    speech_q.put(new_speech_handle)
```
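For concreteness, the worker side could be a plain asyncio task along these lines; `fncs_q`, `speech_q`, `create_speech_handle`, and the `chat_ctx` arithmetic mirror the pseudo-code above and are not existing APIs:

```python
import asyncio

fncs_q: asyncio.Queue = asyncio.Queue()


async def _fnc_worker(speech_q: asyncio.Queue) -> None:
    while True:
        fncs, chat_ctx, nest_count = await fncs_q.get()
        # execute the collected calls, then synthesize the follow-up answer
        tools_messages = [fnc.execute() for fnc in fncs]
        answer_llm_stream = llm.chat(chat_ctx.copy() + tools_messages)
        new_speech_handle = create_speech_handle(
            _synthesize_agent_speech(answer_llm_stream),
            fnc_nest_count=nest_count + 1,
        )
        speech_q.put_nowait(new_speech_handle)
        fncs_q.task_done()


def _clear_pending_fncs() -> None:
    # called on user interruption: drop every queued function call
    while not fncs_q.empty():
        fncs_q.get_nowait()
        fncs_q.task_done()
```

The worker would be started once with something like `asyncio.create_task(_fnc_worker(speech_q))`.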
@theomonnom what do you think, is there any potential risk in implementing it like that?
During a function call, sometimes it's desirable to let the user know that this is going to take a while. While it is possible to prompt the LLM to return a text response while this is happening, it would offer more control to be able to do `agent.say("....")`.

Currently, when that is done, the speech gets queued after the LLM's response. We'd want the ability to inject a response as the immediate thing that's communicated.