livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents

user_speech_committed event is never fired using RealtimeModel/MultimodalAgent #1142

Open zacharyw opened 1 day ago

zacharyw commented 1 day ago

Hello - I'm not sure if this is a bug, or just something I'm doing wrong.

I am creating a model:

from livekit.plugins import openai

model = openai.realtime.RealtimeModel(
    instructions=data['globalPrompt'],
    voice='shimmer',
    temperature=0.8,
    # max_response_output_tokens=float('inf'),
    modalities=['audio'],
    turn_detection=openai.realtime.ServerVadOptions(
        threshold=0.9, prefix_padding_ms=200, silence_duration_ms=500
    ),
)

from livekit.agents.multimodal import MultimodalAgent

agent = MultimodalAgent(model=model)
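For context, the agent is started from the worker entrypoint, roughly like this (a simplified sketch of my setup; the plumbing follows the standard JobContext pattern):

from livekit.agents import JobContext

async def entrypoint(ctx: JobContext):
    # simplified: connect to the room, wait for the user, then start the agent
    await ctx.connect()
    participant = await ctx.wait_for_participant()
    agent.start(ctx.room, participant)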

I have event handlers defined for when speech is committed:

@agent.on("user_speech_committed")
def on_user_speech_committed(msg: llm.ChatMessage):
    logger.info(f"User speech committed: {msg}")

@agent.on("agent_speech_committed")
def on_agent_speech_committed(msg: llm.ChatMessage):
    logger.info(f"Agent speech committed: {msg}")

During a conversation, the agent_speech_committed event is fired normally and the msg param contains the AI's response.

However, the user_speech_committed event is never picked up.

In addition, in the debug logs, I can see a user conversation item being created with audio, but the transcription is blank:

DEBUG livekit.plugins.openai.realtime - conversation item created {"type": "conversation.item.created", "event_id": "event_AYCyhuvHvIlGLSpkTu6MH", "previous_item_id": "item_AYCyaCV55zc87azG5Z4cz", "item": {"id": "item_AYCyhNzFR1Nobo2kKvcCW", "object": "realtime.item", "type": "message", "status": "completed", "role": "user", "content": [{"type": "input_audio", "transcript": null}]}, "pid": 1181, "job_id": "AJ_XvWaLLkWk3Hv"}

I'm not sure whether that's related to the event not firing.

longcw commented 1 day ago

The transcription is expected to be empty when the conversation item is created. The transcript is delivered in a later message from the realtime API, and the user_speech_committed event is emitted once the agent receives it.

There should be a debug log for committed user speech, for example:

2024-11-27 22:28:40,980 - DEBUG livekit.agents - committed user speech {"user_transcript": "Hello, hello.\n", "pid": 686607, "job_id": "AJ_2QWC7zGGTTk9"}

If it's not there, could you share more logs for debugging?
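If you want to see how far the pipeline gets before the transcription arrives, you can instrument the other lifecycle events as well. A minimal sketch (the event names below are the ones MultimodalAgent emits in recent releases; double-check against the version you're running):

def _log_event(name):
    # returns a handler that logs the event name and whatever args it received
    def handler(*args):
        logger.info(f"event fired: {name} args={args}")
    return handler

for name in (
    "user_started_speaking",
    "user_stopped_speaking",
    "user_speech_committed",
    "agent_speech_committed",
):
    agent.on(name, _log_event(name))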

zacharyw commented 1 day ago

Hmm, I restarted my Docker container without having changed anything, and now the event is being picked up. I'm seeing events trigger on both sides now, so sorry for the errant issue.

I will say, though, that the transcription is radically different from the actual audio the AI picked up and responded to. I imagine this is due to discrepancies between the realtime model and the Whisper model used to generate the transcript?

I'm not sure if there's anything I can do to improve that, though.
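One thing I might try is pinning the transcription settings explicitly. This is a sketch assuming the plugin exposes an InputTranscriptionOptions type and an input_audio_transcription kwarg on RealtimeModel; I'd verify both against the installed livekit-plugins-openai version:

# sketch: explicitly configure the input-audio transcription model;
# InputTranscriptionOptions / input_audio_transcription are assumptions
# to check against the installed plugin version
model = openai.realtime.RealtimeModel(
    instructions=data['globalPrompt'],
    voice='shimmer',
    temperature=0.8,
    modalities=['audio'],
    input_audio_transcription=openai.realtime.InputTranscriptionOptions(
        model='whisper-1',
    ),
    turn_detection=openai.realtime.ServerVadOptions(
        threshold=0.9, prefix_padding_ms=200, silence_duration_ms=500
    ),
)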