livekit / agents

Build real-time multimodal AI applications 🤖🎙️📹
https://docs.livekit.io/agents
Apache License 2.0

Feature: add after_llm_cb to modify llm generated text #1044


achapla commented 1 week ago

Question regarding handling special tokens in conversation transcription

First of all, thanks for making this wonderful SDK to easily create voice-enabled applications!

I'm currently building a quiz agent that asks users questions. Each response is evaluated by an LLM; if it's correct, the agent congratulates the user and appends a special 'QUESTION_END' token to its reply. This token marks that the conversation for the current question is finished, at which point I create a new chat context with the next question.

Issue

The issue I'm facing: although I strip the special 'QUESTION_END' token in the before_tts_cb function, so the agent never speaks it, the token still appears in the conversation transcript. It seems the 'QUESTION_END' text is not removed from the LLM's response before transcription.
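For context, my token-stripping callback is roughly equivalent to the following self-contained sketch (strip_token and TOKEN are my own names, not SDK APIs). The buffering matters because the LLM may emit the token split across streamed chunks:

```python
import asyncio
from typing import AsyncIterable

TOKEN = "QUESTION_END"

async def strip_token(source: AsyncIterable[str]) -> AsyncIterable[str]:
    """Remove TOKEN from a streamed text source, even when the token
    arrives split across chunk boundaries."""
    buffer = ""
    async for chunk in source:
        buffer += chunk
        buffer = buffer.replace(TOKEN, "")
        # Hold back the longest suffix that could still be the start
        # of TOKEN, so a split token is caught on the next chunk.
        safe_len = len(buffer)
        for i in range(1, len(TOKEN)):
            if buffer.endswith(TOKEN[:i]):
                safe_len = len(buffer) - i
        if safe_len > 0:
            yield buffer[:safe_len]
            buffer = buffer[safe_len:]
    if buffer:
        yield buffer
```

Wiring this into before_tts_cb cleans the audio output, but, as described below, not the transcript.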

Upon further investigation, I found that the following code runs after the LLM call. It maintains two separate variables, tts_source and transcript_source, for the two purposes:

def _synthesize_agent_speech(
    self,
    speech_id: str,
    source: str | LLMStream | AsyncIterable[str],
) -> SynthesisHandle:
    assert (
        self._agent_output is not None
    ), "agent output should be initialized when ready"

    if isinstance(source, LLMStream):
        source = _llm_stream_to_str_iterable(speech_id, source)

    og_source = source
    transcript_source = source
    if isinstance(og_source, AsyncIterable):
        # Duplicate the stream: one branch feeds TTS, the other transcription.
        og_source, transcript_source = utils.aio.itertools.tee(og_source, 2)

    # Only the TTS branch passes through before_tts_cb.
    tts_source = self._opts.before_tts_cb(self, og_source)
    if tts_source is None:
        raise ValueError("before_tts_cb must return str or AsyncIterable[str]")

    return self._agent_output.synthesize(
        speech_id=speech_id,
        tts_source=tts_source,
        transcript_source=transcript_source,
        transcription=self._opts.transcription.agent_transcription,
        transcription_speed=self._opts.transcription.agent_transcription_speed,
        sentence_tokenizer=self._opts.transcription.sentence_tokenizer,
        word_tokenizer=self._opts.transcription.word_tokenizer,
        hyphenate_word=self._opts.transcription.hyphenate_word,
    )

The tts_source no longer contains the 'QUESTION_END' token, so it is never played out, but the transcript_source still contains it, so the token ends up in the transcription when it is committed.
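The split can be reproduced with the synchronous analogue of utils.aio.itertools.tee: filtering one branch leaves the other branch untouched, which is exactly why the transcript keeps the token:

```python
import itertools

# Simulated LLM output chunks.
source = iter(["Correct! ", "QUESTION_END"])
tts_branch, transcript_branch = itertools.tee(source, 2)

# before_tts_cb filters only the TTS branch...
tts_text = "".join(c for c in tts_branch if c != "QUESTION_END")
# ...while the transcript branch still yields every original chunk.
transcript_text = "".join(transcript_branch)

print(tts_text)         # Correct!
print(transcript_text)  # Correct! QUESTION_END
```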

Request for Enhancement

Currently, I've been unable to find a way to remove the 'QUESTION_END' text from the transcription, so I've resorted to a crude hack that strips it on my frontend.

I am looking for an after_llm_cb callback (or a similar hook) that would allow observing and modifying the LLM-generated text before it is committed to the transcription.

Thank you!

davidzhao commented 1 week ago

this is a good point, we should offer a way to override before it's committed to the context.