Question regarding handling special tokens in conversation transcription
First of all, thanks for making this wonderful SDK to easily create voice-enabled applications!
I'm currently building a quiz agent that asks questions to users. The user's response is evaluated by an LLM, and if it's correct, the agent congratulates the user and appends a special 'QUESTION_END' token to its response. This token signals that the conversation for the current question is finished; I then create a new chat context with the next question.
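For context, the question-advancing logic is roughly the following (a simplified sketch; the question list and function name are illustrative, not from my actual code):

```python
QUESTION_END = "QUESTION_END"

# Illustrative question bank; the real agent loads these elsewhere.
QUESTIONS = ["What is 2 + 2?", "Name the largest planet."]

def next_question_index(agent_reply: str, current: int) -> int:
    # If the LLM emitted the end-of-question marker, advance to the
    # next question; otherwise keep the current one in play.
    if QUESTION_END in agent_reply:
        return current + 1
    return current
```

When the index advances, I rebuild the chat context around `QUESTIONS[next_index]`.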
Issue
The issue I'm facing is that although I remove the special 'QUESTION_END' token in the before_tts_cb function (so my agent never speaks it), the token still appears in the conversation text. It seems that 'QUESTION_END' is not removed from the LLM's response before transcription.
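For reference, my before_tts_cb stripping looks roughly like this (a simplified sketch; the buffering is there because the token can be split across LLM stream chunks):

```python
import asyncio
from typing import AsyncIterable

QUESTION_END = "QUESTION_END"

def _pending_prefix_len(buffer: str) -> int:
    # Length of the longest proper prefix of QUESTION_END that the
    # buffer currently ends with (it might complete in the next chunk).
    for i in range(len(QUESTION_END) - 1, 0, -1):
        if buffer.endswith(QUESTION_END[:i]):
            return i
    return 0

async def strip_question_end(source: AsyncIterable[str]) -> AsyncIterable[str]:
    buffer = ""
    async for chunk in source:
        buffer += chunk
        buffer = buffer.replace(QUESTION_END, "")
        # Hold back any suffix that could still grow into the token.
        keep = _pending_prefix_len(buffer)
        emit, buffer = buffer[: len(buffer) - keep], buffer[len(buffer) - keep :]
        if emit:
            yield emit
    if buffer:
        # A partial prefix that never completed into the full token.
        yield buffer

def before_tts_cb(assistant, text):
    # The callback receives either the full string or a streamed response.
    if isinstance(text, str):
        return text.replace(QUESTION_END, "")
    return strip_question_end(text)
```

This works for playback, but only affects the TTS branch, as described below.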
Upon further investigation, I discovered that the following code is executed after calling the LLM; it uses two different variables, tts_source and transcript_source, for the two output paths:
def _synthesize_agent_speech(
    self,
    speech_id: str,
    source: str | LLMStream | AsyncIterable[str],
) -> SynthesisHandle:
    assert (
        self._agent_output is not None
    ), "agent output should be initialized when ready"

    if isinstance(source, LLMStream):
        source = _llm_stream_to_str_iterable(speech_id, source)

    og_source = source
    transcript_source = source

    if isinstance(og_source, AsyncIterable):
        og_source, transcript_source = utils.aio.itertools.tee(og_source, 2)

    tts_source = self._opts.before_tts_cb(self, og_source)
    if tts_source is None:
        raise ValueError("before_tts_cb must return str or AsyncIterable[str]")

    return self._agent_output.synthesize(
        speech_id=speech_id,
        tts_source=tts_source,
        transcript_source=transcript_source,
        transcription=self._opts.transcription.agent_transcription,
        transcription_speed=self._opts.transcription.agent_transcription_speed,
        sentence_tokenizer=self._opts.transcription.sentence_tokenizer,
        word_tokenizer=self._opts.transcription.word_tokenizer,
        hyphenate_word=self._opts.transcription.hyphenate_word,
    )
Because before_tts_cb runs only on the og_source branch after the tee, tts_source does not contain the 'QUESTION_END' token and it is never spoken; transcript_source, however, still contains 'QUESTION_END', so the token ends up in the transcription when it is committed.
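A toy reproduction of the effect (using a buffering stand-in for utils.aio.itertools.tee, which I haven't copied from the SDK, and a simple token filter in place of my before_tts_cb):

```python
import asyncio

QUESTION_END = "QUESTION_END"

async def _tee2(source):
    # Minimal stand-in for utils.aio.itertools.tee: buffers the whole
    # stream, which is fine for a short demo.
    items = [item async for item in source]

    async def _replay():
        for item in items:
            yield item

    return _replay(), _replay()

async def _strip(source):
    # Stands in for the cleaning done by before_tts_cb.
    async for chunk in source:
        yield chunk.replace(QUESTION_END, "")

async def demo():
    async def llm_output():
        yield "Correct, well done! "
        yield QUESTION_END

    og_source, transcript_source = await _tee2(llm_output())
    tts_source = _strip(og_source)  # only this branch is cleaned
    spoken = "".join([c async for c in tts_source])
    transcribed = "".join([c async for c in transcript_source])
    return spoken, transcribed

spoken, transcribed = asyncio.run(demo())
```

Here `spoken` has the token removed, while `transcribed` still ends with 'QUESTION_END', which matches what I observe in the conversation text.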
Request for Enhancement
Currently, I've been unable to find a way to remove the 'QUESTION_END' text from the transcription, which forces me to strip it in my frontend as a crude workaround.
I am looking for an after_llm_cb function, or a similar hook, that would allow observing and modifying the LLM-generated text before it is committed to the transcription.
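The key point is where such a hook would run: if it were applied before the tee, both the TTS and transcription branches would see the cleaned text. A toy sketch (the after_llm_cb name and signature are hypothetical, not part of the SDK; _tee2 is a buffering stand-in for utils.aio.itertools.tee):

```python
import asyncio

QUESTION_END = "QUESTION_END"

async def after_llm_cb(source):
    # Hypothetical hook: observe/rewrite the LLM output stream before
    # it is split into the TTS and transcription branches.
    async for chunk in source:
        yield chunk.replace(QUESTION_END, "")

async def _tee2(source):
    # Buffering stand-in for utils.aio.itertools.tee (demo only).
    items = [item async for item in source]

    async def _replay():
        for item in items:
            yield item

    return _replay(), _replay()

async def demo():
    async def llm_output():
        yield "Well done! "
        yield QUESTION_END

    cleaned = after_llm_cb(llm_output())  # runs before the tee
    og_source, transcript_source = await _tee2(cleaned)
    spoken = "".join([c async for c in og_source])
    transcribed = "".join([c async for c in transcript_source])
    return spoken, transcribed

spoken, transcribed = asyncio.run(demo())
```

With this placement, both `spoken` and `transcribed` are free of the token, which is the behavior I'm hoping for.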
Thank you!