Our system should allow developers to specify non-textual content in the replies from the agent.
For speech, we should support sound effects, emphasis, and prosody. We might want to target Speech Markdown or SSMD, which are simplified syntaxes for SSML. Many of these modifiers are necessary to speak certain types correctly, so we should also support them natively.
Alternatively, we might want a simpler, more focused syntax, because tokenization might otherwise be challenging.
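To make the trade-off concrete, here is a minimal sketch of how a small set of speech modifiers could be lowered to SSML. The `(text)[modifier:"value"]` marker syntax is borrowed loosely from Speech Markdown, and the modifier names (`emphasis`, `rate`, `pitch`, `volume`) and tag mapping are illustrative assumptions, not a committed design:

```python
import re

# Sketch: lower a tiny, assumed subset of Speech-Markdown-style markers,
# (text)[modifier] or (text)[modifier:"value"], into SSML tags.
MARKER = re.compile(r'\((?P<text>[^)]*)\)\[(?P<mod>\w+)(?::"(?P<val>[^"]*)")?\]')

def to_ssml(reply: str) -> str:
    def render(m: re.Match) -> str:
        text, mod, val = m.group("text"), m.group("mod"), m.group("val")
        if mod == "emphasis":
            return f'<emphasis level="{val or "moderate"}">{text}</emphasis>'
        if mod in ("rate", "pitch", "volume") and val:
            return f'<prosody {mod}="{val}">{text}</prosody>'
        return text  # unknown or malformed modifier: fall back to plain text
    return "<speak>" + MARKER.sub(render, reply) + "</speak>"

print(to_ssml('Your order is (ready)[emphasis:"strong"]'))
# <speak>Your order is <emphasis level="strong">ready</emphasis></speak>
```

Even this toy version shows the tokenization concern: the marker syntax mixes freely with reply text, so a model emitting it must keep parentheses and brackets balanced.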
For GUI output, we should support pictures & RDL (link cards), as we used to do.
This issue is about the infrastructure code to support this and the design of the interface, so that developers can make use of richer interaction in #_[prompt] and #_[result] (and perhaps in #_[canonical], if we know how to strip these non-textual markers from user utterances).
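Stripping the markers for canonicalization could be as simple as replacing each one with its plain-text content. A minimal sketch, again assuming the illustrative `(text)[modifier]` syntax from above rather than any settled design:

```python
import re

# Sketch: remove assumed (text)[modifier...] markers, keeping only the
# plain text, so a canonical form of the utterance can be computed.
MARKER = re.compile(r'\((?P<text>[^)]*)\)\[[^\]]*\]')

def strip_markers(utterance: str) -> str:
    # Each marker is replaced by its inner text; everything else is untouched.
    return MARKER.sub(lambda m: m.group("text"), utterance)

print(strip_markers('Set a (timer)[emphasis] now'))
# Set a timer now
```

Note that ordinary parentheses without a trailing `[...]` are left alone, which matters if canonical user utterances may legitimately contain them.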
A separate issue will cover generating these markers.