Closed rmtuckerphx closed 2 years ago
Would `getSpeech` (and `getReprompt`) be responsible for pulling the value from `message.speech || message.text || message`?

Would `setSpeech(value: string)` (and `setReprompt`) be responsible for changing a `message` property that was a string into one that is an object with `speech` and `text` properties?
```typescript
if (typeof template.message === 'string') {
  template.message = {
    speech: template.message,
    text: this.ssmlProcessor.isPlainText(template.message)
      ? template.message
      : this.ssmlProcessor.removeSSML(template.message),
  };
}

if (!template.message?.speech && template.message?.text) {
  template.message!.speech = template.message.text;
}

if (template.message?.speech) {
  template.message.speech = value;
}
```
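Taken together, the three snippets above might compose into a single helper. Here is a minimal, self-contained sketch; the `isPlainText`/`removeSSML` stand-ins below are simplified assumptions (in the snippet above they live on `ssmlProcessor`):

```typescript
type Message = string | { speech?: string; text?: string };

// Simplified stand-ins for ssmlProcessor.isPlainText / removeSSML.
const isPlainText = (value: string): boolean => !/<[^>]+>/.test(value);
const removeSSML = (ssml: string): string =>
  ssml.replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();

// Normalize a string message into the { speech, text } shape,
// then overwrite speech with the new value.
function setSpeechOnMessage(message: Message, value: string): Message {
  if (typeof message === 'string') {
    message = {
      speech: message,
      text: isPlainText(message) ? message : removeSSML(message),
    };
  }
  if (!message.speech && message.text) {
    message.speech = message.text;
  }
  if (message.speech) {
    message.speech = value;
  }
  return message;
}
```

So `setSpeechOnMessage('<speak>Hi there</speak>', '<speak>Hello</speak>')` yields `{ speech: '<speak>Hello</speak>', text: 'Hi there' }`.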
From our Slack thread, this is what @jankoenig said:

> TTS is a bit difficult because all platforms have different structures. The TTS Plugin shouldn't be responsible for knowing the platform response structures. We're thinking about adding an abstract method to the Platform class, e.g. getSpeech and setSpeech (similar with reprompt) that can be implemented by each platform and then used by e.g. TTS Plugins to transform the speech.
The challenge with adding `getSpeech`/`setSpeech` on `Platform` or `JovoResponse` is that they deal with the entire response and not the multiple, possible children of `output` (`CoreResponse`) or `response` (`AlexaResponse`).

It seems like `getSpeech`/`setSpeech` should be on `NormalizedOutputTemplate` (`CoreResponse`) or `Response` (`AlexaResponse`), but there is no common base class.
But then you still have the issue of how to iterate through the multiple children in a common way.
What if there were an interface that all single response items would implement?
```typescript
export interface ResponseItem {
  getSpeech(): string;
  setSpeech(value: string): void;
  getReprompt(): string;
  setReprompt(value: string): void;
}
```
```typescript
// Alexa: Response
export class Response implements ResponseItem {
  // ...
  getSpeech(): string {
    let speech = '';
    const message = this.outputSpeech?.toMessage;
    if (typeof message === 'string') {
      speech = message;
    }
    if (message instanceof SpeechMessage) {
      speech = message.speech;
    }
    if (message instanceof TextMessage) {
      speech = message.text;
    }
    return speech;
  }

  setSpeech(value: string) {
    this.type = OutputSpeechType.Ssml;
    this.ssml = value;
    this.text = undefined;
  }
}
```
```typescript
// Core: NormalizedOutputTemplate
export class NormalizedOutputTemplate implements ResponseItem {
  // ...
  getSpeech(): string {
    return typeof this.message === 'string'
      ? this.message
      : this.message?.speech || this.message?.text || '';
  }

  setSpeech(value: string) {
    if (typeof this.message === 'string') {
      this.message = {
        speech: this.message,
        text: this.message,
      };
    }
    this.message!.speech = value;
    if (this.message?.text) {
      this.message.text = removeSSML(this.message.text);
    }
  }
}
```
Then to iterate through each response item (and call getSpeech/setSpeech), we could have a method on each platform-specific class that would return an array of ResponseItem types:
```typescript
// Platform
export abstract class Platform<
  REQUEST extends JovoRequest = JovoRequest,
  RESPONSE extends JovoResponse = JovoResponse,
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  JOVO extends Jovo<REQUEST, RESPONSE, JOVO, USER, DEVICE, PLATFORM> = any,
  USER extends JovoUser<JOVO> = JovoUser<JOVO>,
  DEVICE extends JovoDevice<JOVO> = JovoDevice<JOVO>,
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  PLATFORM extends Platform<REQUEST, RESPONSE, JOVO, USER, DEVICE, PLATFORM, CONFIG> = any,
  CONFIG extends PlatformConfig = PlatformConfig,
> extends Extensible<CONFIG, PlatformMiddlewares> {
  // ...
  abstract getResponseItems(response: RESPONSE): ResponseItem[];
}
```
```typescript
// Alexa: AlexaPlatform
export class AlexaPlatform extends Platform<
  AlexaRequest,
  AlexaResponse,
  Alexa,
  AlexaUser,
  AlexaDevice,
  AlexaPlatform,
  AlexaConfig
> {
  // ...
  getResponseItems(response: AlexaResponse): ResponseItem[] {
    return [response.response as unknown as ResponseItem];
  }
}
```
```typescript
// Core: CorePlatform
export class CorePlatform<PLATFORM extends string = 'core' | string> extends Platform<
  CoreRequest,
  CoreResponse,
  Core,
  CoreUser,
  CoreDevice,
  CorePlatform<PLATFORM>,
  CorePlatformConfig
> {
  // ...
  getResponseItems(response: CoreResponse): ResponseItem[] {
    const templates = this.outputTemplateConverterStrategy.fromResponse(response);
    return Object.values(templates).map((template) => template as unknown as ResponseItem);
  }
}
```
Finally, the `TtsPlugin` base class could iterate through the response items without knowing anything about the platform-specific implementation:
```typescript
// TtsPlugin (base class implemented by all TTS plugins)
export abstract class TtsPlugin<
  CONFIG extends TtsPluginConfig = TtsPluginConfig,
> extends Plugin<CONFIG> {
  // ...
  protected async tts(jovo: Jovo): Promise<void> {
    const response = jovo.$response;
    // if this plugin is not able to process TTS, skip
    if (!this.processTts || !response) {
      return;
    }
    const responseItems = jovo.$platform.getResponseItems(response);
    for (const item of responseItems) {
      const speech = item.getSpeech();
      if (speech) {
        // call the specific TTS provider
        const result = await this.processText(jovo, speech);
        if (result && result.url) {
          item.setSpeech(buildSpeakTag(buildAudioTag(result.url)));
        }
      }
    }
  }
}
```
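For illustration, here is a self-contained sketch of that `tts()` loop with an in-memory `ResponseItem` and a stand-in TTS provider; the `buildAudioTag`/`buildSpeakTag` bodies and the `processText` return shape are assumptions, not framework API:

```typescript
interface ResponseItem {
  getSpeech(): string;
  setSpeech(value: string): void;
}

const buildAudioTag = (url: string): string => `<audio src="${url}"/>`;
const buildSpeakTag = (inner: string): string => `<speak>${inner}</speak>`;

// Stand-in TTS provider: pretend we synthesized and hosted the audio.
async function processText(speech: string): Promise<{ url?: string }> {
  return { url: `https://example.com/tts/${encodeURIComponent(speech)}.mp3` };
}

// The platform-agnostic loop: read speech, synthesize, write back an audio tag.
async function applyTts(items: ResponseItem[]): Promise<void> {
  for (const item of items) {
    const speech = item.getSpeech();
    if (!speech) continue;
    const result = await processText(speech);
    if (result?.url) {
      item.setSpeech(buildSpeakTag(buildAudioTag(result.url)));
    }
  }
}
```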
@jankoenig @aswetlow Please see the above thread of me thinking through how this might be implemented. You know the platform better and have your own ideas.
I would like us to figure out the approach ASAP, so the changes can get into the framework and the base TtsPlugin can be implemented so we can start building TTS plugins.
Thank you!
Here is a branch with me trying to figure out where each of the types should go: https://github.com/rmtuckerphx/jovo-framework/tree/v4/feature/platform-tts-methods But there are errors. Need your expertise on this.
Also, I think each platform should surface which SSML tags it supports:

- Web: `audio`, `break`
- Alexa: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html

And each TTS plugin should say which SSML tags it supports:

- Polly: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html
Then there should be a way, after calling `getSpeech`, to split the string into parts based on SSML tags that are supported by the platform and those that are not. The parts that aren't supported by the platform will be passed to the TTS plugin (Polly), any unsupported tags will be removed, and the resulting web URL returned.

Note: a single string response from `getSpeech` could result in multiple calls to TTS. For example:

```
<audio src='https://example.com/audio1.mp3'/>Some text that could include SSML.<audio src='https://example.com/audio2.mp3'/> Some other SSML text.
```

This would be 2 calls to the TTS plugin (or maybe a single call with an array of text/SSML to process).

We need a way to put the string back together before calling `setSpeech`:

```
<audio src='https://example.com/audio1.mp3'/><audio src='https://example.com/tts1.mp3'/><audio src='https://example.com/audio2.mp3'/><audio src='https://example.com/tts2.mp3'/>
```
Each TTS plugin should have access to a common set of SSML-related utility functions.
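For illustration, simplified versions of a few such utilities; `removeSSML`, `isPlainText`, `buildAudioTag`, and `buildSpeakTag` are all referenced in the snippets earlier in this issue, but these bodies are assumptions, not the framework's implementations:

```typescript
// Strip all SSML/XML tags, keeping only the spoken text.
function removeSSML(ssml: string): string {
  return ssml.replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
}

// True if the string contains no SSML/XML tags at all.
function isPlainText(value: string): boolean {
  return !/<[^>]+>/.test(value);
}

// Wrap a URL in an <audio> tag.
function buildAudioTag(url: string): string {
  return `<audio src="${url}"/>`;
}

// Wrap content in a <speak> root tag.
function buildSpeakTag(inner: string): string {
  return `<speak>${inner}</speak>`;
}
```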
Also, something to consider when removing unsupported tags: we may still need parts of the unsupported tag instead of removing it entirely. For example, the `say-as` tag:

```xml
<speak>
  I was born on <say-as interpret-as="date" format="mdy">12-31-1900</say-as>.
</speak>
```

Maybe the platform and the TTS plugin don't support `say-as`, so when we call `removeSSML` to strip unsupported tags, the date needs to be preserved in the string:

```
I was born on 12-31-1900.
```
Maybe the TTS plugin (e.g. Polly) or the `TtsPlugin` base class could handle the processing of some SSML tags (such as `say-as`) even if the TTS API doesn't support them, like an SSML pre-processor.

The `sub` tag is also a good candidate that can be handled in code:

```xml
<sub alias="new word">abbreviation</sub>
```
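A regex-based sketch of such a pre-processor, resolving `say-as` and `sub` in code; illustrative only, since a real implementation would want a proper SSML parser:

```typescript
// Resolve tags the TTS API may not support before calling it:
// <sub> is replaced by its alias attribute, and <say-as> is unwrapped
// so the inner text (e.g. the date) is preserved.
function preprocessSSML(ssml: string): string {
  return ssml
    .replace(/<sub\b[^>]*alias="([^"]*)"[^>]*>[\s\S]*?<\/sub>/g, '$1')
    .replace(/<say-as\b[^>]*>([\s\S]*?)<\/say-as>/g, '$1');
}
```

With the example above, `preprocessSSML('I was born on <say-as interpret-as="date" format="mdy">12-31-1900</say-as>.')` preserves the date: `'I was born on 12-31-1900.'`.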
> Would `getSpeech` (and `getReprompt`) be responsible for pulling the value from `message.speech || message.text || message`?
>
> Would `setSpeech(value: string)` (and `setReprompt`) be responsible for changing a `message` property that was a string into one that is an object with `speech` and `text` properties?

`getSpeech` and `setSpeech` would be `Response` methods; they wouldn't read from an `OutputTemplate`, but rather the `$response`.
> The challenge with adding getSpeech/setSpeech on Platform or JovoResponse is that they deal with the entire response and not the multiple, possible children of output (CoreResponse) or response (AlexaResponse).

I'm not sure about this one. If I understand it correctly, a TTS plugin wouldn't want to use multiple API calls for multiple output children. Rather, I'd want to have the final speech of the response JSON and then call a TTS API for that one.

EDIT: I see what you mean now. `CorePlatform` and `WebPlatform` use the output template structure for the `output` part of the response. I'll think a bit more about this.
> It seems like getSpeech/setSpeech should be on NormalizedOutputTemplate (CoreResponse) or Response (AlexaResponse) but there is no common base class.

Here is the base `JovoResponse` class: https://github.com/jovotech/jovo-framework/blob/v4/latest/output/src/models/JovoResponse.ts

And here's `AlexaResponse`, for example: https://github.com/jovotech/jovo-framework/blob/v4/latest/platforms/platform-alexa/src/AlexaResponse.ts

I'll have to talk this through with @aswetlow tomorrow, but I'd suggest adding abstract methods to `JovoResponse` and then having all platforms implement them.

The question is what we should do if a platform doesn't support speech.
This was released today: https://github.com/jovotech/jovo-framework/releases/tag/2022-07-28-patch
I'm submitting a...

#### Expected Behavior

Be able to call `getSpeech`/`setSpeech` and `getReprompt`/`setReprompt` in a way that is consistent across platforms. These methods would access the underlying values for `CoreResponse`, `AlexaResponse`, etc.

#### Current Behavior

Currently, you need to code each response platform separately:
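For illustration, a sketch of the kind of per-platform branching this currently requires; the response shapes below are simplified stand-ins, not the actual framework types:

```typescript
// Simplified stand-ins for the platform response shapes.
interface AlexaLikeResponse {
  response: { outputSpeech?: { ssml?: string } };
}
interface CoreLikeResponse {
  output: { message?: string | { speech?: string; text?: string } }[];
}

// Without getSpeech/setSpeech, every caller has to know each shape.
function readSpeech(
  platform: 'alexa' | 'core',
  response: AlexaLikeResponse | CoreLikeResponse,
): string {
  if (platform === 'alexa') {
    return (response as AlexaLikeResponse).response.outputSpeech?.ssml ?? '';
  }
  const message = (response as CoreLikeResponse).output[0]?.message;
  return typeof message === 'string' ? message : message?.speech ?? message?.text ?? '';
}
```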
#### Error Log

N/A

#### Your Environment

- **@jovotech/cli**: 4.1.6
- **Jovo packages of the current project**:
- **Environment**:
  - System: OS: Windows 10 10.0.22000
  - Binaries: Node: 14.19.0 - C:\Program Files\nodejs\node.EXE; npm: 8.10.0 - C:\Program Files\nodejs\npm.CMD