[New Sample] Request for a speech-focused sample

stevengum commented 4 years ago

Is your feature request related to a problem? Please describe.

We currently enable speech out of the box for the C# Echo Bot and Core Bot samples and generators. These samples are our "getting-started" samples and don't delve into the nuances of the protocol with speech.

Now that Direct Line Speech is GA, we should have a speech-focused sample.

Describe the solution you'd like

A sample focused on speech (which may be through a headless device) should be created.

Features of the sample:

Discussion on setting of InputHints
Dialog design that doesn't rely on suggested actions, lists or cards
Examples and links of using SSML instead of just plain text for the Activity.Speak property.
- 13.core-bot example where a plain string is used for Activity.Speak
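
As a rough illustration of the SSML point above, here is a minimal sketch in Python (chosen only so the snippet is self-contained; the same string can be assigned to Activity.Speak in C#). The helper name `to_ssml` is hypothetical, and the voice name `en-US-JennyNeural` is an assumption, so substitute whichever voice your Speech resource offers.

```python
from xml.sax.saxutils import escape


def to_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document (hypothetical helper).

    The voice name is an assumption; use any voice available to you.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        # escape() protects characters such as & and < that would break the XML
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


ssml = to_ssml("Flights & hotels are booked.")
```

The resulting string would go in the Speak property, while Text can stay as the plain, unescaped sentence.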

FYI @ryanlengel, @darrenj, @lauren-mills, @gabog who have experience in designing headless device solutions. Are there any "gotchas" that should be discussed in a speech-focused sample?

[enhancement]

gabog commented 4 years ago

Hi @stevengum, here are some notes from my end:

Not setting the InputHints properly can cause conversations to stop or clients like Cortana to crash. This is often overlooked because devs test on the Emulator.
The speech track (Speak property) doesn't need to match the Text property, and in some cases it will be very different based on the channel. A channel with a screen could say "here are your appointments for today", while a channel without one would probably read the appointments out loud.
Pluralization, in some cases we can get away with some formatting in text, but with speech we need to put extra thought into how we say "for one person" or "for two people".
Making dates more natural, in text we can present something like "for 11/15/2019 at 5 PM" which sounds OK but is not very natural in speech, sometimes it is better to have logic to parse dates and times and say something like "for tomorrow at 5 PM" or "next Monday at 5 in the afternoon", etc.
QnAMaker responses, some QnAMaker responses have a lot of text. This sounds horrible in headless bots because in many cases we don't support barge-in, and you need to wait until the bot finishes talking before you can ask something else (I would say that QnAMaker is not speech-friendly in general).
Enumerations of suggested actions, in some cases we created logic to read suggested actions out loud in the form of "You can say X, Y or Z" or "You can say X, Y and Z". The default would be to read the list without "and" or "or", which sounds very weird.
Not sure if this changed in Web Chat lately, but Adaptive Cards have a Speak property that is not used by Web Chat and may be confusing to some devs.
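
The pluralization, date, and enumeration notes above could be sketched roughly as follows. This is an illustrative Python sketch, not code from any existing sample; the helper names (`spoken_list`, `party_size`, `spoken_day`) are hypothetical and the wording rules are deliberately simple and English-only.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def spoken_list(options, conjunction="or"):
    """Join options the way a person would say them: 'X, Y or Z'."""
    options = list(options)
    if not options:
        return ""
    if len(options) == 1:
        return options[0]
    return f"{', '.join(options[:-1])} {conjunction} {options[-1]}"


def party_size(count):
    """Pluralize explicitly: 'for one person' versus 'for two people'."""
    words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}
    noun = "person" if count == 1 else "people"
    return f"for {words.get(count, str(count))} {noun}"


def spoken_day(target, today):
    """Prefer 'tomorrow' or a weekday name over a numeric date when close."""
    delta = (target - today).days
    if delta == 0:
        return "today"
    if delta == 1:
        return "tomorrow"
    if 2 <= delta <= 6:
        return f"on {DAYS[target.weekday()]}"
    return f"on {MONTHS[target.month - 1]} {target.day}"


prompt = "You can say " + spoken_list(["X", "Y", "Z"]) + "."
```

Here `prompt` is the kind of sentence that would go in Speak, while the Text side can keep relying on the suggested actions themselves.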

On the understanding side.

Speech normalization may add extra "." at the end of an utterance that would confuse LUIS.
Soundex, sometimes the suggested actions may want you to choose the name of a person or the name of a place that is not trained in LUIS (like a restaurant name). Speech will not always recognize the exact string you are looking for, and you will need to match the utterance against the suggested actions list using Soundex or some other phonetic algorithm rather than straight string matching.
Utterances can be long and vague, so LUIS is super important when using speech input.
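
A minimal Python sketch of the first two points, assuming standard American Soundex and a simple word-by-word comparison. The helper names (`normalize`, `phonetic_key`, `match_option`) are hypothetical, and a production bot may well want a different phonetic or fuzzy matcher.

```python
import re

# Standard American Soundex digit groups.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, chars in _GROUPS.items() for ch in chars}


def soundex(word):
    """Return the 4-character Soundex code for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so repeated codes after them count
    return (encoded + "000")[:4]


def normalize(utterance):
    """Drop trailing punctuation that speech normalization may append."""
    return utterance.strip().rstrip(".?!").strip()


def phonetic_key(phrase):
    """Soundex code for each word in the phrase."""
    return tuple(soundex(w) for w in re.findall(r"[A-Za-z]+", phrase))


def match_option(utterance, options):
    """Try an exact (case-insensitive) match first, then a phonetic one."""
    heard = normalize(utterance)
    for option in options:
        if option.lower() == heard.lower():
            return option
    key = phonetic_key(heard)
    for option in options:
        if phonetic_key(option) == key:
            return option
    return None


choice = match_option("Jon Smyth.", ["Jane Doe", "John Smith"])
```

Running `normalize` before sending the utterance to LUIS would also cover the trailing "." case.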

This is all I can think of so far. Will update this post if I can think of anything else.

darrenj commented 4 years ago

Gabo has most things covered.

Decorating any Speak property with SSML enables you to control the voice and even the tone of voice. For example
Sample will need to have the steps required to enable websockets (for direct line speech) on the app service. We do this automatically as part of VA and the ARM template.
There is a test harness for speech, you can see this and some other instructions here
I think the sample should use Language Generation albeit in preview form as this will allow us to show how to provide speech and text friendly responses. e.g.

# NewUserIntroCard
[Activity
    Text= Some text
    Speak=Speech friendly response
]

stevengum commented 4 years ago

Sample will need to have the steps required to enable websockets (for direct line speech) on the app service. We do this automatically as part of VA and the ARM template.

The C# Core Bot and Echo Bot ARM templates were updated to support WebSockets, and Startup.cs was updated to include the necessary app.UseWebSockets(); call. So we should be set here with regard to enabling WebSocket usage from the bot and on the App Service; we just need to mirror this work in the speech-first sample.

There is work to be done on the Resource Provider to enable creation of the DLS channel via ARM templates and Azure CLI, which I believe @DDEfromOR is working on.

There is a test harness for speech, you can see this and some other instructions here

We do need to update the Core and Echo bot READMEs to mention the test DLS client and the Speech SDKs.

The speech track (Speak property) doesn't need to match the Text property, and in some cases it will be very different based on the channel. A channel with a screen could say "here are your appointments for today", while a channel without one would probably read the appointments out loud.

For DLS, the current behavior is that the Speak property must be set; the channel does not use the Activity's Text property for speech generation.
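
To make that concrete, here is a hedged sketch of a message activity payload that always carries both Speak and an InputHint. It is plain Python building the JSON shape of an activity rather than SDK code; the helper name `spoken_message` is hypothetical, while the field names and the three hint values follow the Bot Framework Activity schema.

```python
import json

VALID_INPUT_HINTS = ("acceptingInput", "expectingInput", "ignoringInput")


def spoken_message(text, speak, input_hint):
    """Build a message activity dict with Speak and InputHint always set."""
    if input_hint not in VALID_INPUT_HINTS:
        raise ValueError(f"unknown input hint: {input_hint!r}")
    if not speak:
        # Per the comment above, DLS does not fall back to Text for speech.
        raise ValueError("speak must be set for a speech channel")
    return {
        "type": "message",
        "text": text,
        "speak": speak,
        "inputHint": input_hint,
    }


activity = spoken_message(
    text="Where would you like to travel to?",
    speak="Where would you like to travel to?",
    input_hint="expectingInput",  # a prompt: the client should keep listening
)
payload = json.dumps(activity)
```

Failing early when Speak or the hint is missing is one way to catch the omissions that go unnoticed when testing only on the Emulator.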

Enumerations of suggested actions, in some cases we created logic to read suggested actions out loud in the form of "You can say X, Y or Z" or "You can say X, Y and Z". The default would be to read the list without "and" or "or", which sounds very weird.

For non-headless devices/UIs (headful? heady?) it is important to preserve any use of GUI as applicable. However, if possible I think that building for one channel (DLS, Web Chat with Speech, or Cortana) and then generalizing is the better approach. We've seen this approach with MS Teams which has a lot more

Not sure if this changed in Web Chat lately, but Adaptive Cards have a Speak property that is not used by Web Chat and may be confusing to some devs.

@compulim?

ryanisgrig commented 4 years ago

For reference we have a tutorial on enabling DLS with the VA at https://microsoft.github.io/botframework-solutions/clients-and-channels/tutorials/enable-speech/1-intro/

Most of the steps cover turning on the resources the bot needs to work, but there is also some guidance on how to change the voice with SSML.

johnataylor commented 4 years ago

We agreed to postpone major new samples until after we target dotnet 3.1

cleemullins commented 4 years ago

@johnataylor What are we doing with this? Can Monica, Michael, Ashely, or Eric drive this one?

microsoft / BotBuilder-Samples

[New Sample] Request for a speech-focused sample #1981