w3c / strategy


Proposal: Workshop on User-friendly Smart Agents on the Web - Voice interaction and what's next? #221

Open ashimura opened 4 years ago

ashimura commented 4 years ago

General background

Focus

Possible topics

Examples of related use cases

The related technology area is broad, including:

For example, hybrid TV services (Web+TV integration based on HTML5, Second Screen, TTML, etc.) and smart home devices and services, possibly combined with additional proprietary technologies such as MiniApps.

Another example is searching for television content. The user asks "Play [name of an episodic TV show]" and the voice assistant speaks "Here's what I found" while displaying search results on the TV. A useful user requirement may be the ability to request congruent user feedback (i.e., if voice is used for input, then speech is used for feedback).

**NOTE:** The above are just a few examples of possible use cases; your own use cases are welcome and really appreciated :)

Who should attend?

r12a commented 4 years ago

One of the major obstacles to voice interaction becoming a worldwide technology is the need for language resources such as speech synthesis and recognition tools that can work with all languages.

Development of the latter, in turn, depends on the availability of training data (derived from phonetic databases, parallel corpora, etc.) and on natural language processing technologies.

It would be good to dedicate a part of the workshop to consideration of the risks and opportunities related to the availability of such language technologies and data required to support deployment of voice interaction around the world. In particular, the workshop should aim to highlight those areas where such things are lacking, and estimate the outlook for closing those gaps.

wseltzer commented 4 years ago

Updates/replaces https://github.com/w3c/strategy/issues/71

bkardell commented 4 years ago

Is there a typo in:

> On the other hand, during one of the breakout sessions at TPAC 2019 in Fukuoka about need for improved voice for web services.

I'm not sure what this means. I was at this breakout session and it seemed kind of all over the place.

I tried not to be too disruptive in here and to see what was going on, but as I said then: it would be really good to address the stuff we already have that has actually been shipping in all browsers for years but isn't even on the REC track: create higher interop, fix problems with the API, find ways forward, and properly document the real things. I expect this rather basic move would actually refocus many of these specific conversations.

Voices are broken and useless in the current model, and a security problem too. There are fairly straightforward kinds of solutions to that, like the ones being used in CSS with fonts, for example. The docs talk about VoiceXML or SSML, but these aren't supported, and what actually happens if you try to use them is problematic, etc.

So, if we could begin to address this, it would be great, because there are sites and apps that use it even in its current form. Regardless of the procedural status of how we got here, there is a de facto standard here, and it seems bad not to resolve it, or to imply that it must not be resolved and can be removed.
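For readers who haven't worked with it, here is a minimal sketch, assuming the in-browser speech synthesis interface that appears to be under discussion (the Web Speech API's `speechSynthesis`); the language tag and fallback behaviour are illustrative choices, not anything specified:

```js
// Minimal sketch of voice selection as pages do it today. The voice list returned
// by getVoices() is platform- and browser-dependent and may be empty until the
// 'voiceschanged' event fires, which is one reason voice selection is fragile
// compared with, say, how CSS handles fonts.
function speakWithPreferredVoice(text, preferredLang = 'en-US') {
  const utterance = new SpeechSynthesisUtterance(text);
  const voices = window.speechSynthesis.getVoices();
  // Fall back to the platform default voice if nothing matches the language.
  utterance.voice = voices.find(v => v.lang === preferredLang) || null;
  utterance.lang = preferredLang;
  window.speechSynthesis.speak(utterance);
}
```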

himorin commented 4 years ago

No further comment from i18n WG.

brewerj commented 4 years ago

Noting the accessibility points mentioned under "focus" above,

> - Voice-enabled smart platform for integrating multi-vendor services/applications/devices
> - Interaction with smart devices
> - Control from Web browsers
>   - Smart navigation for accessibility/usability

I suggest that accessibility issues also include interoperability and access to controls, not just navigation. Also, we (@michael-n-cooper and I) recommend that @KimPatch's opinion be requested on this. Accessibility Horizontal Review may have further comments as your Workshop plans evolve. Thanks, Kaz!

samuelweiler commented 3 years ago

What is the architectural vision for these components? I ask, noting that the first paragraph of the intro above cites Siri and Alexa, both of which I think of as not-web. I want to understand the architecture being proposed and how it's "web".

KimPatch commented 3 years ago

A couple of thoughts: I think this is an important and timely subject, and it’s a good idea to have a workshop to explore the many facets of navigation and control via speech.

If you knew you wanted pizza, you'd go right to something like "takeaway pizza" rather than going through unnecessary steps, unless you're just testing out a system for the first time or the interface affords you no choice, in which case it would become frustrating after a day or two. You'd also want a fairly short command for something like this.

However, having a short command like "checkout" for something that's difficult and/or annoying to reverse is impractical. The words themselves are also potentially ambiguous because they have other, commonly used meanings: "check out this television show". I think the best way to handle actions that are difficult to undo is (1) mixed input by default, and (2) a cognitively easy but robust speech-only option, such as a two-part command with an answer prompt in the middle, e.g. User: "ready to pay"; System: unique pay sound, "confirm pay"; User: "confirm pay".
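Purely as an illustration of the two-part pattern described above (not a proposal), here is a sketch using the browser speech interfaces; the phrases, the audio cue file and the `performPayment` callback are placeholders:

```js
// Sketch of a two-part spoken confirmation for a hard-to-undo action.
// The phrases and the "unique pay sound" are placeholders; a real design would
// let users change them rather than hard-coding the vocabulary.
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;

function listenOnce(onPhrase) {
  const rec = new Recognition();
  rec.lang = 'en-US';
  rec.onresult = event =>
    onPhrase(event.results[0][0].transcript.trim().toLowerCase());
  rec.start();
}

function speak(text) {
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}

function confirmPayment(performPayment) {
  listenOnce(first => {
    if (first !== 'ready to pay') return;   // ignore anything else
    new Audio('pay-cue.mp3').play();        // hypothetical distinctive audio cue
    speak('Confirm pay?');
    listenOnce(second => {
      if (second === 'confirm pay') performPayment();
      else speak('Payment cancelled.');
    });
  });
}
```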

RealJoshue108 commented 3 years ago

Thanks Kaz @ashimura! I will also flag this to the Accessible Platform Architectures group and the Research Questions task force, and see what other input we have.

JWJPaton commented 3 years ago

I notice hybrid TV services are mentioned. On the topic of accessibility, there is a tendency for smart TVs to default to providing information to the user via the screen. Whilst this may be the preferred route for a lot of people, users with sight loss who choose voice interaction for its accessibility then lose out, since they can't access the feedback. A use case is searching for television content: the user asks "Play [name of an episodic TV show]" and the voice assistant speaks "Here's what I found" while displaying search results on the TV. A useful user requirement may be the ability to request congruent user feedback (i.e., if voice is used for input, then speech is used for feedback).
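As a sketch only, the congruent-feedback requirement might look something like this in page script, assuming the page knows which modality triggered the request (`searchEpisodes` and `renderResultsOnScreen` are hypothetical application functions, not standard APIs):

```js
// Sketch of "congruent user feedback": if the request came in by voice, speak a
// summary of the results as well as rendering them on screen.
async function handleSearchRequest(query, inputModality) {
  const results = await searchEpisodes(query);   // hypothetical app function
  renderResultsOnScreen(results);                // hypothetical app function
  if (inputModality === 'voice') {
    const summary = `Here's what I found: ${results.length} results for ${query}.`;
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(summary));
  }
}
```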

michael-n-cooper commented 3 years ago

APA thinks its work on Pronunciation is critical to this. The group would like Pronunciation to be listed as a topic for this workshop, and will certainly be interested in participating.

michael-n-cooper commented 3 years ago

APA prepared a video on pronunciation for TPAC 2020: https://www.w3.org/2020/10/TPAC/apa-pronunciation.html.

michael-n-cooper commented 3 years ago

APA review complete; over to @brewerj to complete accessibility horizontal review.

ashimura commented 3 years ago

@r12a @wseltzer @bkardell @himorin @brewerj @samuelweiler @KimPatch @RealJoshue108 @JWJPaton and @michael-n-cooper thank you very much for your inputs! And very sorry I've been swamped and couldn't respond earlier.

I've updated the Workshop proposal description at the top based on all your comments.

Also as you (might) know, I'll organize a breakout session on the expected Voice Agents Workshop during TPAC 2020. So please join the session if possible.

During the breakout session, I'd like to identify the people who are interested in the workshop as (1) the Program Committee for the workshop and (2) the speakers/participants in the workshop. It would also be great to get further insights on the potential agenda for the workshop.

philarcher commented 3 years ago

Leaving a comment here so I'm sure to be notified of future conversations on this topic. Put me down for the PC.

KimPatch commented 3 years ago

Here are a couple more comments – a couple of things to keep in mind:

1. What we say and hear are related, and that relationship also has a lot to do with ease of use. We are mimics; we will automatically use what we've heard. Making sure development attends to this automatic way of learning will lower the technology's cognitive load.

   I think the way to deal with this is to:

   - Make sure to coordinate speech in and speech out
   - And, like I mentioned above, make sure that vocabularies are not set in stone, but are good defaults that can be changed, saved and shared by users, similar to the way human-to-human language develops (see the sketch after this list).

2. This is kind of a cautionary tale. When you have devices listening all the time and you have devices that can talk, devices can tangle. This can be funny at first glance, but having to watch speech recognition like a hawk to make sure it doesn't accidentally say something wrong when you're speaking is draining. It's even worse when it's something you have to pay attention to all the time. Users need good strategies to limit the chances of catastrophe that are more effective and doable than having these devices be yet another thing that takes extra attention.
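A minimal sketch of the "good defaults that can be changed, saved and shared" idea from point 1 above; the storage key, phrases and action names are placeholders, not anything proposed:

```js
// Sketch: the command vocabulary is user-editable data with sensible defaults,
// rather than phrases hard-coded into the page or assistant.
const DEFAULT_VOCABULARY = {
  'ready to pay': 'beginCheckout',
  'confirm pay': 'confirmCheckout',
  'takeaway pizza': 'orderPizza',
};

function loadVocabulary() {
  // User overrides (changed, saved, shareable) are merged over the defaults.
  const saved = JSON.parse(localStorage.getItem('voice-vocabulary') || '{}');
  return { ...DEFAULT_VOCABULARY, ...saved };
}

function saveVocabulary(overrides) {
  localStorage.setItem('voice-vocabulary', JSON.stringify(overrides));
}

// Example: a user swaps in a phrase they find easier to say.
saveVocabulary({ 'start the bill': 'beginCheckout' });
```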

bkardell commented 2 years ago

Happy to do what I can here, feel free to reach out to me on the workshop

ashimura commented 2 years ago

Thanks a lot @bkardell !!!

mhakkinen commented 2 years ago

@ashimura, echoing @bkardell, please feel free to reach out to me on the workshop. I will also reach out to contacts in the emergency communications space to see if there is any interest in spoken interactions.

ashimura commented 2 years ago

Thank you @mhakkinen !!!

bevcorwin commented 2 years ago

Happy to help, looking forward to updates.

plehegar commented 2 weeks ago

@ashimura, breakout for TPAC 2024?