w3c / strategy


Proposal: Workshop on User-friendly Smart Agents on the Web - Voice interaction and what's next? #221

Open ashimura opened 4 years ago

ashimura commented 4 years ago

General background

Focus

Possible topics

Examples of related use cases

The related technology area is broad, including:

For example, hybrid TV services (Web+TV integration based on HTML5, Second Screen, TTML, etc.) and smart home devices and services, possibly combined with additional proprietary technologies such as MiniApps.

Another example is searching for television content. The user asks "Play [name of an episodic TV show]" and the voice assistant speaks "Here's what I found" while displaying search results on the TV. A useful user requirement may be the ability to request congruent user feedback (i.e., if voice is used for input, then speech is used for feedback).

**NOTE:** The above are just a few examples of possible use cases; your own use cases are welcome and really appreciated :)

Who should attend?

r12a commented 4 years ago

One of the major obstacles to voice interaction becoming a worldwide technology is the need for language resources such as speech synthesis and recognition tools that can work with all languages.

Development of the latter, in turn, depends on the availability of training data (derived from phonetic databases, parallel corpora, etc.) and on natural language processing technologies.

It would be good to dedicate a part of the workshop to consideration of the risks and opportunities related to the availability of such language technologies and data required to support deployment of voice interaction around the world. In particular, the workshop should aim to highlight those areas where such things are lacking, and estimate the outlook for closing those gaps.

wseltzer commented 4 years ago

Updates/replaces https://github.com/w3c/strategy/issues/71

bkardell commented 4 years ago

Is there a typo in:

> On the other hand, during one of the breakout sessions at TPAC 2019 in Fukuoka about need for improved voice for web services.

I'm not sure what this means. I was at this breakout session and it seemed kind of all over the place.

I tried not to be too disruptive in here and to see what was going on, but as I said then: it would be really good to address the stuff we already have that has actually been shipping in all browsers for years but isn't even on the REC track: create higher interop, fix problems with the API, find ways forward, and properly document the real things. I expect this rather basic move would actually refocus many of these specific conversations.

Voices are broken and useless in the current model, and a security problem too. There are fairly straightforward kinds of solutions to that, like the ones being used in CSS with fonts, for example. The docs talk about VoiceXML or SSML, but these aren't supported, and what actually happens if you try to use them is problematic, etc.

So, if we could begin to address this, it would be great, because there are sites and apps that use it even in its current form. Regardless of the procedural status of how we got here, there is a de facto standard here, and it seems bad not to resolve it, or to imply that it must not be resolved and can be removed.
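For readers who haven't worked with it, here is a minimal sketch, assuming the in-browser speech synthesis interface that appears to be under discussion (the Web Speech API's `speechSynthesis`); the language tag and fallback behaviour are illustrative choices, not anything specified:

```js
// Minimal sketch of voice selection as pages do it today. The voice list returned
// by getVoices() is platform- and browser-dependent and may be empty until the
// 'voiceschanged' event fires, which is one reason voice selection is fragile
// compared with, say, how CSS handles fonts.
function speakWithPreferredVoice(text, preferredLang = 'en-US') {
  const utterance = new SpeechSynthesisUtterance(text);
  const voices = window.speechSynthesis.getVoices();
  // Fall back to the platform default voice if nothing matches the language.
  utterance.voice = voices.find(v => v.lang === preferredLang) || null;
  utterance.lang = preferredLang;
  window.speechSynthesis.speak(utterance);
}
```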

himorin commented 4 years ago

No further comment from i18n WG.

brewerj commented 4 years ago

Noting the accessibility points mentioned under "focus" above,

> - Voice-enabled smart platform for integrating multi-vendor services/applications/devices
> - Interaction with smart devices
> - Control from Web browsers
>   - Smart navigation for accessibility/usability

I suggest that accessibility issues also include interoperability and access to controls, not just navigation. Also, we (@michael-n-cooper and I) recommend that @KimPatch's opinion be requested on this. Accessibility Horizontal Review may have further comments as your Workshop plans evolve. Thanks, Kaz!

samuelweiler commented 3 years ago

What is the architectural vision for these components? I ask, noting that the first paragraph of the intro above cites Siri and Alexa, both of which I think of as not-web. I want to understand the architecture being proposed and how it's "web".

KimPatch commented 3 years ago

A couple of thoughts: I think this is an important and timely subject, and it’s a good idea to have a workshop to explore the many facets of navigation and control via speech.

If you knew you wanted pizza, you'd go right to something like "takeaway pizza" rather than going through unnecessary steps, unless you're just testing out a system for the first time or the interface affords you no choice, in which case it would become frustrating after a day or two. You'd also want a fairly short command for something like this.

However, having a short command like "checkout" for something that's difficult and/or annoying to reverse is impractical. The words themselves are also potentially ambiguous because they have other, commonly used meanings: "check out this television show". I think the best way to handle actions that are difficult to undo is (1) mixed input by default, and (2) a cognitively easy but robust speech-only option, such as a two-part command with an answer prompt in the middle, e.g. User: "ready to pay"; System: unique pay sound, "confirm pay"; User: "confirm pay".
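Purely as an illustration of the two-part pattern described above (not a proposal), here is a sketch using the browser speech interfaces; the phrases, the audio cue file and the `performPayment` callback are placeholders:

```js
// Sketch of a two-part spoken confirmation for a hard-to-undo action.
// The phrases and the "unique pay sound" are placeholders; a real design would
// let users change them rather than hard-coding the vocabulary.
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;

function listenOnce(onPhrase) {
  const rec = new Recognition();
  rec.lang = 'en-US';
  rec.onresult = event =>
    onPhrase(event.results[0][0].transcript.trim().toLowerCase());
  rec.start();
}

function speak(text) {
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}

function confirmPayment(performPayment) {
  listenOnce(first => {
    if (first !== 'ready to pay') return;   // ignore anything else
    new Audio('pay-cue.mp3').play();        // hypothetical distinctive audio cue
    speak('Confirm pay?');
    listenOnce(second => {
      if (second === 'confirm pay') performPayment();
      else speak('Payment cancelled.');
    });
  });
}
```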

RealJoshue108 commented 3 years ago

Thanks Kaz @ashimura! I will also flag this to the Accessible Platform Architectures group and the Research Questions task force, and see what other input we have.

JWJPaton commented 3 years ago

I notice hybrid TV services are mentioned. On the topic of accessibility, there is a tendency for smart TVs to default to providing information to the user via the screen. Whilst this may be the preferred route for a lot of people, users with sight loss who choose voice interaction for its accessibility then lose out, since they can't access the feedback. A use case is searching for television content: the user asks "Play [name of an episodic TV show]" and the voice assistant speaks "Here's what I found" while displaying search results on the TV. A useful user requirement may be the ability to request congruent user feedback (i.e., if voice is used for input, then speech is used for feedback).
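As a sketch only, the congruent-feedback requirement might look something like this in page script, assuming the page knows which modality triggered the request (`searchEpisodes` and `renderResultsOnScreen` are hypothetical application functions, not standard APIs):

```js
// Sketch of "congruent user feedback": if the request came in by voice, speak a
// summary of the results as well as rendering them on screen.
async function handleSearchRequest(query, inputModality) {
  const results = await searchEpisodes(query);   // hypothetical app function
  renderResultsOnScreen(results);                // hypothetical app function
  if (inputModality === 'voice') {
    const summary = `Here's what I found: ${results.length} results for ${query}.`;
    window.speechSynthesis.speak(new SpeechSynthesisUtterance(summary));
  }
}
```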

michael-n-cooper commented 3 years ago

APA thinks its work on Pronunciation is critical to this. The group would like Pronunciation to be listed as a topic for this workshop, and will certainly be interested in participating.

michael-n-cooper commented 3 years ago

APA prepared a video on pronunciation for TPAC 2020: https://www.w3.org/2020/10/TPAC/apa-pronunciation.html.

michael-n-cooper commented 3 years ago

APA review complete; over to @brewerj to complete accessibility horizontal review.

ashimura commented 3 years ago

@r12a @wseltzer @bkardell @himorin @brewerj @samuelweiler @KimPatch @RealJoshue108 @JWJPaton and @michael-n-cooper thank you very much for your inputs! And very sorry I've been swamped and couldn't respond earlier.

I've updated the Workshop proposal description at the top based on all your comments.

Also as you (might) know, I'll organize a breakout session on the expected Voice Agents Workshop during TPAC 2020. So please join the session if possible.

During the breakout session, I'd like to identify the people who are interested in the workshop as (1) the Program Committee for the workshop and (2) the speakers/participants in the workshop. It would also be great to get further insights on the potential agenda for the workshop.

philarcher commented 3 years ago

Leaving a comment here so I'm sure to be notified of future conversations on this topic. Put me down for the PC.

KimPatch commented 3 years ago

Here are a couple more comments – a couple of things to keep in mind:

1. What we say and hear are related, and that relationship also has a lot to do with ease of use. We are mimics; we will automatically use what we've heard. Making sure development attends to this automatic way of learning will lower the technology's cognitive load.

   I think the way to deal with this is to:

   - Make sure to coordinate speech in and speech out
   - And, like I mentioned above, make sure that vocabularies are not set in stone, but are good defaults that can be changed, saved and shared by users, similar to the way human-to-human language develops (see the sketch after this list).

2. This is kind of a cautionary tale. When you have devices listening all the time and you have devices that can talk, devices can tangle. This can be funny at first glance, but having to watch speech recognition like a hawk to make sure it doesn't accidentally say something wrong when you're speaking is draining. It's even worse when it's something you have to pay attention to all the time. Users need good strategies to limit the chances of catastrophe that are more effective and doable than having these devices be yet another thing that takes extra attention.
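A minimal sketch of the "good defaults that can be changed, saved and shared" idea from point 1 above; the storage key, phrases and action names are placeholders, not anything proposed:

```js
// Sketch: the command vocabulary is user-editable data with sensible defaults,
// rather than phrases hard-coded into the page or assistant.
const DEFAULT_VOCABULARY = {
  'ready to pay': 'beginCheckout',
  'confirm pay': 'confirmCheckout',
  'takeaway pizza': 'orderPizza',
};

function loadVocabulary() {
  // User overrides (changed, saved, shareable) are merged over the defaults.
  const saved = JSON.parse(localStorage.getItem('voice-vocabulary') || '{}');
  return { ...DEFAULT_VOCABULARY, ...saved };
}

function saveVocabulary(overrides) {
  localStorage.setItem('voice-vocabulary', JSON.stringify(overrides));
}

// Example: a user swaps in a phrase they find easier to say.
saveVocabulary({ 'start the bill': 'beginCheckout' });
```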

bkardell commented 2 years ago

Happy to do what I can here, feel free to reach out to me on the workshop

ashimura commented 2 years ago

Thanks a lot @bkardell !!!

mhakkinen commented 2 years ago

@ashimura, echoing @bkardell, please feel free to reach out to me on the workshop. I will also reach out to contacts in the emergency communications space to see if there is any interest in spoken interactions.

ashimura commented 2 years ago

Thank you @mhakkinen !!!

bevcorwin commented 2 years ago

Happy to help, looking forward to updates.

plehegar commented 2 weeks ago

@ashimura, breakout for TPAC 2024?