Need for Authors to be able to set Social and Emotional Characteristics of TTS (text to speech)

SuzanneTaylor commented 2 years ago

[This GitHub entry is from the Accessibility for Children Community Group]

Although more research is needed to determine which types of voices best suit which applications at the content level, it is important at the technology-ecosystem level to introduce the ability to set social and emotional speech characteristics.

Situations in which setting these characteristics can be important, with rough examples of markup solutions

Markup suggestions are only meant to give an idea of the type of control that is needed; they have not been carefully crafted or edited so far. Affect-bias defines core categories such as Joy, Shame, Anger, Interest, Excitement, Startle, etc. These categories may help us design markup that will allow authors to specify appropriate voice tones.

"Friendliness"

Case where a child or other listener might imagine the voice sounding increasingly frustrated even when it is neutral (e.g. a GPS repeating “recalculating”) | friendliness="20%", then friendliness="25%", etc., to offset the effect
Emergency / High Risk / Low Support | voice-type="reassuring respecting_urgency"
Responses to Child’s Actions in Educational or Entertainment Games | excitement="10%" joy="25%"
Child has not responded or answered, so the voice needs to attract attention without sounding angry | interest="20%" friendliness="60%" (also a little louder in case the child walked away, but with friendliness high so as not to sound angry)
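
As a rough illustration of how a consumer might read the percentage-style attributes in the examples above, here is a minimal Python sketch. The attribute names are taken from the rough examples in this issue and are not part of any specification; the helper `parse_affect_attributes`, the 0-100% range check, and the choice to ignore unknown attributes are all assumptions made for illustration.

```python
import re

# Attribute names taken from the rough examples in this issue; none of them
# is part of any published specification.
AFFECT_ATTRIBUTES = {"friendliness", "excitement", "joy", "interest"}

# Matches name="value" pairs written with straight double quotes.
_ATTR_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def parse_affect_attributes(markup: str) -> dict[str, float]:
    """Parse name="NN%" pairs into a {name: fraction} mapping (hypothetical helper)."""
    result: dict[str, float] = {}
    for name, raw in _ATTR_PATTERN.findall(markup):
        if name not in AFFECT_ATTRIBUTES:
            continue  # ignore attributes this sketch does not know about
        if not raw.endswith("%"):
            raise ValueError(f"{name}: expected a percentage, got {raw!r}")
        percent = float(raw[:-1])
        if not 0.0 <= percent <= 100.0:
            raise ValueError(f"{name}: {percent}% is outside 0-100%")
        result[name] = percent / 100.0
    return result


levels = parse_affect_attributes('interest="20%" friendliness="60%"')
print(levels)  # {'interest': 0.2, 'friendliness': 0.6}
```

Rejecting out-of-range values rather than silently clamping them is one possible design choice; a real proposal would need to say which behavior authors can rely on.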

Neutral

Test item about inferring emotional information from text alone | voice-type="device" (means: no tone of voice; honor device settings); voice-type="neutral" (means: regardless of device settings, read in a neutral voice)
Child needs to understand they are talking to an AI | voice-type="computer" (means: ensure this doesn’t sound human)

Additional Situations to be Addressed

Education

Responses to Child’s Actions in High Stakes Testing
Responses to Child’s Actions in Testing - Correct Feedback
Responses to Child’s Actions in Testing - Incorrect Feedback
Responses to Child’s Actions in Testing - Ungraded Feedback
Responses to Child’s Actions in Instruction

AI

Cases where AI detects the child’s mood (sentiment analysis).
AI detects that the child is not taking a situation or warning seriously (e.g. laughing, not looking at the screen).

AutoSponge commented 2 years ago

This reminds me of https://www.w3.org/TR/emotionml/. We may need to review it for hints on how to incorporate emotion into this spec.
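
To make the comparison concrete, here is a hedged Python sketch of how intensity values like those proposed in this issue might be written as EmotionML. The element and attribute names (`emotionml`, `emotion`, `category`, `name`, `value`, `category-set`) and the namespace follow EmotionML 1.0 as best understood here and should be checked against the Recommendation. The category-set URI is a placeholder, and category names such as "joy" are not claimed to exist in any published vocabulary; `to_emotionml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

EMOTIONML_NS = "http://www.w3.org/2009/10/emotionml"

# Placeholder only: a real document must point at an actual vocabulary
# definition that declares the category names being used.
HYPOTHETICAL_CATEGORY_SET = "http://example.com/affect-vocab#tts"


def to_emotionml(levels: dict[str, float]) -> str:
    """Serialize {category: intensity in [0, 1]} as an EmotionML string (sketch)."""
    ET.register_namespace("", EMOTIONML_NS)  # emit a default xmlns, no prefix
    root = ET.Element(
        f"{{{EMOTIONML_NS}}}emotionml",
        {"version": "1.0", "category-set": HYPOTHETICAL_CATEGORY_SET},
    )
    emotion = ET.SubElement(root, f"{{{EMOTIONML_NS}}}emotion")
    for name, value in levels.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}: {value} is outside [0, 1]")
        ET.SubElement(
            emotion,
            f"{{{EMOTIONML_NS}}}category",
            {"name": name, "value": f"{value:g}"},
        )
    return ET.tostring(root, encoding="unicode")


print(to_emotionml({"joy": 0.25, "excitement": 0.1}))
```

Whether several categories belong in one `emotion` element or in separate ones, and how such annotations would attach to spoken text, is left open here.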

brennanyoung commented 2 years ago

I strongly approve of anticipating the need for 'affective' characteristics for synthetic voices. In our use case (medical simulation), we use voices which can be in pain, out of breath, anxious and relaxed. I agree that EML is a promising place to start. Some great work in there.

w3c / pronunciation