w3c / dapt

Dubbing and Audio description Profiles of TTML2
https://w3c.github.io/dapt/

Clarify how to use SSML with DAPT #121

Closed cconcolato closed 1 year ago

cconcolato commented 1 year ago

Given the overlap between DAPT and SSML, it would be good to have a clarification on how they relate and can be used together (or not). Section https://w3c.github.io/dapt/#foreign-elements-and-attributes could have an example of "proprietary" metadata mixing SSML and DAPT.

nigelmegitt commented 1 year ago

I've been wondering about adding text to speech directives too. The current two attributes, tta:speak and tta:pitch, already duplicate parts of SSML, so if we allow direct inclusion of SSML we would end up with a mixed model, which seems non-ideal to me. On the other hand, just reproducing each part of SSML that we think might be useful, as additional TTML2 vocabulary, does not seem like a good idea either.

cconcolato commented 1 year ago

Can we envisage embedding a profile of SSML instead?

nigelmegitt commented 1 year ago

I would not like to create a normative dependency that means implementations must support some feature set of SSML, but I agree that we should specify the model for:

  1. Translating tta: attributes into SSML - I think it maps into the prosody element, from memory.
  2. How other SSML attributes could be added.
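
A minimal sketch of item 1, in Python, might look like the following. It assumes that a missing tta:speak is treated as "none", that any other value enables synthesis, and that the tta:pitch value can be passed through unchanged to the pitch attribute of the SSML prosody element. None of these assumptions is confirmed in this thread, inheritance of the attributes is ignored, and `span_to_ssml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

TT = "http://www.w3.org/ns/ttml"
TTA = "http://www.w3.org/ns/ttml#audio"
SSML = "http://www.w3.org/2001/10/synthesis"


def span_to_ssml(span):
    """Return an SSML <speak> element for one TTML span, or None if it is not spoken."""
    # Assumption: a missing tta:speak means "none"; any other value enables synthesis.
    if span.get(f"{{{TTA}}}speak", "none") == "none":
        return None
    speak = ET.Element(f"{{{SSML}}}speak", {"version": "1.0"})
    pitch = span.get(f"{{{TTA}}}pitch")
    if pitch is None:
        speak.text = span.text
    else:
        # Assumption: the tta:pitch value is passed through unchanged.
        prosody = ET.SubElement(speak, f"{{{SSML}}}prosody", {"pitch": pitch})
        prosody.text = span.text
    return speak


span = ET.fromstring(
    f'<span xmlns="{TT}" xmlns:tta="{TTA}" tta:speak="normal" tta:pitch="+10%">'
    "He looks up at a sign.</span>"
)
ssml = span_to_ssml(span)
print(ET.tostring(ssml, encoding="unicode"))
```

Item 2 is not covered here: the sketch only handles the existing tta: attributes and says nothing about where additional SSML attributes would live.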

cconcolato commented 1 year ago

If we go with allowing both SSML syntax and tta: syntax, we should mandate that they be equivalent and, if they are not, indicate which one takes precedence.

nigelmegitt commented 1 year ago

Note to self: there's another W3C spec that specifies how to inject SSML into an attribute - consider if that approach could work here.

nigelmegitt commented 1 year ago

It's https://www.w3.org/TR/spoken-html/ and is a working draft right now.

nigelmegitt commented 1 year ago

Key question for us here: exactly how should more advanced SSML be embedded syntactically into the DAPT Script?

cconcolato commented 1 year ago

Here is a real-world example of SSML:

<speak version="1.0" xml:lang="en-us" xmlns="http://www.w3.org/2001/10/synthesis">
    <prosody rate="fast">The boy smiles then backs away from the window. He looks up at a sign above the storefront. It depicts a coiled <phoneme alphabet="ipa" ph="&#712;k&#333;-br&#601;">cobra</phoneme> and the words, &quot;strike like a cobra. Cobra Kai Karate.&quot;</prosody>
</speak>

As far as I understand, the prosody part can be represented with tta:rate but the phoneme part is not currently possible.
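
To make the gap concrete, here is a rough Python sketch that parses a shortened version of the SSML above and lists the elements that have no counterpart yet. The `COVERED` set simply encodes the claim in this comment that prosody is representable; it is an assumption and has not been checked against TTML2.

```python
import xml.etree.ElementTree as ET

SSML = "http://www.w3.org/2001/10/synthesis"

doc = """<speak version="1.0" xml:lang="en-us" xmlns="http://www.w3.org/2001/10/synthesis">
    <prosody rate="fast">It depicts a coiled <phoneme alphabet="ipa" ph="&#712;k&#333;-br&#601;">cobra</phoneme> and the words.</prosody>
</speak>"""

root = ET.fromstring(doc)

# Assumption taken from the comment above: only prosody has a tta: counterpart.
COVERED = {"prosody"}

# Local names of every SSML element used inside <speak>.
used = {el.tag.split("}")[1] for el in root.iter() if el.tag != f"{{{SSML}}}speak"}
uncovered = sorted(used - COVERED)

# The text that remains if all SSML markup is ignored.
plain_text = " ".join("".join(root.itertext()).split())

print(uncovered)
print(plain_text)
```

The pronunciation hint itself sits in the `ph` attribute of the phoneme element, so dropping the markup loses it even though the word "cobra" survives in the plain text.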

cconcolato commented 1 year ago

As discussed, because other groups are looking into similar topics, we don't want to jump to a conclusion yet. The proposal is to add a note to the DAPT specification saying something like:

Part of the vocabulary of DAPT overlaps with SSML. This version of the specification does not specify how SSML can be either generated from DAPT or embedded into DAPT. Future versions of this specification may do so.

nigelmegitt commented 1 year ago

One option is to specify a complex mapping to the SSML <voice> element from attributes on <ttm:agent>.
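
As a thought experiment only, such a mapping could be sketched in Python as below. The `urn:example:voice` namespace and its `gender` and `age` attributes on ttm:agent are hypothetical; neither DAPT nor TTML2 defines them. They are chosen here because the SSML voice element has attributes with those names.

```python
import xml.etree.ElementTree as ET

TTM = "http://www.w3.org/ns/ttml#metadata"
SSML = "http://www.w3.org/2001/10/synthesis"
EX = "urn:example:voice"  # hypothetical namespace, not defined by DAPT or TTML2


def agent_to_voice(agent, text):
    """Build an SSML <voice> element from hypothetical attributes on ttm:agent."""
    attrs = {}
    for name in ("gender", "age"):  # copied only when present on the agent
        value = agent.get(f"{{{EX}}}{name}")
        if value is not None:
            attrs[name] = value
    voice = ET.Element(f"{{{SSML}}}voice", attrs)
    voice.text = text
    return voice


agent = ET.fromstring(
    f'<ttm:agent xmlns:ttm="{TTM}" xmlns:ex="{EX}" '
    'type="character" xml:id="narrator" ex:gender="female" ex:age="40"/>'
)
voice = agent_to_voice(agent, "The boy smiles.")
print(ET.tostring(voice, encoding="unicode"))
```

A real mapping would also need to say how the text of a script event is tied back to its agent, which this sketch sidesteps by passing the text in directly.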

css-meeting-bot commented 1 year ago

The Timed Text Working Group just discussed SSML, and agreed to the following:

The full IRC log of that discussion
<cpn> Subtopic: SSML
<cpn> s/SSML/Relationship with SSML/
<nigel> s/Relationship with SSML/Clarify how to use SSML with DAPT w3c/dapt#121
<nigel> Github: https://github.com/w3c/dapt/issues/121
<cpn> Nigel: At the moment, in TTML2 we have two audio styling attributes that direct the use of text to speech
<cpn> ... They are derived from SSML semantics
<cpn> ... But the vocabulary and structure is different
<cpn> ... An obvious direction we should allow is to allow a richer feature set from SSML so people can direct the text to speech more directly
<cpn> ... We could define all the syntax in TTML and a mapping to SSML, or inject SSML into the TTML document
<cpn> ... But then, what happens to the two bits of vocabulary already in TTML2
<cpn> ... If injecting SSML, do it with an element structure or an attribute?
<cpn> ... The new thing, is thinking about the voice characteristics. Maybe a good idea is to associate the voice with the agent, then your mapping to SSML would pull in that metadata and use it
<cpn> ... We always had a rule that metadata doesn't drive presentation, but we'd be going against that
<cpn> Cyril: The one other detail, if we were to embed SSML in DAPT, the TTML behaviour is to prune elements not in the TTML namespace, for validation
<cpn> ... I wonder if the entire element would be ignored for the purpose of rendering, or would its internal text content be used. That would make a big difference
<cpn> Pierre: So if you wrap text in an unknown element, would it still be used?
<cpn> Cyril: Yes, it's something you can do in HTML
<cpn> ... Nigel, I think your point about using agent to indicate voice characteristics, I like the idea
<cpn> ... Not a problem in the metadata vocabulary. I think it's a good way to do it
<cpn> Nigel: OK, it sets us down an interesting path, of how to map SSML semantics into TTML. Need to plan ahead, do a thought experiment of the best mapping into TTML if we need them in the future
<cpn> Cyril: Your comment in the PR 157 about proprietary metadata is relevant
<nigel> -> https://github.com/w3c/dapt/pull/157#discussion_r1234279959
<cpn> ... The metadata we're thinking about is what speech generation engine you want to use, etc. Does SSML cover all that?
<cpn> Nigel: [Reviewing the details] They go quite far, I think
<cpn> ... The synthesis processor specifically, I'm not sure you can specify
<cpn> ... I think the idea is you can pass the SSML to any processor, but doesn't contain a pointer to the synthesis processor itself
<cpn> ... I would need to check, but I think that's how it is
<cpn> ... Yes, the synthesis processor is external
<cpn> Nigel: For TTML validation it would prune, also IMSC rendering
<nigel> s/IMSC/imscJS
<cpn> Cyril: Is there a normative statement for that?
<cpn> Pierre: There's a note, if you try to feed a TTML2 document with ruby to a TTML1 processor, it may prune the entire element
<cpn> ... I wouldn't count on the presentation processor keeping the content of the element
<cpn> ... Why not use a span with the content if you want to keep it?
<cpn> Cyril: Need to define a transformation between TTML and SSML; it could be in XSLT
<cpn> Nigel: Construct an intermediate document that prunes elements if they're @@
<cpn> ... You could assert that some SSML element must be included in some presentational element in DAPT
<cpn> ... A simple reading, you wouldn't expect that
<cpn> Cyril: If your implementation is both a TTML and SSML processor, you may keep it
<cpn> Pierre: @@2
<cpn> Cyril: In DAPT we could say something about how to mix SSML and TTML, that would be defining behaviour in fuzzy areas in TTML
<cpn> ... The benefit of basing DAPT on TTML is you can embed it in generic TTML processors
<cpn> Pierre: If you want the benefit of TTML, stay with TTML. But if you need something other than TTML, imscJS or other processors would eventually recognise it
<cpn> ... If not needed, don't do it, but if it's needed it's needed
<cpn> Cyril: Mapping to a different structure seems like unnecessary work, and would have to be maintained
<cpn> Pierre: What's different between them?
<nigel> +1 to avoiding unnecessary work, which it seems to be
<cpn> Nigel: More granular directives for text to speech
<cpn> Pierre: Do the opposite, embed TTML in SSML?
<cpn> Cyril: But the DAPT document is the whole thing
<cpn> ... The example I put in issue 121, is because Netflix uses some SSML engine for voice synthesis
<cpn> ... At the moment we have a proprietary TTAL spec, generate SSML, then send to an API
<cpn> ... Speech rate is covered, but there's a phoneme span that gives pronunciation
<cpn> Pierre: I linked to a new spec for spoken presentation in HTML. It uses attributes instead of elements
<cpn> Nigel: It describes both strategies, seems they're not sure which is the best to use
<cpn> Cyril: So we could say use the same attribute
<cpn> ... That mapping works for us too
<cpn> Pierre: Presumably. HTML has the same issues as us
<cpn> Nigel: These things can be on spans
<cpn> Pierre: And semantically they should be, they convey additional semantics on text
<cpn> Cyril: We could adopt their strategy but not their spec
<cpn> Nigel: We could define a dapt:s namespace that exactly maps to the SSML voice element content
<cpn> Cyril: Which group is working on the spoken presentation in HTML?
<cpn> ... It's a TF in APA WG
<cpn> Nigel: The attribute approach seems nice, we're gravitating towards that
<cpn> Cyril: I prefer the multi-attribute rather than single attribute approach
<nigel> SUMMARY: Gravitating towards multi-attribute approach maybe in an SSML-specific DAPT namespace
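
The multi-attribute direction from the summary could look roughly like the Python sketch below. The `urn:example:dapt-ssml` namespace and its `ph` and `alphabet` attributes are hypothetical, as is the `p_to_ssml` helper; no such vocabulary has been defined. Because the text stays inside an ordinary TTML span, a processor that ignores the foreign attributes still sees the words, which relates to the pruning concern raised in the discussion.

```python
import xml.etree.ElementTree as ET

TT = "http://www.w3.org/ns/ttml"
SSML = "http://www.w3.org/2001/10/synthesis"
EX = "urn:example:dapt-ssml"  # hypothetical namespace for SSML-derived attributes


def append_text(parent, text):
    """Append text after the last child of parent, or to parent.text if it has none."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def p_to_ssml(p):
    """Convert a TTML <p> whose child spans may carry hypothetical phoneme attributes."""
    speak = ET.Element(f"{{{SSML}}}speak", {"version": "1.0"})
    append_text(speak, p.text)
    for span in p:
        ph = span.get(f"{{{EX}}}ph")
        if ph is None:
            # No directive: keep only the span's own text (nested markup is ignored).
            append_text(speak, span.text)
        else:
            attrs = {"ph": ph}
            alphabet = span.get(f"{{{EX}}}alphabet")
            if alphabet:
                attrs["alphabet"] = alphabet
            phoneme = ET.SubElement(speak, f"{{{SSML}}}phoneme", attrs)
            phoneme.text = span.text
        append_text(speak, span.tail)
    return speak


p = ET.fromstring(
    f'<p xmlns="{TT}" xmlns:dss="{EX}">It depicts a coiled '
    '<span dss:alphabet="ipa" dss:ph="&#712;k&#333;-br&#601;">cobra</span> and the words.</p>'
)
ssml = p_to_ssml(p)
print(ET.tostring(ssml, encoding="unicode"))
```

A single-attribute variant would instead pack all directives into one attribute value; the sketch follows the multi-attribute preference noted in the summary.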