regarding #embedded-audio, this is trivial to implement using a transformation step in your processing pipeline that extracts the embedded audio from the TTML document, uploads it to your server of choice, then rewrites the relevant TTML element, e.g., one of an audio, data, or source element as appropriate, to refer to the now server-based resource;
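A minimal sketch of such a transform, assuming (this is not stated in the thread) that the embedded resources are base64-encoded data elements referenced by fragment identifier, and using a hypothetical uploadAudio helper that stores the bytes on a server and returns their new URL:

```typescript
// Pre-processing sketch: externalise embedded audio resources.
// Assumptions: embedded resources are base64-encoded <data> elements with
// xml:id attributes, referenced elsewhere via src="#id"; uploadAudio is a
// hypothetical helper supplied by the caller.
const TT_NS = "http://www.w3.org/ns/ttml";

async function externaliseEmbeddedAudio(
  ttml: string,
  uploadAudio: (bytes: Uint8Array, type: string) => Promise<string>
): Promise<string> {
  const doc = new DOMParser().parseFromString(ttml, "application/xml");

  for (const data of Array.from(doc.getElementsByTagNameNS(TT_NS, "data"))) {
    const id = data.getAttribute("xml:id");
    if (!id || !data.textContent) continue;

    // Decode the base64 payload and push it to the server of choice.
    const bytes = Uint8Array.from(atob(data.textContent.trim()), c => c.charCodeAt(0));
    const url = await uploadAudio(bytes, data.getAttribute("type") ?? "audio/wave");

    // Rewrite every element (audio, data or source) that referenced the
    // embedded resource so that it points at the server-based copy.
    for (const el of Array.from(doc.querySelectorAll(`[src="#${id}"]`))) {
      el.setAttribute("src", url);
    }
    data.remove();
  }
  return new XMLSerializer().serializeToString(doc);
}
```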
regarding #audio-description, yes we can pare that down, though my (conceptual) model of speech processing is to convert text to speech as audio resources, at which point it looks just like the non-speech audio processing model;
regarding #speech, support means "support by any means possible"; you are over-thinking the significance of support here, and conceptually, we need something to allow an author to indicate they require text to speech functionality (in whatever way the implementation wishes to provide it); note that #speech is really not any different than #lineBreak-uax14, which uses identical language re: supports; if you want, we could add a note that indicates the support may be built-in or remoted
so, overall, I think you can positively report support for #embedded-audio (if you can do the above trivial transform - which may involve a human doing the transform) and #speech (if you can do any kind of TTS, local or remote)
if so, then that leaves only your third bullet above, which we can do without any significant impact (IMO)
btw, handling a mixture of audio resources and TTS is also a straightforward transformation step:
(1) ensure every text fragment is wrapped in a span;
(2) perform TTS on each such span's text content, storing the result in an audio resource;
(3) insert an audio element child for each such span pointing at that resource;
you could even concatenate all such TTS audio segments into a single audio resource and then use time based URIs to access the segments in that resource
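A sketch of those three steps, assuming a hypothetical synthesise helper that runs whatever TTS engine you like over a string and returns the URL of the stored audio resource (nested spans and existing audio children are ignored for brevity):

```typescript
// Pre-processing sketch: add TTS-derived audio for each span.
// synthesise() is a hypothetical helper; nothing here is mandated by TTML2.
const TT_NS = "http://www.w3.org/ns/ttml";

async function addTtsAudio(
  ttml: string,
  synthesise: (text: string) => Promise<string>
): Promise<string> {
  const doc = new DOMParser().parseFromString(ttml, "application/xml");

  for (const p of Array.from(doc.getElementsByTagNameNS(TT_NS, "p"))) {
    // (1) Wrap any bare text node in a span so it can carry an audio child.
    for (const node of Array.from(p.childNodes)) {
      if (node.nodeType === Node.TEXT_NODE && node.textContent?.trim()) {
        const span = doc.createElementNS(TT_NS, "span");
        p.replaceChild(span, node);
        span.appendChild(node);
      }
    }
    // (2) Perform TTS on each span's text content, then
    // (3) insert an audio element child pointing at the resulting resource.
    for (const span of Array.from(p.getElementsByTagNameNS(TT_NS, "span"))) {
      const text = span.textContent?.trim();
      if (!text) continue;
      const url = await synthesise(text);
      const audio = doc.createElementNS(TT_NS, "audio");
      audio.setAttribute("src", url);
      span.insertBefore(audio, span.firstChild);
    }
  }
  return new XMLSerializer().serializeToString(doc);
}
```

The concatenation variant mentioned above could then address each segment with a media fragment URI, for example speech.mp3#t=3.2,5.7.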
@skynavga step 2 of your proposal at https://github.com/w3c/ttml2/issues/990#issuecomment-419949018 doesn't seem to be possible with the Web Speech API, for example, though it would be possible for other external text to speech resources. Any implementation approach that uses client-side rendering with that particular API needs to initiate the text to speech process at the appropriate "play" time for the text.
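For illustration only, a sketch of what that constraint looks like: because speechSynthesis.speak produces live output rather than a stored audio resource, the renderer has to trigger it when the text's begin time is reached (the begin time and media clock are assumed to have been resolved from the document elsewhere):

```typescript
// Client-side sketch: speak text at its scheduled time with the Web Speech API.
// A real player would hook its own timeline events rather than setTimeout.
function speakAtPlayTime(text: string, beginSeconds: number, mediaTimeSeconds: () => number): void {
  const delayMs = Math.max(0, (beginSeconds - mediaTimeSeconds()) * 1000);
  window.setTimeout(() => {
    const utterance = new SpeechSynthesisUtterance(text);
    window.speechSynthesis.speak(utterance); // no stored audio resource is produced
  }, delayMs);
}
```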
we need something to allow an author to indicate they require text to speech functionality
#speak does that already.
if you can do the above trivial transform - which may involve a human doing the transform
Really? You think it would be acceptable as part of a demonstrated solution to involve a human in the "implementation"? That's stretching my expectation.
I don't dispute that the transformation is trivial; I'm more concerned about the time needed to do it; as I said in the issue, I raised it to be conservative, and will try to implement something there - I agree that a pre-processor approach should be relatively straightforward.
Really? You think it would be acceptable as part of a demonstrated solution to involve a human in the "implementation"? That's stretching my expectation.
Yes, I do. Or if you like, substitute "AI" for human if you want a quicker result or "monkey" if slower. The point is that it can be done by a rudimentary (even manual) process which satisfies my definition of transformation. [We don't define the transformation process, mind you.]
step 2 of your proposal at #990 (comment) doesn't seem to be possible with the Web Speech API, for example, though it would be possible for other external text to speech resources. Any implementation approach that uses client-side rendering with that particular API needs to initiate the text to speech process at the appropriate "play" time for the text;
who says you have to use the Web Speech API? you can use any TTS technology in a pre-processing step like this; who says it has to be client side? a server side implementation is just fine (think of the server as a remoted part of the client)
you can use any TTS technology in a pre-processing step like this
@skynavga I'm referring to a particular implementation approach; I agree there's no spec requirement to do it this way. Given that the BBC seems to be the only one implementing the presentation feature here, and that is the approach we are likely to take, it is relevant to what we can achieve in the time available.
The Timed Text Working Group just discussed Audio related feature changes ttml2#990, and agreed to the following:
RESOLUTION: @skynavga to change #embedded-audio to #audio in #audio-description
RESOLUTION: @skynavga to remove #embedded-audio, #gain and #pan from #audio-speech
SUMMARY: If #embedded-audio is unlikely to be implemented, consider removing later; Nigel to inform the group if this is going to be the case by 21st September.
Merged early per WG resolution and PR processing.
Reopening pending confirmation (from me) of the status of #embedded-audio as per https://github.com/w3c/ttml2/issues/990#issuecomment-421047917.
Confirming we have a working implementation of the #embedded-audio tests, and closing.
Having reviewed the BBC's ability to provide presentation implementations for the audio features, I would like to propose the following changes to at-risk features:
- Remove the #embedded-audio feature designator, because it requires playback of audio resources embedded in the TTML2 file, which BBC does not expect to be able to demonstrate. I would not like to remove the syntactical possibility of embedded audio resources directly, but arguably we may need to do that also. If I can find a way for BBC to implement this in the next day or so then I will propose "un-removing" #embedded-audio - I'm merely taking a conservative line here given the limited amount of remaining time.
- Change #embedded-audio to #audio in #audio-description to match the previous change.
- Remove #embedded-audio, #gain and #pan from #audio-speech, because I do not think we can implement them all together straightforwardly, at least not so they are all applied at once to the same text using a combination of the Web Audio API and Web Speech API, as they are in their current states; other implementation techniques could be made to work, such as making external asynchronous calls to a suitable text to speech service and playing back the resulting audio resource (see the sketch after this list), but that requires a further degree of complexity.
- Remove #speech, because it is not clear what it means for an implementation to "support" a speech synthesis processor, for example must the implementation include one or merely access one remotely? If the implementation can successfully implement #speak and #pitch then it has adequate support; further support is not required, in my view.
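A hedged sketch of the external-call approach mentioned in the third bullet; the endpoint, request body and response format are invented for illustration, and gain or pan could subsequently be applied to the decoded audio via the Web Audio API:

```typescript
// Sketch: ask an external TTS service for an audio resource, then play it.
// The service URL and JSON shape are assumptions, not part of TTML2.
async function speakViaService(text: string): Promise<void> {
  const response = await fetch("https://tts.example.org/synthesise", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const audioBlob = await response.blob();
  const audio = new Audio(URL.createObjectURL(audioBlob));
  await audio.play();
}
```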