regarding #embedded-audio, this is trivial to implement using a transformation step in your processing pipeline that extracts the embedded audio from the TTML document, uploads it to your server of choice, then rewrites the relevant TTML element, e.g., one of an audio, data, or source element as appropriate, to refer to the now server-based resource;
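A minimal sketch of such a transform, assuming (this is not stated in the thread) that the embedded resources are base64-encoded data elements referenced by fragment identifier, and using a hypothetical uploadAudio helper that stores the bytes on a server and returns their new URL:

```typescript
// Pre-processing sketch: externalise embedded audio resources.
// Assumptions: embedded resources are base64-encoded <data> elements with
// xml:id attributes, referenced elsewhere via src="#id"; uploadAudio is a
// hypothetical helper supplied by the caller.
const TT_NS = "http://www.w3.org/ns/ttml";

async function externaliseEmbeddedAudio(
  ttml: string,
  uploadAudio: (bytes: Uint8Array, type: string) => Promise<string>
): Promise<string> {
  const doc = new DOMParser().parseFromString(ttml, "application/xml");

  for (const data of Array.from(doc.getElementsByTagNameNS(TT_NS, "data"))) {
    const id = data.getAttribute("xml:id");
    if (!id || !data.textContent) continue;

    // Decode the base64 payload and push it to the server of choice.
    const bytes = Uint8Array.from(atob(data.textContent.trim()), c => c.charCodeAt(0));
    const url = await uploadAudio(bytes, data.getAttribute("type") ?? "audio/wave");

    // Rewrite every element (audio, data or source) that referenced the
    // embedded resource so that it points at the server-based copy.
    for (const el of Array.from(doc.querySelectorAll(`[src="#${id}"]`))) {
      el.setAttribute("src", url);
    }
    data.remove();
  }
  return new XMLSerializer().serializeToString(doc);
}
```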
regarding #audio-description, yes we can pare that down, though my (conceptual) model of speech processing is to convert text to speech as audio resources, at which point it looks just like the non-speech audio processing model;
regarding #speech, support means "support by any means possible"; you are over-thinking the significance of support here, and conceptually, we need something to allow an author to indicate they require text to speech functionality (in whatever way the implementation wishes to provide it); note that #speech is really not any different than #lineBreak-uax14, which uses identical language re: supports; if you want, we could add a note that indicates the support may be built-in or remoted
so, overall, I think you can positively report support for #embedded-audio (if you can do the above trivial transform - which may involve a human doing the transform) and #speech (if you can do any kind of TTS, local or remote)
if so, then that leaves only your third bullet above, which we can do without any significant impact (IMO)
btw, handling a mixture of audio resources and TTS is also a straightforward transformation step:
(1) ensure every text fragment is wrapped in a span;
(2) perform TTS on each such span's text content, storing the result in an audio resource;
(3) insert an audio element child for each such span pointing at that resource;
you could even concatenate all such TTS audio segments into a single audio resource and then use time based URIs to access the segments in that resource
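A sketch of those three steps, assuming a hypothetical synthesise helper that runs whatever TTS engine you like over a string and returns the URL of the stored audio resource (nested spans and existing audio children are ignored for brevity):

```typescript
// Pre-processing sketch: add TTS-derived audio for each span.
// synthesise() is a hypothetical helper; nothing here is mandated by TTML2.
const TT_NS = "http://www.w3.org/ns/ttml";

async function addTtsAudio(
  ttml: string,
  synthesise: (text: string) => Promise<string>
): Promise<string> {
  const doc = new DOMParser().parseFromString(ttml, "application/xml");

  for (const p of Array.from(doc.getElementsByTagNameNS(TT_NS, "p"))) {
    // (1) Wrap any bare text node in a span so it can carry an audio child.
    for (const node of Array.from(p.childNodes)) {
      if (node.nodeType === Node.TEXT_NODE && node.textContent?.trim()) {
        const span = doc.createElementNS(TT_NS, "span");
        p.replaceChild(span, node);
        span.appendChild(node);
      }
    }
    // (2) Perform TTS on each span's text content, then
    // (3) insert an audio element child pointing at the resulting resource.
    for (const span of Array.from(p.getElementsByTagNameNS(TT_NS, "span"))) {
      const text = span.textContent?.trim();
      if (!text) continue;
      const url = await synthesise(text);
      const audio = doc.createElementNS(TT_NS, "audio");
      audio.setAttribute("src", url);
      span.insertBefore(audio, span.firstChild);
    }
  }
  return new XMLSerializer().serializeToString(doc);
}
```

The concatenation variant mentioned above could then address each segment with a media fragment URI, for example speech.mp3#t=3.2,5.7.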
@skynavga step 2 of your proposal at https://github.com/w3c/ttml2/issues/990#issuecomment-419949018 doesn't seem to be possible with the Web Speech API, for example, though it would be possible for other external text to speech resources. Any implementation approach that uses client-side rendering with that particular API needs to initiate the text to speech process at the appropriate "play" time for the text.
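For illustration only, a sketch of what that constraint looks like: because speechSynthesis.speak produces live output rather than a stored audio resource, the renderer has to trigger it when the text's begin time is reached (the begin time and media clock are assumed to have been resolved from the document elsewhere):

```typescript
// Client-side sketch: speak text at its scheduled time with the Web Speech API.
// A real player would hook its own timeline events rather than setTimeout.
function speakAtPlayTime(text: string, beginSeconds: number, mediaTimeSeconds: () => number): void {
  const delayMs = Math.max(0, (beginSeconds - mediaTimeSeconds()) * 1000);
  window.setTimeout(() => {
    const utterance = new SpeechSynthesisUtterance(text);
    window.speechSynthesis.speak(utterance); // no stored audio resource is produced
  }, delayMs);
}
```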
we need something to allow an author to indicate they require text to speech functionality
#speak does that already.
if you can do the above trivial transform - which may involve a human doing the transform
Really? You think it would be acceptable as part of a demonstrated solution to involve a human in the "implementation"? That's stretching my expectation.
I don't dispute that the transformation is trivial; I'm more concerned about the time needed to do it; as I said in the issue, I raised it to be conservative, and will try to implement something there - I agree that a pre-processor approach should be relatively straightforward.
Really? You think it would be acceptable as part of a demonstrated solution to involve a human in the "implementation"? That's stretching my expectation.
Yes, I do. Or if you like, substitute "AI" for human if you want a quicker result or "monkey" if slower. The point is that it can be done by a rudimentary (even manual) process which satisfies my definition of transformation. [We don't define the transformation process, mind you.]
step 2 of your proposal at #990 (comment) doesn't seem to be possible with the Web Speech API, for example, though it would be possible for other external text to speech resources. Any implementation approach that uses client-side rendering with that particular API needs to initiate the text to speech process at the appropriate "play" time for the text;
who says you have to use the Web Speech API? you can use any TTS technology in a pre-processing step like this; who says it has to be client side? a server side implementation is just fine (think of the server as a remoted part of the client)
you can use any TTS technology in a pre-processing step like this
@skynavga I'm referring to a particular implementation approach; I agree there's no spec requirement to do it this way. Given that the BBC seems to be the only one implementing the presentation feature here, and that is the approach we are likely to take, it is relevant to what we can achieve in the time available.
The Timed Text Working Group just discussed Audio related feature changes ttml2#990, and agreed to the following:
RESOLUTION: @skynavga to change #embedded-audio to #audio in #audio-description
RESOLUTION: @skynavga to remove #embedded-audio, #gain and #pan from #audio-speech
SUMMARY: If #embedded-audio is unlikely to be implemented, consider removing later; Nigel to inform the group if this is going to be the case by 21st September.
Merged early per WG resolution and PR processing.
Reopening pending confirmation (from me) of the status of #embedded-audio as per https://github.com/w3c/ttml2/issues/990#issuecomment-421047917.
Confirming we have a working implementation of the #embedded-audio tests, and closing.
Having reviewed the BBC's ability to provide presentation implementations for the audio features, I would like to propose the following changes to at-risk features:
- Remove the #embedded-audio feature designator, because it requires playback of audio resources embedded in the TTML2 file, which BBC does not expect to be able to demonstrate. I would not like to remove the syntactical possibility of embedded audio resources directly, but arguably we may need to do that also. If I can find a way for BBC to implement this in the next day or so then I will propose "un-removing" #embedded-audio - I'm merely taking a conservative line here given the limited amount of remaining time.
- Change #embedded-audio to #audio in #audio-description to match the previous change.
- Remove #embedded-audio, #gain and #pan from #audio-speech, because I do not think we can implement them all together straightforwardly, at least not so they are all applied at once to the same text using a combination of the Web Audio API and Web Speech API, as they are in their current states; other implementation techniques could be made to work, such as making external asynchronous calls to a suitable text to speech service and playing back the resulting audio resource (see the sketch after this list), but that requires a further degree of complexity.
- Remove #speech, because it is not clear what it means for an implementation to "support" a speech synthesis processor, for example must the implementation include one or merely access one remotely? If the implementation can successfully implement #speak and #pitch then it has adequate support; further support is not required, in my view.
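A hedged sketch of the external-call approach mentioned in the third bullet; the endpoint, request body and response format are invented for illustration, and gain or pan could subsequently be applied to the decoded audio via the Web Audio API:

```typescript
// Sketch: ask an external TTS service for an audio resource, then play it.
// The service URL and JSON shape are assumptions, not part of TTML2.
async function speakViaService(text: string): Promise<void> {
  const response = await fetch("https://tts.example.org/synthesise", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const audioBlob = await response.blob();
  const audio = new Audio(URL.createObjectURL(audioBlob));
  await audio.play();
}
```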