Spoken subtitle - Githubissues

porero commented 5 years ago

Is your feature request related to a problem? Please describe. When a media content is in a different language to that spoken, and is subtitles, people who can’t read subtitles don’t have access to content, because what they hear is in a different language.

Describe the solution you'd like A clear and concise description of what you want to happen.

Watching a children Japanese cartoon not dubbed only subtitled in Catalan, unless you can read subtitles you don’t understand what is happening
Within news programmes, some interventions are in the original language with subtitles. Unless you can read subtitles you don’t understand what they are saying, Ex. Theresa May explaining Brexit agreement in Catalan TV.
Watching VoD movie in OV with subtitles.
Catch up TV/Start over watching news.
Watching diff language/same language content in a smartphone subtitles are too small
Watching diff language/same language content in a public screen too far away subtitles are not legible, for example going to a Obama public conference when I don’t speak English, or in a museum.
Live productions (opera/theatre) subtitles

Describe alternatives you've considered

Subtitles may be burnt or a text file
Regarding audio generation: Generated at broadcast or client sides
Two solutions that can be applied to all cases:
- audio is generated at the broadcast/server side and sent as an audio channel using audio description channel or additional audio channel:
- origin mix.
- receiver/client mix.
- audio is generated at the client side from:
  - streamed text with subtitles.
  - file with audio segments.
  - OCR of burned-in subtitles.

State if you intend for this requirement to be met by a particular specification

Does this requirement represent a change in scope

Additional context

Use cases:

TTS of existing subtitles on the client side in companion screen
OCR for burnt in subtitles
Smartphone screen reader to be applied to subtitle file
In case TTML2 is the adequate format for audio subtitles we need to developed an application to speech to text from this file format.
Generate the TTS in HbbTV, or the deliver generated TTS in HbbTV

skynavga commented 5 years ago

This issue is not in a valid format, as it does not include the information requested in the requirement template. Please edit your comment in the above or close the issue if you do not intend to provide this information. Also, please be aware that TTML2 already provides support for spoken subtitles.

porero commented 5 years ago

Thank you. I am writing further requirements for spoken subtitles, and I hope to post them soon. I have never posted to Github, so I had to generate an account and understand how to do it. Sorry but I will deliver very soon.

skynavga commented 5 years ago

@porero thanks; please paste your detailed requirements into the initial comment above (by using the edit option); also, please be sure to identify (1) how your requirements are not met by current TTML2 audio and speech to text functionality, and (2) whether your proposal applies to IMSC and/or TTML; if you are not familiar with IMSC, it is a profile of TTML, and includes only support for a subset of the features of TTML, e.g., it does not (at present) support the audio or speech to text features.

porero commented 5 years ago

[Updated: content of this comment moved to the top]

skynavga commented 5 years ago

Hello @porero. I have read your above elaboration, and I must admit that I do not understand what you are asking for that isn't already supported by TTML (in one form or another). Your requests seem to boil down to a need to provide text to speech (although above you mention "speech to text" which confuses me).

Much of what you mention above pertains to applications that make use of TTML, and not TTML specific features or technology. We (the TTWG) view TTML (and IMSC and other profiles, such as SMPTE-TT and EBU-TT) as enabling technologies to be integrated into and employed by applications in a variety of domains, only one of which is the delivery of caption or subtitle data. That said, the TTWG does not undertake to define specific applications of TTML, though we have, at times, focused on the needs of specific applications to drive the definition of new features (e.g., Japanese subtitles, live captioning, karaoke, etc.) However, unless we can identify what specific features are missing, we cannot take further action.

Regarding the specifics of your proposal, we need more information such as:

What is missing from the current audio and text to speech mechanism currently defined in TTML2?
If there is a missing feature (and I do not yet see one described), and, were it to be added to TTML, then is it also needed in IMSC?

Finally, I would urge you to carefully review the details of audio and text to speech support in TTML2 so that we may base this conversation on a common understanding of what is already present, and what might be missing.

Absent the identification of specific missing features, I fear the TTWG will (eventually) close this issue without taking any action.

nigelmegitt commented 5 years ago

Please note that @skynavga is a member of the Timed Text Working Group and the group has not discussed this issue yet; as such his https://github.com/w3c/tt-reqs/issues/13#issuecomment-451319068 does not represent a consensus view of the group at this time.

As Chair of the TTWG I would like to thank you for raising this @porero and congratulate you (and sympathise) for using GitHub in this way for the first time. What you've done is fine for us to make a start with understanding your submission, and I may go and edit the opening comment at the top of the issue to match what you added in https://github.com/w3c/tt-reqs/issues/13#issuecomment-451298677, for clarity, if that's okay with you? We may also have some follow-up questions so please watch this space.

For the benefit of others watching this, as it happens @porero and I had a chance to discuss this briefly around a month ago, and my understanding was that the core requirement is to be able to provide a user experience where someone who does not understand the original language audio and cannot read the visual representation of the (audio) spoken words translated into a language they do understand, instead gets to hear an audio representation of that translation text, co-timed with the audio. This therefore makes that content accessible. This practice is in use in some countries already - I remember hearing it in use in the Netherlands many years ago.

I agree with @skynavga that there are likely to be parts of the big picture requirement that we cannot handle in TTWG, for example optical character recognition of burnt-in translation subtitle text, which does not seem to be within our scope.

However there are other parts of this requirement that may need some modification to TTML or IMSC. For example, right now we can specify the language of text, and the timing, and whether or not the presentation of that text should be "forced", so one solution might be to recommend in implementations that all forced subtitles/captions are made available to a screen reader, which can be done. However another might be to add richer data, i.e. to label the text using a ttm:role of "translation" (not currently in the list) and use that data, or to identify the original language - again there's no scope for this in TTML now using native elements or attributes.

Another side to this is what the user experience should be, which may or may not be in scope of the TTWG's work. For example, should we recommend that implementations provide options for presenting translations (however they have been identified) in vision only, in audio only, or in both? Should the default audio renderer be a screen reader, or something else?

As mentioned before, TTWG will need to make a call on which of these are in scope and achievable, and which are not. And it may be that no change is needed in the TTML or IMSC specifications at all.

porero commented 5 years ago

Dear Nigel, please go ahead and edit the opening comment at the top of the issue to match what I added in #13 (comment) thank you.

skynavga commented 5 years ago

Just to be clear, I am speaking with my editor hat on, not simply as a member.

nigelmegitt commented 5 years ago

@porero Thanks, I've done that.

andreastai commented 5 years ago

@porero Thanks for submitting this important request. I very much like how you added real world example for this use case.

I support this request and I think that two additions to TTML, IMSC and/or a profile are needed:

a) adding syntax that express the desired behaviour b) adding the desired client behaviour for audio rendering

I also think that this not a niche requirement but a requirement that supports an important accessibility service. This should be taken into consideration when there is a discussion if this is in scope.

skynavga commented 5 years ago

@porero @tairt this thread mentions three different possible high-level requirements, as far as I can tell:

perform speech to text on an audio track;
perform image to text on a video track;
perform language translation on a timed text track.

While the first two of these are interesting research projects, they are clearly out of scope for TTML/IMSC.

The third is also interesting, as part of an application environment that makes use of TTML/IMSC, but here I don't see a specific ask that would lead to the possibility of, say, "adding syntax".

What I would need to see to proceed (in any fashion at all on this request) is an actual implementation of a real world system that uses language translation on existing text track content, from which specific proposals might appear that would suggest adding any specific syntax.

As it is, the existing metadata (and general language extensibility) support in TTML/IMSC already supports this last (of three) applications, so I again conclude there is no requirement for a new syntax or feature being proposed here.

nigelmegitt commented 5 years ago

@skynavga My reading of this is that there may be gaps in the signalling aspect of when subtitles are a translation vs when they are in the base language, and how to trigger the desired presentation behaviour, i.e. there may be (or may not be) syntactic and semantic gaps relating to this in our specs.

I agree that content processing tasks like speech or image to text, or automated translation, are out of scope of the document formats we are chartered to work on in TTWG, except insofar as it should be possible to be able to express the output of those tasks in a TTML document.

skynavga commented 5 years ago

@nigelmegitt It is not the design intent of TTML to define or employ semantic markup, at least beyond what is currently supported by ttm:role. The original intent of ttm:role was merely to support interoperation with certain other caption/subtitle formats (which I don't even recall at the moment).

As the TTML metadata and markup systems are extensible on a per-application basis, TTML already supports any and all markup that might be desired by a specific application.

The present proposal (this issue) does not make reference to any specific application and does not give any hint of what markup they are seeking, let alone how such markup may have any presentation semantics for TTML. As such, I see this issue as non-actionable, and certainly not suggestive of any new features, either semantic or syntactic.

If the proposer develops a specific application and brings to our attention a requirement for specific semantic markup, then they are free to do so in the future.

nigelmegitt commented 5 years ago

This was picked up by TTWG today. The discussion was not fully recorded in the minutes due to the IRC server going down and coming back again mid-way through, but some parts are available at https://www.w3.org/2019/01/31-tt-minutes.html#item08

SUMMARY: We think the requirement here is to signal translations, and describe (potential) workflows for triggering TTS based on translations.

Flag timed text as a translation so that it can be used to drive text to speech.
@nigelmegitt to try to recast the requirements as per the TTWG discussion and check in with @porero.

palemieux commented 5 years ago

Flag timed text as a translation so that it can be used to drive text to speech.

Couldn't the forced narrative track be used, since it includes timed text intended for all viewers of a particular language?

skynavga commented 5 years ago

My position is

the requirements written above do not (in my mind) translate (pun intended) to a requirement to signal translations;
the requirements as I understand them are clearly in the domain of application semantics, and, as such, are out of scope for TTML; as it is, TTML supports TTS, which can be used in a variety of ways by different applications.

css-meeting-bot commented 5 years ago

The Timed Text Working Group just discussed Spoken subtitle tt-reqs#13.

The full IRC log of that discussion

<nigel> Topic: Spoken subtitle tt-reqs#13
<nigel> github: https://github.com/w3c/tt-reqs/issues/13
<nigel> Glenn: Nigel have you managed to contact the issue raiser on this?
<nigel> Nigel: No, sorry, thanks for the reminder, I need to follow up with Pilar.

w3c / tt-reqs

Spoken subtitle #13