New Feature: start addressing live captioning

silviapfeiffer commented 7 years ago

We've had good discussions about this at FOMS and some new ideas.

I'd like to use this bug to pull everything together, since it was pushed to v2 for VTT.

Here are the related bugzilla bugs:

silviapfeiffer commented 7 years ago

To summarize the outcome of our discussions at FOMS:

live captions currently come from all kinds of sources in all kinds of formats - mostly proprietary, but mostly they create the kind of commands that CEA608 supports
these commands don't provide text for full captions (which are the equivalent of a VTT cue) but successively build a cue with additional text and some positioning, styling and delete commands
the key requirement on browsers for live captioning is the ability to replicate this successive cue-building functionality and render it (i.e. sub-cue resolution to achieve low-latency rendering) -> REQ1: JS API to start unfinished cues, change them, and end them as needed
a second requirement is to make a "recording" of such successive cue building functionality that can be interleaved as a text track into video streams (i.e. sub-cue resolution, somewhat analogous to how keyframes and diff-frames work for video streams, where the keyframes are the new cues and the diff-frames the changes to the cue) - this should be usable as in-band tracks or with DASH or HLS etc -> REQ2: add cue fragments to WebVTT
a possible third requirement could be an agreement on the protocol that is being used to send live captions to the browser, i.e. the format in which to send the commands. However, many video players will accept captions in many different formats, then transform them client side to the new commands (as defined in REQ1). Also, the simplest new protocol would be to use the cue fragments WebVTT file as defined in REQ2 and use HTTP byte range requests to retrieve them from a server API (HLS or DASH style).
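
To make REQ1 a little more concrete, here is a minimal sketch of what "start an unfinished cue, change it, end it" could look like as a data model. Everything in it (`LiveCue`, `CueCommand`, `startCue`, `applyCommand`, and the three command kinds) is hypothetical and invented for illustration; it is not an existing or proposed browser API, and it does not model positioning or styling commands.

```typescript
// Hypothetical command set, loosely modelled on the operations described
// above (add text, delete text, end the cue). Not a real or proposed API.
type CueCommand =
  | { kind: "append"; text: string }
  | { kind: "backspace"; count: number }
  | { kind: "end"; endTime: number };

interface LiveCue {
  startTime: number;
  endTime: number | null; // null while the cue is still unfinished
  text: string;
}

// Start an unfinished cue: it has a start time but no end time yet.
function startCue(startTime: number): LiveCue {
  return { startTime, endTime: null, text: "" };
}

// Apply one command to an unfinished cue and return the updated cue.
function applyCommand(cue: LiveCue, cmd: CueCommand): LiveCue {
  if (cue.endTime !== null) {
    throw new Error("cue has already ended");
  }
  switch (cmd.kind) {
    case "append":
      return { ...cue, text: cue.text + cmd.text };
    case "backspace":
      return {
        ...cue,
        text: cue.text.slice(0, Math.max(0, cue.text.length - cmd.count)),
      };
    case "end":
      return { ...cue, endTime: cmd.endTime };
  }
}

// Usage: build one cue from a stream of commands, including a correction.
const liveCommands: CueCommand[] = [
  { kind: "append", text: "The" },
  { kind: "append", text: " frist" },
  { kind: "backspace", count: 5 },
  { kind: "append", text: "first sentence." },
  { kind: "end", endTime: 5 },
];
let liveCue = startCue(0);
for (const cmd of liveCommands) {
  liveCue = applyCommand(liveCue, cmd);
}
```

A renderer would repaint after every command rather than waiting for the `end` command, which is where the sub-cue resolution comes from.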

silviapfeiffer commented 7 years ago

To analyse REQ1, see #319

To analyse REQ2, see #320

DanielBaulig commented 7 years ago

I would like to weigh in and let you guys know that Facebook would greatly appreciate these changes to WebVTT and the cue APIs to properly support live captioning. Please let us know if there's anything we can do to help get this under way.

silviapfeiffer commented 7 years ago

@DanielBaulig thanks, feedback from large sites like Facebook about the need for such features is really useful for getting browsers interested.

nigelmegitt commented 7 years ago

This model is highly stateful in the receiver. This means that if a user begins watching a set of incremental cues part way through their build, the behaviour is likely to be inconsistent at best. Similarly, if a user does a live rewind, reproducing the correct on-screen text is harder than it needs to be. Experience from the broadcast industry shows that this leads to an impaired audience experience, and newer methods for carrying live subtitles do not reproduce this behaviour - it just doesn't work very well as an architecture.

Instead, systems that reproduce the new current presentation, and add incrementally to it, avoid this problem. For example, for a subtitle that builds up word by word, send:

The
The first
The first sentence.

etc. You might want to think about an alternate architecture that is more robust in these kinds of use cases.
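
The difference between the two architectures can be sketched with a receiver that joins part way through the build. This is only an illustration: the function names and the `joinAt` parameter are made up, and real streams would carry timing as well as text.

```typescript
// Delta stream: each message carries only the newly added text.
const deltaMessages: string[] = ["The", " first", " sentence."];
// Snapshot stream: each message carries the full current presentation.
const snapshotMessages: string[] = ["The", "The first", "The first sentence."];

// A receiver that joins at message index `joinAt` only sees later messages.
// With deltas it has to concatenate whatever it happened to receive.
function renderFromDeltas(messages: string[], joinAt: number): string {
  return messages.slice(joinAt).join("");
}

// With snapshots it simply shows the most recent message it received.
function renderFromSnapshots(messages: string[], joinAt: number): string {
  const seen = messages.slice(joinAt);
  return seen[seen.length - 1] ?? "";
}

// Joining after the first message has already gone by:
const lateDeltaText = renderFromDeltas(deltaMessages, 1); // " first sentence."
const lateSnapshotText = renderFromSnapshots(snapshotMessages, 1); // "The first sentence."
```

The delta receiver has lost "The" for good, while the snapshot receiver is correct as soon as it gets any message.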

By the way, using 608 as a starting point here is probably unhelpful and misleading, since "mostly they create the kind of commands that CEA608 supports" is a generalisation with some very large exceptions. Much of the world does use 608, and much of it does not.

silviapfeiffer commented 7 years ago

It has a certain attraction to send "replacement" cues, i.e. a cue that would replace what's already there with something more complete:

it's easier to ascertain that cue fragments had not been lost
it's easier to play backwards / rewind
it's easier to seek to a location and just play from there with the whole cue available

However, there are also challenges:

need to identify a "replacement" for an existing cue and replace that cue's text with the new one
larger bandwidth use with text (though that is minimal compared to audio or video)
if you're unable to identify that a cue continues, then you need to be quite specific about the timing on these "replacement" cues, i.e. they need to follow each other immediately - in this case, they cannot be identified as a single cue any more but are actually subsequent cues.
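
One way the first challenge could be handled is to let a replacement reuse the identifier of the cue it replaces. The sketch below assumes exactly that; the `Cue` shape, the `receiveCue` function and the rule "same id means replacement" are all assumptions made for illustration, not anything WebVTT defines today.

```typescript
interface Cue {
  id: string;
  startTime: number;
  endTime: number;
  text: string;
}

// Assumed rule: an incoming cue whose id is already known replaces the
// text and end time of the existing cue, which keeps its original start.
function receiveCue(track: Map<string, Cue>, incoming: Cue): void {
  const existing = track.get(incoming.id);
  if (existing) {
    track.set(incoming.id, { ...incoming, startTime: existing.startTime });
  } else {
    track.set(incoming.id, incoming);
  }
}

// Usage: three messages build up what remains a single cue.
const replacementTrack = new Map<string, Cue>();
receiveCue(replacementTrack, { id: "c1", startTime: 0, endTime: 5, text: "The" });
receiveCue(replacementTrack, { id: "c1", startTime: 1, endTime: 6, text: "The first" });
receiveCue(replacementTrack, { id: "c1", startTime: 2, endTime: 7, text: "The first sentence." });
```

A receiver that missed the first message would simply create the cue from the first replacement it sees, with a later start time but the complete text.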

nigelmegitt commented 7 years ago

These challenges are, I think, caused by the starting point of specifying no end time for a cue, and then keeping it open or replacing it. The alternative is to specify some end time that is a prediction of the future, and rather than having a 'replace' semantic, have an 'adjust end' semantic on a previous cue. To amend my previous example:

00:00:00 -> 00:00:05 The
00:00:01 -> 00:00:06, Change cue 1 end to 00:00:01, The first
00:00:02 -> 00:00:07, Change cue 2 end to 00:00:02, The first sentence.

etc. That way if the receiver has the previous cues then it can adjust them, and if it does not then there is no action to take.
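
A rough sketch of how a receiver might process such messages follows. The types `TimedCue` and `CueMessage` and the functions `applyMessage` and `activeText` are hypothetical names chosen for this illustration, and the numeric cue ids stand in for "cue 1", "cue 2" in the example above.

```typescript
interface TimedCue {
  id: number;
  start: number;
  end: number; // predicted end time, may be shortened later
  text: string;
}

interface CueMessage {
  cue: TimedCue;
  adjust?: { id: number; end: number }; // "change cue <id> end to <end>"
}

function applyMessage(received: Map<number, TimedCue>, msg: CueMessage): void {
  if (msg.adjust) {
    const previous = received.get(msg.adjust.id);
    // If the receiver never got the previous cue, there is nothing to do.
    if (previous) {
      previous.end = msg.adjust.end;
    }
  }
  received.set(msg.cue.id, { ...msg.cue });
}

// Texts of all cues active at time t (start inclusive, end exclusive).
function activeText(received: Map<number, TimedCue>, t: number): string[] {
  return Array.from(received.values())
    .filter((c) => c.start <= t && t < c.end)
    .map((c) => c.text);
}

const adjustMessages: CueMessage[] = [
  { cue: { id: 1, start: 0, end: 5, text: "The" } },
  { cue: { id: 2, start: 1, end: 6, text: "The first" }, adjust: { id: 1, end: 1 } },
  { cue: { id: 3, start: 2, end: 7, text: "The first sentence." }, adjust: { id: 2, end: 2 } },
];

// A receiver that saw everything, and one that joined at the third message.
const fullReceiver = new Map<number, TimedCue>();
adjustMessages.forEach((m) => applyMessage(fullReceiver, m));
const lateReceiver = new Map<number, TimedCue>();
adjustMessages.slice(2).forEach((m) => applyMessage(lateReceiver, m));
```

At t = 2.5 both receivers show only "The first sentence.": the full receiver because the earlier cues were shortened, the late one because it never had them.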

This is easy to present visually; the only downside is that there is no information that allows other parts of the system to identify that all three of those cues represent the same content, though that could possibly be done using some identifier other than the cue identifier, if there is a use case for it.

silviapfeiffer commented 7 years ago

So that's semantically problematic because now you're dealing with 3 different cues rather than a single one that is successively created. Also, you are still replacing the previous cue - you're not replacing content, but replacing end time.

Both of these issues are likely minor issues, so yes, this is an alternative.

w3c / webvtt

New Feature: start addressing live captioning #318