w3c / webvtt

WebVTT Standard
https://w3c.github.io/webvtt/

New Feature: start addressing live captioning #318

Open silviapfeiffer opened 7 years ago

silviapfeiffer commented 7 years ago

We've had good discussions about this at FOMS and some new ideas.

I'd like to use this bug to pull everything together, since it was pushed to v2 for VTT.

Here are the related bugzilla bugs:

silviapfeiffer commented 7 years ago

To summarize the outcome of our discussions at FOMS:

silviapfeiffer commented 7 years ago

To analyse REQ1, see #319

To analyse REQ2, see #320

DanielBaulig commented 7 years ago

I would like to weigh in and let you guys know that Facebook would greatly appreciate these changes to WebVTT and the cue APIs to properly support live captioning. Please let us know if there's anything we can do to help get this under way.

silviapfeiffer commented 7 years ago

@DanielBaulig thanks. Feedback from large sites like Facebook about the need for such features is really useful for getting browsers interested.

nigelmegitt commented 7 years ago

This model is highly stateful in the receiver. This means that if a user begins watching a set of incremental cues part way through their build, the behaviour is likely to be inconsistent at best. Similarly, if a user does a live-rewind, reproducing the correct on-screen text is harder than it needs to be. Experience from the broadcast industry shows that this leads to an impaired audience experience, and newer methods for carrying live subtitles do not reproduce this behaviour: it just doesn't work very well as an architecture.

Instead, systems that reproduce the new current presentation, and add incrementally to it, avoid this problem. For example, for a subtitle that builds up word by word, send:

  1. The
  2. The first
  3. The first sentence.

etc. You might want to think about an alternate architecture that is more robust in these kinds of use cases.
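The difference between the two models can be sketched in a few lines. This is only an illustration, not WebVTT syntax or any proposed API: `Message`, `render_deltas` and `render_snapshots` are hypothetical names, and the "late joiner" is simulated by dropping the first message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Message:
    """One live-caption update (hypothetical type, for illustration only)."""
    text: str


def render_deltas(received: list[Message]) -> str:
    # Stateful receiver: the display is built from every delta seen so far,
    # so a missed message leaves the on-screen text incomplete.
    return " ".join(m.text for m in received)


def render_snapshots(received: list[Message]) -> str:
    # Snapshot receiver: the most recent message alone is the full display.
    return received[-1].text if received else ""


deltas = [Message("The"), Message("first"), Message("sentence.")]
snapshots = [Message("The"), Message("The first"), Message("The first sentence.")]

# A viewer who joins after the first message was sent never receives it.
print(render_deltas(deltas[1:]))        # first sentence.
print(render_snapshots(snapshots[1:]))  # The first sentence.
```

The snapshot receiver gets the right text no matter where it joins, at the cost of resending text that was already delivered.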

By the way, using 608 as a starting point here is probably unhelpful and misleading, since "mostly they create the kind of commands that CEA608 supports" is a generalisation with some very large exceptions. Much of the world does use 608, and much of it does not.

silviapfeiffer commented 7 years ago

It has a certain attraction to send "replacement" cues, i.e. a cue that would replace what's already there with something more complete:

However, there are also challenges:

nigelmegitt commented 7 years ago

These challenges are, I think, caused by the starting point of specifying no end time for a cue, and then keeping it open or replacing it. The alternative is to specify some end time that is a prediction of the future, and rather than having a 'replace' semantic, have an 'adjust end' semantic on a previous cue. To amend my previous example:

  1. 00:00:00 -> 00:00:05, The
  2. 00:00:01 -> 00:00:06, Change cue 1 end to 00:00:01, The first
  3. 00:00:02 -> 00:00:07, Change cue 2 end to 00:00:02, The first sentence.

etc. That way if the receiver has the previous cues then it can adjust them, and if it does not then there is no action to take.
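A rough sketch of how a receiver might handle this 'adjust end' semantic, under stated assumptions: `Cue`, `Update`, `apply_update` and `active_text` are hypothetical names rather than part of WebVTT or the cue APIs, times are in seconds, and a cue is treated as active while `start <= t < end`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    cue_id: str
    start: float  # seconds
    end: float    # predicted end time; may be shortened by a later update
    text: str


@dataclass
class Update:
    cue: Cue
    adjust_id: Optional[str] = None     # hypothetical: earlier cue to shorten
    adjust_end: Optional[float] = None  # hypothetical: its new end time


def make_updates() -> list[Update]:
    # The three updates from the example above, built fresh for each receiver.
    return [
        Update(Cue("1", 0, 5, "The")),
        Update(Cue("2", 1, 6, "The first"), "1", 1),
        Update(Cue("3", 2, 7, "The first sentence."), "2", 2),
    ]


def apply_update(cues: dict[str, Cue], update: Update) -> None:
    # Shorten the earlier cue only if this receiver actually holds it;
    # a receiver that never saw it has nothing to do.
    target = cues.get(update.adjust_id) if update.adjust_id else None
    if target is not None and update.adjust_end is not None:
        target.end = update.adjust_end
    cues[update.cue.cue_id] = update.cue


def active_text(cues: dict[str, Cue], t: float) -> list[str]:
    # Assumption: a cue is active while start <= t < end.
    return [c.text for c in cues.values() if c.start <= t < c.end]


full: dict[str, Cue] = {}
for u in make_updates():
    apply_update(full, u)

late: dict[str, Cue] = {}
for u in make_updates()[1:]:  # joined after the first update was sent
    apply_update(late, u)

print(active_text(full, 2.5))  # ['The first sentence.']
print(active_text(late, 2.5))  # ['The first sentence.']
```

Both receivers show the same single line at 2.5 seconds, even though the late joiner never held cue 1.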

This is easy to present visually. The only downside is that there is no information that allows other parts of the system to identify that all three of those cues represent the same content, though that could possibly be done using some identifier other than the cue identifier, if there is a use case for it.

silviapfeiffer commented 7 years ago

So that's semantically problematic because now you're dealing with 3 different cues rather than a single one that is successively created. Also, you are still replacing the previous cue - you're not replacing content, but replacing end time.

Both of these are likely minor issues, so yes, this is an alternative.