w3c / webvtt

WebVTT Standard
https://w3c.github.io/webvtt/

Live captioning - incremental cues review #320

Open silviapfeiffer opened 7 years ago

silviapfeiffer commented 7 years ago

To address REQ2 of #318, we are after an extension of the WebVTT file format.

The principal idea is to map the TextTrack API calls from #319 onto additions to a WebVTT file, so that archiving those calls in a file replicates the functionality.

The approach we try out here is to use cue timestamps (`<00:00:00.000>`) as the means to split cues into smaller incremental pieces. `<now()>` stands for a cue timestamp carrying the now() time.

| Ref | 608 control commands | TextTrackCue API calls | WebVTT file addition |
| --- | --- | --- | --- |
| 1 | start caption text / resume caption text / resume direct captioning | `new VTTCue(now(), NULL, '')` (make sure to set the defaults as required by 608), then `textTrack.addCue(cue)` | `now() --> NULL` |
| 2 | add a character | `cue.text += char` | `<now()> char` |
| 3 | next row down toggle (includes end all style) | `cue.text += '\n'` (may need to end `</c>`, `</i>`, `</b>`, `</u>`) | `<now()> \n` |
| 4 | row indicator (one of 15 rows) | `cue.line = row` (whichever row is calculated) | `<now()> <set line=row>` |
| 5 | underline toggle | `cue.text += "<u>"` or `cue.text += "</u>"` (need to keep track of the toggle state) | `<now()> <u>` or `<now()> </u>` |
| 6 | style change (one of 7 text colors, and italics) | `cue.text += "<c.white>"` (need to have the color style classes pre-defined) and `cue.text += "<i>"` | `<now()> <c.white>`, `<now()> <i>`, etc. |
| 7 | 8 indent positions | `cue.position = offset` (whichever offset is calculated from the indent position) | `<now()> <set position=offset>` |
| 8 | 8 background colors | `cue.text += "<c.bg_white>"` (need to have the background color style classes pre-defined) | `<now()> <c.bg_white>` |
| 9 | backspace | `cue.text = cue.text.substr(0, cue.text.length - 1)` | `<now()> <set substr=(0,-1)>` |
| 10 | delete till end of row | `cue.text = cue.text.substr(0, cursor_pos)` (need to keep track of the 608 cursor position) | `<now()> <set substr=(0, cursor_pos)>` |
| 11 | roll-up caption with 2, 3 or 4 rows | `new VTTRegion()` then `region.lines = x` | N/A (make sure any required regions have been defined in the header) |
| 12 | flash on (srsly?) | `cue.text += "<c.blink>"` | `<now()> <c.blink>` |
| 13 | erase displayed memory (clear screen) | `cue.text = ''` | `<now()> <set substr=(0,0)>` |
| 14 | carriage return (scroll lines up) | `cue.endTime = now(); cue.region = region; new VTTCue();` | `<now()> <set region=ID> \n\n now() --> NULL` |
| 15 | end of caption | `cue.endTime = now()` | `<now()> <set endTime=now()>` |
| 16 | clear screen (erase display memory) | `cue.text = ''` | `<now()> <set substr=(0,0)>` |
| 17 | tab offset 1/2/3 (add whitespace) | `cue.text += " " * num_space` (calculate num_space from the tab offset as per https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html) | `<now()>` plus the required number of spaces |
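As a rough sketch of how rows 1, 2 and 15 of the table might accumulate in an archived file, assuming each `<now()>` is written out as the wall-clock cue timestamp at which the command arrived (the times below are made up):

```
WEBVTT

00:00:05.000 --> NULL
<00:00:05.200>H<00:00:05.350>e<00:00:05.500>y
<00:00:06.000> <set endTime=00:00:06.000>
```

Neither the `NULL` end time nor the `<set>` syntax exists in current WebVTT; both are the extensions under discussion here.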

Looks like what we need is a way to change cue settings and cut cues short halfway through, as well as an undefined end time that can be set at a later stage.

silviapfeiffer commented 6 years ago

goal from FOMS: add test cases

silviapfeiffer commented 6 years ago

discussion at FOMS: distinguish between two use cases:

1/ live broadcasting

This has near-realtime requirements and focuses on the use of WebVTT with MSE/HLS. The main requirement there is to allow undefined end times on cues and to update those end times later.
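For illustration, a minimal sketch of how this could be emulated through today's TextTrack API, assuming a far-future end time stands in for "undefined" (since `VTTCue` currently requires a numeric `endTime`):

```js
// Emulate an unbounded live cue with a far-future end time,
// then tighten it once the real end time is known.
const video = document.querySelector('video');
const track = video.addTextTrack('captions', 'Live', 'en');
track.mode = 'showing';

const cue = new VTTCue(video.currentTime, Number.MAX_VALUE, 'Hello');
track.addCue(cue);

// Later, when the caption is finalized:
cue.endTime = video.currentTime;
```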

2/ real-time video/audio communication (also called Realtime Captioning RTC)

This has realtime requirements, with an ability to support the "editing"-type functionality of 608/708. This use case motivates the spec work in this bug.

dwsinger commented 6 years ago

I think we may need VTT file-level support for the concept "I am updating this cue" so that clients can work out what's new/changed. One case is incremental builds, e.g. after speech recognition:

> I
> I think
> I think we may
> I think we may need
> I think we may need incrementally

and so on. Another case is where a cue has to be sent immediately but can then be edited and fixed.
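Purely to illustrate the concept (this syntax is hypothetical, not a proposal), one possible shape would be to reuse a cue identifier to mean "this replaces the earlier cue with the same identifier":

```
WEBVTT

cue1
00:00:01.000 --> 00:00:04.000
I think we may

cue1
00:00:01.000 --> 00:00:04.000
I think we may need incrementally
```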

I would very much like to understand best current practices in captioning of video telephony and conferences (if any).

micolous commented 2 years ago

> I would very much like to understand best current practices in captioning of video telephony and conferences (if any).

Running a recent conference, we had stenographers using streamtext.net to provide subtitles. These were delivered out of band with the rest of the live (HLS) video feed in a separate browser window.

We were lucky with synchronisation issues, because the stenographers and our video distribution service were both in the US, and most participants were in Australia behind a couple of layers of CDN -- the stenographers were up to 10 seconds ahead of most viewers, but it varied... there was a viewer who was 2 minutes behind!

Under the hood, Streamtext's viewer web UI polls a JSON API to get updates, which works as follows (sketched in code after the list):

  1. send request for the latest message ID
  2. wait 0.5 seconds
  3. request all messages since last known message ID
  4. go to step 2
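A sketch of that loop in JavaScript; the endpoint paths and response shape are assumptions for illustration, not StreamText's actual API:

```js
// Poll for caption updates, following steps 1-4 above.
async function pollCaptions(baseUrl, onText) {
  // Step 1: ask for the latest message ID so we only stream new text.
  let res = await fetch(`${baseUrl}/last`);
  let lastId = (await res.json()).id;
  for (;;) {
    // Step 2: wait 0.5 seconds between polls.
    await new Promise((resolve) => setTimeout(resolve, 500));
    // Step 3: request all messages since the last known message ID.
    res = await fetch(`${baseUrl}/since/${lastId}`);
    for (const msg of await res.json()) {
      onText(msg.text); // display immediately (may include Backspace)
      lastId = msg.id;
    }
    // Step 4: loop back to step 2.
  }
}
```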

In "basic" mode, the messages themselves are simple strings which are displayed immediately, with a couple of control characters:

There are more sophisticated output options with formatting, but the simple case is just that. There is also no timecode information available.

OBS has support for 608-style embedded captions, but the only way to deliver them is via a third-party WebSockets plugin, which lacks timecodes.

YouTube Live can ingest captions either 608-style (with timecodes) or out-of-band, and references a few third-party tools which integrate with it (using a private HTTP API). The latter appears not to support any sort of corrections, based on a comment in StreamText's documentation.

Twitch appears to only support 608-based captions (with timecodes).

From what I'm reading of the WebVTT spec, the key blocker for using it in live environments is the lack of corrections/updates for existing captions. Being able to have that in combination with timecode information would help ensure all viewers get captions at the correct time.

gkatsev commented 2 years ago

Thanks @micolous! Updates/corrections have come up as part of the unbounded-cues work that's happening now in #496. I've also been slowly documenting various use cases for live/unbounded cues in https://github.com/w3c/media-and-entertainment/pull/77. If you're able to elaborate on what you'd want, that would be helpful!

nigelmegitt commented 2 years ago

@micolous can you define exactly what you mean by a correction? Is it just a change in presentation of the existing cue, or is there a requirement to update the data model for a previously sent cue?

In other words, is it okay to have:

    time1 --> time2
    Sam likes being pickled.

    time2 --> time3
    Sam likes being tickled.

where the second one just replaces the first one.

or do you mean something more, er, revisionist, so that if there's a rewind window and the user goes back, they can never reproduce the erroneous caption? E.g.:

    time1 --> time2
    Sam likes being pickled.

    time1 --> time3   <== somehow overwriting the previous cue from time1
    Sam likes being tickled.

micolous commented 2 years ago

> @micolous can you define exactly what you mean by a correction? Is it just a change in presentation of the existing cue, or is there a requirement to update the data model for a previously sent cue?

The two examples you give are a bit different from what I had in mind, but I think they still have value. By "corrections", I want a Backspace command and direct captioning. :)

For context, I was in the video team for PyConAU 2021... and we used stenographers for the first time in that event. A couple of people in the team had used them before for a different event, so knew what to expect... but it was new for me. 😄

What we got out of Streamtext is a stream of the stenographer's inputs, typically arriving a word at a time (because of chording). My understanding of the stenographer's side of the setup is that their stenotype's software turns chords into keypress events, which end up in Streamtext's app, which sends them to their web service and broadcasts them to all viewers.

There is no timing information from Streamtext -- clients just see a stream of characters coming down the line. Streamtext has a demo on their website; it uses the same protocol as you get out of actual events. The only difference with that demo is that it never presses Backspace.

From what's available publicly, I believe the YouTube-StreamText interaction is an HTTP API with just one action: "Display Caption Immediately". As a result, you need a complete cue to display at any time, and your "correction" is pushing a new one to overwrite it.

**About parity with CEA-608**

Since writing my earlier comment, I've been looking into libcaption, which seems to be the only thing in the open source space capable of producing and muxing CEA-608 captions. I've also found Caption Inspector, which is a useful debugging tool for those streams.

I'm still experimenting with all of this -- so far it looks like libcaption has some serious problems with its outputs compared to live ATSC/NTSC broadcasts. I've started implementing smearing support for pre-recorded content in a fork (so that you don't have one frame with 60 blocks of control commands on it); things are working better, but not perfectly, due to FLV frame-ordering problems.

My next experiment will be trying to build some simulated live caption feed in direct mode, and maybe seeing if I can get CEA-708 captions working instead.

The control command end of caption is actually quite different from how it is presented in this proposal: it is part of the protocol's off-screen buffer control. In pop-on mode, captions are composed in a non-displayed buffer, and end of caption swaps that buffer with the displayed one.

My ultimate wish for a future event would be to give stenographers a low-latency audio feed (maybe over WebRTC), and instrument that feed so we can match their inputs to where in the feed they were listening. Then, use that to inject captions at a later stage of our video pipeline (which has several seconds of delay) to send off to our streaming provider.

Working with CEA-608 streams is very, very difficult and fragile compared to WebVTT, so getting some protocol extensions to bring WebVTT to some sort of feature parity would be very helpful. It would also help when convincing our current streaming provider that they need to up their game a bit, as they only support WebVTT for pre-recorded content -- not live.

micolous commented 2 years ago

The fundamental difference with 608/708 captions (and how StreamText works as well) is that they're a stream 1️⃣ and work similarly to a terminal control protocol. By contrast, WebVTT is very cue-oriented, more like packets.

Pre-recorded content with EIA-608 captions expresses a cue-oriented structure by using resume caption loading and end of caption to control output display. CEA-708 implements it slightly differently, defining initially-invisible windows which are later displayed. The only change for live in both of these environments is that they use direct captioning, or windows which are always visible.

I initially approached the "live" concern looking for the same "stream" thinking in WebVTT, and only passing a minimal amount of data, but I don't think direct captioning fits well within WebVTT's design -- and it may not be needed.

#496 with unbounded cues, combined with replacing the previous cue as suggested earlier, could do well enough for a "roll-up" style caption built from word-at-a-time inputs -- which would be good enough for working with StreamText. For example:

    time1 -->
    One small step
    for

    time2 -->
    One small step
    for man

    time3 -->
    <rollup 1>
    for man.
    One

    time4 -->
    for a man.
    One giant

The complicated part with WebVTT is handling word-at-a-time input (rather than line-at-a-time): you'd need a step in between which renders the live stream of the stenographer's inputs into a buffer, and then snapshots that buffer to produce cues.
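As a sketch of that intermediate step, assuming the replace-the-previous-cue approach suggested above (the `RollupBuffer` class and its input conventions are illustrative assumptions, not part of WebVTT or StreamText):

```js
// Accumulate a word-at-a-time steno feed into a small row buffer,
// and snapshot the buffer into a replacement cue on every update.
class RollupBuffer {
  constructor(track, maxRows = 2) {
    this.track = track; // a TextTrack, e.g. from video.addTextTrack()
    this.maxRows = maxRows;
    this.rows = [''];
    this.cue = null;
  }

  // Feed one input: a chunk of text, '\n' for a new row, '\b' for backspace.
  input(text, now) {
    const last = this.rows.length - 1;
    if (text === '\b') {
      // Backspace: trim one character from the current row.
      this.rows[last] = this.rows[last].slice(0, -1);
    } else if (text === '\n') {
      // New row: push the old rows up, dropping any that scroll off.
      this.rows.push('');
      if (this.rows.length > this.maxRows) this.rows.shift();
    } else {
      this.rows[last] += text;
    }
    this.snapshot(now);
  }

  // Snapshot the buffer into a cue that replaces the previous one.
  snapshot(now) {
    if (this.cue) this.track.removeCue(this.cue);
    // Number.MAX_VALUE stands in for an unbounded end time.
    this.cue = new VTTCue(now, Number.MAX_VALUE, this.rows.join('\n'));
    this.track.addCue(this.cue);
  }
}
```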

TV captions can also seamlessly switch between pre-recorded content and live content, and each program can define whatever caption styles it wants, without needing to restart the stream. In order to achieve parity, WebVTT would need a mechanism to redefine regions and styles mid-stream, and provide a "full reset" command for switching between different content sources (such as a commercial break):

    WEBVTT

    REGION
    <!-- define regions here -->

    <!-- program cues -->

    time1 -->
    And now a word from
    our sponsors....

    time2 --> time2
    <reset>
    <!-- forget all the regions that were set up before, hide all captions -->

    REGION
    <!-- define new regions -->

    time3 --> time4 region:left
    Ask your doctor today about Advertising!

1️⃣ Yes, CEA-708 is built with DTVCC packets which contain service blocks (caption tracks), but they still carry a mix of "terminal-control"-type commands, such as define window and set pen colour, mixed with human-readable text. These "packets" exist mostly to multiplex different streams.