silviapfeiffer opened this issue 8 years ago
goal from FOMS: add test cases
discussion at FOMS: separate between two use cases:
1/ live broadcasting
This has near-realtime requirements and focuses on the use of WebVTT with MSE/HLS. Main requirement there is to allow undefined end times and update them on cues.
2/ real-time video/audio communication (also called Realtime Captioning RTC)
This has realtime requirements with an ability to support the "editing"-type functionality of 608/708. This use case motivates the spec work in this bug.
I think we may need VTT file-level support for the concept "I am updating this cue" so that clients can work out what's new/changed. One case is incremental builds, e.g. after speech recognition: "I", "I think", "I think we may", "I think we may need", "I think we may need incrementally",
and so on. Another case is where a cue has to be sent immediately but can then be edited and fixed.
I would very much like to understand best current practices in captioning of video telephony and conferences (if any).
> I would very much like to understand best current practices in captioning of video telephony and conferences (if any).
Running a recent conference, we had stenographers using streamtext.net to provide subtitles. These were delivered out of band with the rest of the live (HLS) video feed in a separate browser window.
We got lucky with synchronisation, because the stenographers and our video distribution service were both in the US, and most participants were in Australia behind a couple of layers of CDN -- the stenographers were up to 10 seconds ahead of most viewers, but it varied... there was a viewer who was 2 minutes behind!
Under the hood, Streamtext's viewer web UI polls a JSON API to get updates, which works as follows:
In "basic" mode, the messages themselves are simple strings which are displayed immediately, with a couple of control characters:
\r\n: newline
\x08: backspace

There are more sophisticated output options with formatting, but the simple case is just that. There is also no timecode information available.
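For illustration, here is a minimal sketch of how a viewer could apply that kind of character stream to a plain-text buffer; this is not StreamText's actual client code, and `applyChunk` is a made-up helper that only handles the two control sequences above:

```js
// Hypothetical sketch: apply a "basic"-mode character stream to a plain-text
// caption buffer. Only \r\n (newline) and \x08 (backspace) are treated as
// control sequences; everything else is appended verbatim.
function applyChunk(buffer, chunk) {
  for (let i = 0; i < chunk.length; i++) {
    if (chunk[i] === '\r' && chunk[i + 1] === '\n') {
      buffer += '\n';
      i++; // consume the \n as well
    } else if (chunk[i] === '\x08') {
      buffer = buffer.slice(0, -1); // backspace: drop the last character
    } else {
      buffer += chunk[i];
    }
  }
  return buffer;
}

// Example: a correction arriving as keystrokes.
let text = applyChunk('', 'HELLO WORLDD');
text = applyChunk(text, '\x08' + '\r\n');
// text === 'HELLO WORLD\n'
```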
OBS has support for 608-style embedded captions, but the only way to deliver it is via a third-party WebSockets plugin, which lacks timecodes.
YouTube Live can ingest either 608-style captions (with timecodes) or out-of-band captions, and references a few third-party tools which integrate (using a private HTTP API). This appears not to support any sort of corrections, based on a comment in StreamText's documentation.
Twitch appears to only support 608-based captions (with timecodes).
From what I'm reading of the WebVTT spec, the key blocker for using it in live environments is the lack of corrections/updates to existing captions. Being able to have that, in combination with timecode information, would help ensure all viewers get captions at the correct time.
Thanks @micolous! Updates/corrections are something that has come up as part of the unbounded cues work that's happening now in #496. I've been slowly updating the use cases for live/unbounded cues here: https://github.com/w3c/media-and-entertainment/pull/77 If you're able to elaborate on what you'd want, that would be helpful!
@micolous can you define exactly what you mean by a correction? Is it just a change in presentation of the existing cue, or is there a requirement to update the data model for a previously sent cue?
In other words is it okay to have:
time1 --> time2
Sam likes being pickled.

time2 --> time3
Sam likes being tickled.
where the second one just replaces the first one.
or do you mean something more, er, revisionist, so that if there's a rewind window and the user goes back, they can never reproduce the erroneous caption? E.g.:
time1 --> time2
Sam likes being pickled.

time1 --> time3 <== somehow overwriting previous cue from time1
Sam likes being tickled.
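In today's TextTrack API terms, the two interpretations look roughly like this (a sketch only, not a proposal; the function names are made up):

```js
// Interpretation 1: supersede the erroneous cue with a new one; the old cue
// stays in the track's data model, so a rewind can still show it.
function correctBySupersession(track, time2, time3) {
  track.addCue(new VTTCue(time2, time3, 'Sam likes being tickled.'));
}

// Interpretation 2: rewrite the previously sent cue in place, so the
// erroneous caption can never be reproduced from this track again.
function correctByRevision(erroneousCue, time3) {
  erroneousCue.text = 'Sam likes being tickled.'; // VTTCue.text is writable
  erroneousCue.endTime = time3;
}
```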
> @micolous can you define exactly what you mean by a correction? Is it just a change in presentation of the existing cue, or is there a requirement to update the data model for a previously sent cue?
The two examples you give are a bit different to what I had in mind, but I think they still have value. By "corrections", I want a Backspace command and direct captioning. :)
For context, I was in the video team for PyConAU 2021... and we used stenographers for the first time in that event. A couple of people in the team had used them before for a different event, so knew what to expect... but it was new for me. 😄
What we got out of Streamtext is a stream of the stenographer's inputs, typically coming a word at a time (because of chording). My understanding of the stenographer's side of the setup is that their stenotype's software turns chords into keypress events, which end up in Streamtext's app, which sends them to their web service, which broadcasts them to all viewers.
There is no timing information from Streamtext - clients just see a stream of characters coming down the line. Streamtext has a demo on their website that uses the same protocol as actual events; the only difference is that the demo never presses Backspace.
From what's available publicly, I believe the YouTube-StreamText interaction is an HTTP API with just one action: "Display Caption Immediately". As a result, you need a complete cue to display at any time, and your "correction" is to push a new one to overwrite it.
Since writing my earlier comment, I've been looking into libcaption, which seems to be the only thing in the open source space capable of producing and muxing CEA-608 captions. I've also found Caption Inspector, which is a useful debugging tool for those streams.
I'm still experimenting with all of this -- so far it looks like libcaption has some serious problems with its outputs compared to live ATSC/NTSC broadcasts. I've started implementing smearing support for pre-recorded content in a fork (so that you don't have one frame with 60 blocks of control commands on it); things are working better but not perfect due to FLV frame ordering problems.
My next experiment will be trying to build some simulated live caption feed in direct mode, and maybe seeing if I can get CEA-708 captions working instead.
The control command "end of caption" is actually quite different to how it's presented in this proposal - it's part of the protocol's off-screen buffer control:

resume direct captioning: put following inputs into the on-screen buffer
resume caption loading: put following inputs into the off-screen buffer
end of caption: move off-screen buffer to on-screen buffer; does nothing if off-screen buffer is empty
erase display memory: clear on-screen buffer
erase non-display memory: clear off-screen buffer

My ultimate wish for a future event would be to give stenographers a low-latency audio feed (maybe over WebRTC), and instrument that feed so we can match their inputs to where they were listening in the feed. Then, use that to inject captions at a later stage of our video pipeline (which has several seconds of delay) to send off to our streaming provider.
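For clarity, here is a rough sketch of the buffer model described above; it is my own simplification, not a real 608 decoder, and the class and method names are made up:

```js
// Simplified model of the CEA-608 display / non-display buffer pair.
class CaptionBuffers {
  constructor() {
    this.onScreen = '';   // display memory: what the viewer currently sees
    this.offScreen = '';  // non-display memory: composed out of sight
    this.direct = false;
  }
  resumeDirectCaptioning() { this.direct = true; }   // paint-on / roll-up style
  resumeCaptionLoading()   { this.direct = false; }  // pop-on style
  write(text) {
    if (this.direct) this.onScreen += text;
    else this.offScreen += text;
  }
  endOfCaption() {
    // move off-screen buffer to on-screen; no-op if nothing was composed
    if (this.offScreen !== '') {
      this.onScreen = this.offScreen;
      this.offScreen = '';
    }
  }
  eraseDisplayMemory()    { this.onScreen = ''; }
  eraseNonDisplayMemory() { this.offScreen = ''; }
}
```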
Working with CEA-608 streams is very very difficult and fragile compared to WebVTT, so getting some protocol extensions to bring it to some sort of feature parity would be very helpful. It'll also help when convincing our current streaming provider they need to up their game a bit, as they only support WebVTT for pre-recorded content -- not live.
The fundamental difference with 608/708 captions (and how StreamText works as well) is that they're a stream 1️⃣ and work similarly to a terminal control protocol. By contrast, WebVTT is very cue-oriented, like a packet.
Pre-recorded content with EIA-608 captions expresses a cue-oriented structure by using "resume caption loading" and "end of caption" to control output display. CEA-708 implements it slightly differently, defining initially-invisible windows which are later displayed. The only change for live in both of these environments is that they use direct captioning, or windows which are always visible.
I initially approached the "live" concern looking for the same "stream" thinking in WebVTT, and only passing a minimal amount of data, but I don't think direct captioning fits well within WebVTT's design -- and it may not be needed.
time1 -->
One small step
for

time2 -->
One small step
for man

time3 -->
<rollup 1>
for man.
One

time4 -->
for a man.
One giant
The complicated part with WebVTT is handling input word-at-a-time (rather than line-at-a-time): you'd need an intermediate step which renders the live stream of the stenographer's inputs into a buffer, and then snapshots that buffer to produce cues (a rough sketch of that step follows below).
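A rough sketch of that intermediate step, using today's API; the class name and snapshot policy are illustrative only, and a very large end time stands in for the true unbounded end times this thread is asking for:

```js
// Hypothetical bridge between a live keystroke stream and WebVTT cues:
// keystrokes accumulate in a buffer, and snapshot() turns the current
// buffer into a replacement cue.
class LiveCueWriter {
  constructor(track) {
    this.track = track;   // e.g. from video.addTextTrack('captions', 'Live', 'en')
    this.buffer = '';
    this.currentCue = null;
  }
  input(ch) {
    if (ch === '\x08') this.buffer = this.buffer.slice(0, -1); // backspace
    else this.buffer += ch;
  }
  snapshot(videoTime) {
    // Replace the previous cue with an updated, still-"unbounded" one.
    if (this.currentCue) this.track.removeCue(this.currentCue);
    this.currentCue = new VTTCue(videoTime, Number.MAX_SAFE_INTEGER, this.buffer);
    this.track.addCue(this.currentCue);
  }
}
```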
TV captions can also seamlessly switch between pre-recorded content and live content, and each program can define whatever caption styles it wants, without needing to restart the stream. In order to achieve parity, WebVTT would need a mechanism to redefine regions and styles mid-stream, and provide a "full reset" command for switching between different content sources (such as a commercial break):
WEBVTT

REGION
<!-- define regions here -->

<!-- program cues -->
time1 -->
And now a word from
our sponsors....

time2 --> time2
<reset>
<!-- forget all the regions that were set up before, hide all captions -->

REGION
<!-- define new regions -->

time3 --> time4 region:left
Ask your doctor today about Advertising!
1️⃣ Yes, CEA-708 is built with DTVCC packets which contain service blocks (caption tracks), but they still have a mix of "terminal-control" type commands, such as "define window" and "set pen colour", mixed with human-readable text. These "packets" exist mostly to multiplex different streams.
To address REQ2 of #318, we are after an extension of the WebVTT file format.
The principal idea is that we map the TextTrack API calls from #319 to how we would archive them in a WebVTT file to replicate the functionality.
The approach we try out here is to use cue timestamps (the <00:00:00.000> syntax) as the means to split cues into smaller incremental pieces. <now()> will be a cue timestamp with the now() time.
| TextTrack API calls (from #319) | WebVTT file syntax |
| --- | --- |
| `new VTTCue(now(), NULL, '')` - make sure to set the defaults as required by 608 - then `textTrack.addCue(cue)` | `now() --> NULL` |
| `cue.text += char` | `<now()> char` |
| `cue.text += '\n'` (may need to end `</c>`, `</i>`, `</b>`, `</u>`) | `<now()> \n` |
| `cue.line = row` (whichever row calculated) | `<now()> <set line=row>` |
| `cue.text += "<u>"` or `cue.text += "</u>"` (need to keep track of toggle state) | `<now()> <u>` or `<now()> </u>` |
| `cue.text += "<c.white>"` (need to have the color style classes pre-defined) and `cue.text += "<i>"` etc. | `<now()> <c.white> <i>` |
| `cue.position = offset` (whichever offset calculated from indent pos) | `<now()> <set position=offset>` |
| `cue.text += "<c.bg_white>"` (need to have the background color style classes pre-defined) | `<now()> <c.bg_white>` |
| `cue.text = cue.text.substr(0, cue.text.length - 1)` | `<now()> <set substr=(0,-1)>` |
| `cue.text = cue.text.substr(0, cursor_pos)` (need to keep track of the 608 cursor position) | `<now()> <set substr=(0, cursor_pos)>` |
| `new VTTRegion()` then `region.lines = x` | |
| `cue.text += "<c.blink>"` | `<now()> <c.blink>` |
| `cue.text = ''` | `<now()> <set substr=(0,0)>` |
| `cue.endTime = now(); cue.region = region; new VTTCue();` | `<now()> <set region=ID> \n\n now() --> NULL` |
| `cue.endTime = now()` | `<now()> <set endTime=now()>` |
| `cue.text = ''` | `<now()> <set substr=(0,0)>` |
| `cue.text += " " * num_space` (calculate num_space from tab offset as per https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html ) | `<now()>` (add required number of spaces) |

Looks like what we need is a way to change cue settings and cut cue length half-way through cues, as well as an undefined end time that can be set at a later stage.
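As a sanity check of this mapping, here is a rough sketch of an encoder archiving such live edits. The `<set ...>` tags and the NULL end time are the proposals from this issue, not valid WebVTT today; the class name, the `write` callback, and the seconds-based timestamps are placeholders for illustration:

```js
// Sketch of an encoder that writes live TextTrack edits in the proposed
// incremental syntax above (not valid WebVTT today).
class IncrementalVttWriter {
  constructor(write) {
    this.write = write;               // e.g. chunk => socket.send(chunk)
    this.write('WEBVTT\n\n');
  }
  now() { return (Date.now() / 1000).toFixed(3); }   // stand-in for now()
  startCue()     { this.write(`${this.now()} --> NULL\n`); }
  appendChar(ch) { this.write(`<${this.now()}>${ch}`); }
  backspace()    { this.write(`<${this.now()}><set substr=(0,-1)>`); }
  setLine(row)   { this.write(`<${this.now()}><set line=${row}>`); }
  endCue()       { this.write(`<${this.now()}><set endTime=now()>\n\n`); }
}
```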