VTT Metadata Cue format is ambiguous; some metadata may be unintentionally presented to the user in a context outside HTML

cookiecrook commented 1 year ago

VTT Metadata Cue format is ambiguous; some metadata may be unintentionally presented to the user in a context outside HTML.

Consider clarifying that metadata cues SHOULD or MUST be formatted as one or more unambiguous patterns. JSON is the obvious one, to retain backwards compatibility with the JSON usage documented in the VTT spec, but there may be others.

WEBVTT

1
00:00:10.123 --> 00:00:15.432
{
  "key": "value"
}

Background

§ 4.2.1. WebVTT metadata text (Normative) defines metadata text as:

WebVTT metadata text consists of any sequence of zero or more characters other than U+000A LINE FEED (LF) characters and U+000D CARRIAGE RETURN (CR) characters, each optionally separated from the next by a WebVTT line terminator. (In other words, any text that does not have two consecutive WebVTT line terminators and does not start or end with a WebVTT line terminator.)

WebVTT metadata text cues are only useful for scripted applications (e.g. using the metadata text track kind in a HTML text track).

§ 1.7. Metadata example (Informative) clarifies:

A WebVTT file can consist of time-aligned metadata.

Metadata can be any string and is often provided as a JSON construct.

Problem

"Metadata can be any string" results in a format that is ambiguous, and therefore may be presented to the user unintentionally.

In an HTML <video> element, this ambiguity is resolved by the author providing a kind="metadata" attribute on the text track.

But there isn't a logical place to duplicate this disambiguation in some other VTT contexts, including when they are embedded in some media container formats.

Proposed Solution

Consider clarifying that metadata cues SHOULD or MUST be formatted as one or more unambiguous patterns. JSON is the obvious one, to retain backwards compatibility with the JSON usage documented in the VTT spec, but there may be others.

Additional context for this change in the following issue.

#512

cookiecrook commented 1 year ago

@eric-carlson @chrisn @gkatsev

bdougherty commented 1 year ago

FWIW, there is a somewhat common metadata format that uses URLs and not JSON for describing preview thumbnails when hovering over the progress bar:

WEBVTT

00:00:00.000 --> 00:00:00.979
https://cdn.example.com/thumbnails.jpg#xywh=0,0,284,160

00:00:00.979 --> 00:00:01.959
https://cdn.example.com/thumbnails.jpg#xywh=284,0,284,160

00:00:01.959 --> 00:00:02.938
https://cdn.example.com/thumbnails.jpg#xywh=568,0,284,160
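
A minimal sketch of how a script might consume cues in this thumbnail format, assuming each cue payload is a single URL whose fragment uses the `xywh=` spatial syntax from the W3C Media Fragments URI spec. The function name `parse_thumbnail_cue` is hypothetical, and only plain or `pixel:` values are handled.

```python
from urllib.parse import urlsplit


def parse_thumbnail_cue(payload):
    """Return (image_url, (x, y, w, h)) for a thumbnail cue, or None.

    Hypothetical helper: assumes the payload is one URL with an
    "#xywh=x,y,w,h" fragment, as in the example above.
    """
    parts = urlsplit(payload.strip())
    fragment = parts.fragment
    if not fragment.startswith("xywh="):
        return None
    values = fragment[len("xywh="):]
    # Media Fragments also allow a "percent:" prefix; this sketch does not.
    if values.startswith("pixel:"):
        values = values[len("pixel:"):]
    try:
        x, y, w, h = (int(v) for v in values.split(","))
    except ValueError:
        return None
    # Drop the fragment to get the sprite sheet URL itself.
    image_url = parts._replace(fragment="").geturl()
    return image_url, (x, y, w, h)
```

A player could use the returned rectangle to crop the sprite sheet when the pointer hovers over the progress bar.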

cookiecrook commented 1 year ago

Any leading URI protocol regex seems easy enough to incorporate in addition to JSON.
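
A rough sketch of the heuristic described here, assuming a payload counts as metadata if it parses as a JSON object or array, or if it starts with a URI scheme followed by `://`. The name `classify_cue_payload` and the three labels are illustrative only; nothing like this is defined in the VTT spec.

```python
import json
import re

# An RFC 3986 scheme (letter, then letters/digits/"+"/"."/"-") followed by "://".
_URI_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def classify_cue_payload(text):
    """Guess whether a cue payload is 'json', 'uri', or 'unknown'."""
    stripped = text.strip()
    try:
        value = json.loads(stripped)
    except ValueError:
        value = None
    # Bare numbers or strings are valid JSON but too ambiguous to count.
    if isinstance(value, (dict, list)):
        return "json"
    if _URI_SCHEME.match(stripped):
        return "uri"
    return "unknown"
```

This stays a guess: a subtitle cue that happens to consist of a URL would still be classified as `uri`, so a heuristic like this reduces the ambiguity without removing it.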

chrisn commented 1 year ago

It seems difficult to introduce stricter constraints on the format without risking breaking someone's application. I note that the HTML spec includes a VTT example with a non-JSON custom metadata format. I have no idea how commonly such formats are used in practice, though. So if we do make a change to the VTT spec, that example would also need to be updated.

Have you considered other options for how schema information could be signaled? One idea: the VTT parsing algorithm requires implementations to ignore everything following WEBVTT on the first line of the file. Conceivably, a backwards-compatible change could be to extend this line to include an optional schema identifier, which would be ignored by today's implementations.
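
As a sketch of that idea, assuming the hypothetical identifier is simply whatever text follows `WEBVTT` and a space or tab on the first line (no such syntax is specified anywhere; `read_header_hint` and the `video-thumbnails` value are made up for illustration):

```python
def read_header_hint(first_line):
    """Return the text after the WEBVTT signature, or None if there is none.

    Hypothetical: today's parsers ignore this text, so treating it as a
    schema identifier is an assumption of this sketch.
    """
    line = first_line.lstrip("\ufeff").rstrip("\r\n")
    if not line.startswith("WEBVTT"):
        raise ValueError("not a WebVTT file")
    rest = line[len("WEBVTT"):]
    if rest == "":
        return None
    # The signature must be followed by a space or tab, or end the line.
    if rest[0] not in " \t":
        raise ValueError("not a WebVTT file")
    return rest.strip() or None


print(read_header_hint("WEBVTT video-thumbnails\n"))
```

Because existing implementations skip the rest of the signature line, a file written this way would still load in them unchanged.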

gkatsev commented 1 year ago

As @chrisn mentioned, introducing stricter constraints won't work, as it'll break folks. As you mentioned, the kind is available if the VTT file is included in an HTML file which describes its kind. Having this be in the WebVTT file itself has been briefly talked about, but so far nothing has materialized.

Those discussions involve WebVTT re-adding headers as an extension point to facilitate HLS's X-TIMESTAMP-MAP and webm's metadata. This header could be used to display the type of VTT file this is.

@chrisn's suggestion of using the WEBVTT line is similar to using headers. Personally, I don't think it's good enough because we want to definitively remove ambiguity, so using an optional extension for a core part of the spec seems fraught.

I think we probably want to have some other formal backwards-compatible signal that's in-file to indicate the kind of VTT file being represented. Perhaps a new METADATA block, a la REGION and STYLE? It could start with a single property kind, that is equivalent to HTML Text Track kind.

My only concern with such an addition is the potential appetite for implementing WebVTT additions by folks.

cookiecrook commented 1 year ago

Mentioned in TTWG April 27 Minutes.

Resolved to keep this open and explore @gkatsev's idea:

Perhaps a new METADATA block, a la REGION and STYLE? It could start with a single property kind, that is equivalent to HTML Text Track kind.

cookiecrook commented 1 year ago

Resolved to keep this open and explore @gkatsev's idea:

Perhaps a new METADATA block, a la REGION and STYLE? It could start with a single property kind, that is equivalent to HTML Text Track kind.

@eric-carlson and I would like to make some progress on this idea, defining potential properties with #512 as the initial use case. Is anyone else interested in meeting about this in a breakout session at the Sept 2023 TPAC in Sevilla?

#512

cookiecrook commented 1 year ago

Here's what we're thinking: A new ATTRIBUTES block. Renamed from @gkatsev's original proposal to avoid the redundancy in the case of METADATA kind: metadata.

Alongside Existing Usage (Subtitles, Captions, Descriptions)

Dialog-Only Subtitles

WEBVTT

ATTRIBUTES
kind: subtitles
srclang: es-mx
label: Español

NOTE
Standard subtitles (unlike CC or SDH captions) typically
translate spoken dialog or signage, but not audible sound
effects like "dogs barking."

1
00:00:10.123 --> 00:00:15.432
¡Hola! ¿Qué tal?

Captions aka Subtitles for the Deaf and Hard-of-Hearing (including non-dialog sounds)

WEBVTT

ATTRIBUTES
kind: captions
srclang: es-mx
label: Español (SDH)

NOTE
Captions (SDH aka Subtitles for the Deaf and Hard-of-Hearing) 
typically include spoken dialog as well as important audible 
sounds such as "floor boards creak", "dogs barking", or in 
this case, "music".

1
00:00:10.123 --> 00:00:15.432
¡Hola! ¿Qué tal?

2
00:00:47.462 --> 00:01:04.028
[♫ música ♫]

Descriptions (e.g. audio descriptions and/or braille for the Blind/LowViz/DeafBlind communities)

WEBVTT

ATTRIBUTES
kind: descriptions
srclang: en-us
label: English (AD)

NOTE
VTT-based descriptions are meant to render as text-to-speech audio or braille,
for blind or deafblind audiences, not usually as visual captions on screen. 
As such, the option/label might be displayed in an audio menu or elsewhere. 

1
00:00:10.123 --> 00:00:15.432
A young girl tiptoes down a dark hallway.
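
A minimal sketch of reading the proposed block, assuming it follows the same conventions as the examples above: a line containing only `ATTRIBUTES`, then one `key: value` pair per line, ending at the first blank line. The function name is hypothetical and the block syntax is still only a proposal.

```python
def parse_attributes_block(vtt_text):
    """Return key/value pairs from the first ATTRIBUTES block, or {}."""
    normalized = vtt_text.replace("\r\n", "\n").replace("\r", "\n")
    # Blocks in a VTT file are separated by blank lines.
    for block in normalized.split("\n\n"):
        lines = block.strip("\n").split("\n")
        if lines[0].strip() != "ATTRIBUTES":
            continue
        attributes = {}
        for line in lines[1:]:
            key, sep, value = line.partition(":")
            if not sep:
                continue  # Skip lines that are not "key: value".
            attributes[key.strip()] = value.strip()
        return attributes
    return {}


SAMPLE = (
    "WEBVTT\n\n"
    "ATTRIBUTES\nkind: subtitles\nsrclang: es-mx\nlabel: Español\n\n"
    "1\n00:00:10.123 --> 00:00:15.432\n¡Hola!\n"
)
print(parse_attributes_block(SAMPLE))
```

A consumer outside HTML, such as a media container muxer, could read `kind` this way before deciding whether the cues should ever be rendered.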

Metadata Examples (with a new `type` attribute)

A potential accessible update to the Thumbnails usage @bdougherty mentioned above.

WEBVTT

ATTRIBUTES
kind: metadata
type: video-thumbnails

NOTE
In order to support accessibility, the simple URL-only thumbnail 
format mentioned above should be updated to include "alt" text for 
each. In the potential format below, I've written that as a JSON 
block containing alt strings for multiple supported languages.

00:00:01.959 --> 00:00:02.938
{
    "src": "https://cdn.example.com/thumbnails.jpg#xywh=0,0,284,160",
    "alt": {
        "en-us": "Miguel crosses the marigold bridge to the land of the dead.",
        "es-mx": "Miguel cruza el puente marigold hacia la tierra de los muertos."
    }
}
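
A short sketch of how a player might pick the alt text from a cue in this potential format, assuming the `src` and `alt` keys shown above and an ordered list of preferred language tags. `thumbnail_alt_text` is a hypothetical helper; matching here is exact and does not attempt BCP 47 fallback.

```python
import json


def thumbnail_alt_text(cue_payload, preferred_languages):
    """Return the first available alt string for the preferred languages."""
    data = json.loads(cue_payload)
    alt = data.get("alt", {})
    for lang in preferred_languages:
        if lang in alt:
            return alt[lang]
    return None  # No alt text in any preferred language.
```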

A Proposal for the Flashing Lights Avoidance Use Case from Issue #512

WEBVTT

ATTRIBUTES
kind: metadata
type: video-flash-avoidance

NOTE
Spec for "video-flash-avoidance" (or "video-flash", "strobing", etc.) type would define 
usage as a JSON block with one required and two optional key/value pairs:
- integer "intensity": 0-100
- opt token "flash-type": ["general-flash" (default) | "red-flash" | "spatial-pattern"]
- opt token "algorithm": ["undefined" (default) | "harding" | "apple-vfr" (bikeshed, algo needs name)]

NOTE
The v1 Apple open-sourced algorithm (bikeshed name "apple-vfr" for "video 
flashing reduction") only detects "general-flash" patterns (not yet 
"red-flash" or "spatial-pattern"), but we think it performs better than the 
de facto Harding test in those instances of "general-flash". See below for 
example where "harding" would still need to be used to denote the 
"spatial-pattern" cue that the open-sourced algorithm doesn't yet account for.
Cite: https://developer.apple.com/accessibility/#dim-flashing-lights

1
00:00:10.123 --> 00:00:15.432
{
  "intensity": 75,
  "flash-type": "general-flash",
  "algorithm": "apple-vfr"
}

2
00:00:47.462 --> 00:01:04.028
{
  "intensity": 100,
  "flash-type": "spatial-pattern",
  "algorithm": "harding"
}
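
A sketch of how a consumer might validate one of these cues, assuming the key names, token values, and defaults listed in the NOTE above (all of which are still being bikeshedded). It follows the NOTE's "integer" wording for `intensity`, so a quoted value such as `"75"` is rejected. `parse_flash_cue` is a hypothetical name.

```python
import json

FLASH_TYPES = {"general-flash", "red-flash", "spatial-pattern"}
ALGORITHMS = {"undefined", "harding", "apple-vfr"}


def parse_flash_cue(payload):
    """Validate a flash-avoidance cue and fill in the optional defaults."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("cue payload must be a JSON object")
    intensity = data.get("intensity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(intensity, bool) or not isinstance(intensity, int):
        raise ValueError("'intensity' is required and must be an integer")
    if not 0 <= intensity <= 100:
        raise ValueError("'intensity' must be between 0 and 100")
    flash_type = data.get("flash-type", "general-flash")
    algorithm = data.get("algorithm", "undefined")
    if flash_type not in FLASH_TYPES:
        raise ValueError("unknown 'flash-type'")
    if algorithm not in ALGORITHMS:
        raise ValueError("unknown 'algorithm'")
    return {
        "intensity": intensity,
        "flash-type": flash_type,
        "algorithm": algorithm,
    }
```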

cookiecrook commented 1 year ago

Is anyone else interested in meeting about this in a breakout session at the Sept 2023 TPAC in Sevilla?

Better yet, time on the standard TTWG meeting schedule for TPAC.

Note: I looked for a F2FCandidate or TPACCandidate keyword. Not sure how you're tracking that list.

chrisn commented 1 year ago

Is anyone else interested in meeting about this in a breakout session at the Sept 2023 TPAC in Sevilla?

I'm interested, yes.

cookiecrook commented 1 year ago

@nigelmegitt et al, can we get time on the TTWG schedule at TPAC? I think that's a better forum than a breakout session. Also would be good to coordinate with @jasonjgw and others interested in MAUR Issue 2

cookiecrook commented 1 year ago

I haven't seen a published schedule, but most of Tuesday afternoon (Sept 12, CET) is still open for me.

cookiecrook commented 1 year ago

Currently scheduled for 14:30 CET on Tuesday Sept 12th. Thanks Nigel.

andreastai commented 1 year ago

WEBVTT

ATTRIBUTES
kind: metadata
type: video-flash-avoidance

@cookiecrook Some questions regarding your proposal of a new type attribute in the proposed attribute block:

As other attributes are derived from TextTrack: do you suggest also adding type as an attribute to TextTrack?
Should some or all possible values for type be defined in a controlled vocabulary/registry? In a registry, the specific values could be linked to the specification that defines the content format of a specific type (e.g. the spec for "video-flash-avoidance").

css-meeting-bot commented 1 year ago

The Timed Text Working Group just discussed VTT Metadata Cue format is ambiguous; some metadata may be unintentionally presented to the user in a context outside HTML w3c/webvtt#511, and agreed to the following:

SUMMARY: Strong support for this new ATTRIBUTE block but we probably don't want this to hold up the current version of WebVTT from progressing to Rec

The full IRC log of that discussion

<nigel> Subtopic: VTT Metadata Cue format is ambiguous; some metadata may be unintentionally presented to the user in a context outside HTML w3c/webvtt#511
<nigel> github: https://github.com/w3c/webvtt/issues/511
<jcraig> Slides... https://www.icloud.com/keynote/09dCEDKVwnUk_nhjBrkArTIcg#WebVTTMetadata_public
<nigel> jcraig: Hi everyone, we've been talking about this topic for at least a couple of years, we think we
<nigel> .. have a way forward thanks to a suggestion from Gary.
<nigel> .. I'm going to cover problems with VTT metadata today,
<nigel> .. Proposed solution with examples,
<nigel> .. and a Specific new use case for strobing
<nigel> .. [slide 3]
<nigel> .. Example of thumbnails metadata.
<nigel> .. Problem today is ambiguity
<nigel> .. [slide 5]
<nigel> .. Can't tell if it's metadata
<nigel> .. [slide 6]
<nigel> .. Or what type of metadata, e.g. key value pair vs JSON
<nigel> .. [slide 7]
<nigel> .. Proposal: ATTRIBUTES in VTT
<nigel> .. More or less the same as Gary's suggestion, a different name.
<nigel> .. [slide 8]
<nigel> .. Example: Dialog-Only "Subtitles"
<nigel> .. minimal usage is ATTRIBUTES block with a kind: attribute, should be the same as the HTML video track element's kind attribute.
<nigel> .. That definition in HTML is handled, but there is nothing to define VTT as subtitles outside
<nigel> .. that use case, e.g. in an MPEG container.
<nigel> .. srclang is one of the suggestions, here es-mx with a label: Español
<nigel> .. The difference between subtitles and captions being whether sound effects are included for
<nigel> .. the deaf and hard of hearing.
<nigel> .. [slide 9 (?)]
<nigel> .. Another example is descriptions aka Audio Description
<nigel> .. Using text to speech, or text to Braille.
<nigel> .. Users who cannot see the media but want to watch alongside friends and family.
<nigel> .. Or hearing viewers who do not want to disrupt their co-watchers, as Leonie Watson
<nigel> .. mentioned last year.
<nigel> .. The label in this case would be in an audio menu not a subtitle menu.
<nigel> .. [slide 10]
<nigel> .. Metadata example from before. kind: metadata.
<nigel> .. That's why we didn't choose METADATA.
<nigel> .. We introduced "type" where we'd maintain a registry, TTWG would be a good home for that.
<nigel> .. I chose video-thumbnails here, but it doesn't have an accessible label.
<nigel> .. [slide 11]
<nigel> .. JSON version, with multiple languages of alt text in different languages.
<nigel> .. This allows the previous example to be accessible.
<nigel> .. The video-thumbnails could be in the registry pointing to a spec that defines the JSON format.
<nigel> .. [slide 12] Use Case: Video Strobing
<nigel> .. If you look at the description of #511, this proposal comes further down.
<jcraig> https://github.com/w3c/webvtt/issues/512
<nigel> .. Also see #512 is the impetus for this particular discussion.
<nigel> .. Apple released a feature in the Spring called Dim Flashing Lights which is a way
<nigel> .. to mitigate flash patterns as they happen in media, for people with light based discomfort or
<nigel> .. epilepsy. We'd like to timecode the risk times with WebVTT metadata.
<nigel> .. [slide 13] "Warning..." that refers to a few seconds of flashing at an unknown point in a 2 hour movie.
<nigel> .. People tell me their partner has to watch ahead to find the flashing section so they can skip over it on second watch.
<nigel> .. This metadata exists, but we'd like to push it forward to viewers.
<nigel> .. I have a small video which I'll show that has some flashing in it.
<nigel> .. [slide 14] genuine warning
<nigel> .. If you're sensitive, cover the bottom left portion of the screen.
<nigel> .. [slide 15, shows video]
<nigel> .. You can see the risk estimation is a lot lower on the right side than the left.
<nigel> .. Dim Flashing Lights is on GitHub.
<nigel> .. [slide 16] Open source links
<nigel> .. We auto-mitigate this ourselves without the need for this API by looking ahead in the frame buffer.
<nigel> .. We can't do it on 3rd party hardware though, e.g. AppleTV+ on 3rd party machines where we don't
<nigel> .. have access to the lower level GP level frameworks.
<nigel> .. In addition to the mitigation there are other user level features, like allowing the user to skip
<nigel> .. the sections they don't want. We have shipped something similar for HLS but would like it on the web
<nigel> .. and more standardised in VTT.
<nigel> .. [slide 17] Example metadata for flashing
<nigel> .. We have type: video-strobing which would point to a registry
<nigel> .. We also have intensity, flash type and algorithm
<nigel> .. There are 3 types of flashing, general, spatial pattern or red. They're all listed in the WCAG.
<nigel> .. Our algorithm can identify general flash.
<nigel> .. This example has us intermingling our algorithm with others.
<nigel> Nigel: Is your idea that one VTT file identifies flashes discovered by multiple algorithms?
<nigel> jcraig: Potentially yes
<nigel> Nigel: And why does the user care about the algorithm?
<nigel> jcraig: The user probably would not, but it might be useful in choosing behaviour to work around the flash.
<nigel> eric_carlson: The user agent can use the type of flashing to work out whether to skip or mitigate a different way.
<nigel> Nigel: That's type rather than algorithm?
<nigel> jcraig: Ideally this intensity value should be agreed on and testable.
<nigel> .. In some cases different algorithms give different intensity levels for the same flashing.
<nigel> .. The goal of listing the algorithm is to reconcile where that number came from.
<nigel> .. Ideally if we thought it was 90% no matter which one it came from, but if they have very different
<nigel> .. values we might actually trust the worst case.
<nigel> Evan_Liu: Is this part of the proposal?
<nigel> jcraig: The proposal today is for the ATTRIBUTES block and this is our first use case.
<nigel> .. [Slide 18]
<nigel> .. Other potential Use cases
<nigel> .. Physical sensitivity warnings, content trigger warnings
<nigel> .. Motion induced vertigo, jump scares etc.
<nigel> .. I've seen these warnings in video games as well, e.g. upcoming violence or suicide.
<nigel> .. We could mark up a bunch of different things depending on what the video or game industry wants.
<nigel> .. [slide 19] Questions?
<nigel> .. That's all of the proposal today.
<nigel> .. We're proposing the ATTRIBUTES block.
<atai> q+
<nigel> .. If that's useful I can help with the PR, and Eric could write an implementation.
<nigel> .. We could use this flash pattern as the first point outward from the registry.
<nigel> Evan_Liu: For your video scrubbing example, it would be up to the content provider to decide when to apply these mitigations?
<nigel> jcraig: I can't speak for all, but typically for Apple's media library, the content provider just
<nigel> .. submits Yes/No for sensitivity. The goal would be to add this as another asset alongside ingest
<nigel> .. into Apple's library.
<nigel> .. I don't know how others do it.
<nigel> .. Apple produces content too so we'd do it too.
<nigel> q+ atai
<nigel> ack at
<nigel> Andreas: I would support both parts of this.
<nigel> .. First question/comment: I think it's necessary because WebVTT is used outside the web context,
<nigel> .. so it's really a requirement to have this.
<nigel> .. For the ATTRIBUTE block you are adding a new attribute, "type". Should this also be added back
<nigel> .. to the <track> element, which has media data but no type.
<nigel> eric_carlson: A further question: If we do have this, we need to define the processing rules for
<nigel> .. a UA if the kind or type attributes in the HTML and the WebVTT document don't agree.
<nigel> .. If I know there's an ATTRIBUTES block do I even need the attributes in the track element?
<nigel> .. The only benefit to having the attributes on the track element is if someone is looking at the source to the web page.
<nigel> Nigel: Doesn't that force the UA to download all the track element resources on page load to work out what to do?
<nigel> eric_carlson: That's a very good point
<nigel> .. Right now the mode by default is disabled which means nothing is loaded, but you still need
<nigel> .. to construct all the UI
<nigel> gkatsev: I think that answers what to do if there are conflicting track attributes.
<nigel> eric_carlson: Yes, the track element should always win
<nigel> jcraig: That happens in SVG images with an img element where both have a label, the local one wins.
<atai> q+
<nigel> gkatsev: I think we didn't really answer if the idea is that we want a new type attribute in the HTML
<nigel> ack at
<nigel> atai: I wanted to add that - do we need attributes at all in the VTT - they can be used for other formats.
<nigel> eric_carlson: I think we do. Maybe type is too generic a name for it.
<nigel> .. As we proposed it here it is metadata type
<nigel> gkatsev: It probably is better for the name in the VTT file to have a similar name to the HTML attribute
<nigel> .. but it doesn't actually have to be.
<nigel> eric_carlson: It will confuse people if they're different
<nigel> gkatsev: The bikeshed name is metadata_type.
<nigel> eric_carlson: That makes sense
<nigel> jcraig: If that's the case we would have the same metadata_type name in both places.
<nigel> .. kind: metadata
<nigel> .. metadata_type: video_strobing
<nigel> Evan_Liu: An enumerated list?
<atai> q+
<nigel> jcraig: A registry of some kind, can be an informative resource.
<nigel> .. Some table that reserves a type value and has a pointer to where it is defined.
<nigel> ack at
<nigel> atai: In TTML, possibly in HTML too, you can add private values if you prefix with x-...
<nigel> eric_carlson: Yes it would be a good idea to have a rule for "official" types e.g. it cannot start
<nigel> .. with x- or whatever, or to override, then it must start with something like data-
<nigel> .. Having one place where people can go to find out how to author is a good idea.
<nigel> jcraig: I like the idea of supporting prototyping and then adding to the registry
<nigel> gkatsev: Would the registry point to a note or a spec, or just reserve the name
<nigel> eric_carlson: It should point to a spec or a note
<nigel> jcraig: I think it should point to a spec
<atai> q+
<nigel> Nigel: Typically with a Registry we need to define the rules for adding values, so we could require
<nigel> .. a pointer to a document with a URL, and whatever other changes.
<nigel> gkatsev: As long as this group is not responsible for writing those documents.
<nigel> ack at
<nigel> Andreas: This new block is for defining new attributes, and I can see people using it.
<nigel> .. Would you allow others to add their own new attributes to the block?
<nigel> eric_carlson: As long as we define the parsing rules in the same way as VTT does now,
<nigel> .. like take the first word, look in the list, if it's not there skip to the next line etc.
<nigel> .. Or an older user agent would skip things it didn't know about.
<nigel> gkatsev: That's probably fine. My main concern is if someone uses some attribute that we want to add
<nigel> .. then the value might break.
<nigel> jcraig: That's why we have an interop problem in HTML now.
<nigel> Nigel: Question about syntax. What level of complexity do we need to plan to support.
<nigel> .. How did you get to that key: value syntax?
<nigel> jcraig: I just thought most folk would want key value pairs, we could use javascript
<nigel> eric_carlson: Most of the time we aren't going to need multiple lines
<nigel> jcraig: In an attributes block I can't think of one
<nigel> eric_carlson: You might want something like alt text that describes the content of the file
<nigel> .. you could have a different block for that though
<nigel> jcraig: That's not defined by this spec?
<nigel> eric_carlson: Or a licence
<nigel> Andreas: In attributes you cannot put multiple lines
<nigel> Nigel: You can put escaped new line characters in
<nigel> Andreas: Do we need that here?
<nigel> Nigel: That's the question
<nigel> gkatsev: You don't want multi-line attribute values
<nigel> Nigel: alt text is the classic example
<atai> q+
<nigel> jcraig: Like you said, you can use \n
<nigel> Nigel: That's a question for the syntax specification, do we want to escape new lines
<nigel> gkatsev: Does WebVTT support that now?
<nigel> eric_carlson: Not right now I don't think it does
<nigel> jcraig: One of the goals is to not break existing parsers.
<nigel> eric_carlson: You can use the existing parser for that as long as you don't have a blank line within a block
<nigel> ack atai
<nigel> Andreas: One important question is if this new addition should be in the Rec track version of the document.
<nigel> .. Is this a proposal to add to the CR or the next upcoming stable version?
<nigel> gkatsev: One thing I've been for a while trying to do is to try and get just what is currently
<nigel> .. implemented in VTT out as Rec. Adding new features would make that harder.
<nigel> .. That said, it doesn't mean we can't work on the feature.
<nigel> .. It might be worth adding it anyway before Rec.
<nigel> .. For me, and I should check how to do it, there's the evergreen spec stuff where small features
<nigel> .. can be added into Rec more easily and this seems like a good candidate for that.
<nigel> eric_carlson: I agree, it would be a shame to postpone Rec any more than we have to.
<nigel> .. Not having it in v1 of the spec isn't going to keep us from adding that to the feature set.
<nigel> jcraig: It can be in the ED or WD as soon as the group thinks it's a good idea.
<nigel> .. As soon as everyone who has a stake in that thinks it is ready to go we can take a flag off it.
<cyril> RRSAgent, pointer
<nigel> Andreas: The registry can be created anyway.
<RRSAgent> See https://www.w3.org/2023/09/12-tt-irc#T13-28-00-1
<nigel> atsushi: I think we decided to bring the spec solely into TTWG? I haven't seen anything on that for a few years.
<nigel> .. Mostly the spec was developed under CG.
<nigel> Nigel: I'm puzzled by the question
<nigel> gkatsev: The CG still exists but no one is involved in advancing it from that side.
<nigel> .. Maybe we should close the CG?
<nigel> atsushi: When I looked a few years ago there was an objection to closing the CG.
<nigel> atai: CGs can only work on CG drafts, so only we can work on the Rec track document
<nigel> atsushi: That is one of my naive questions. When I tried to update the repo to be just TTWG I believe
<nigel> .. someone objected.
<nigel> gkatsev: I haven't heard any objections
<nigel> SUMMARY: Strong support for this new ATTRIBUTE block but we probably don't want this to hold up the current version of WebVTT from progressing to Rec
<nigel> Evan_Liu: If the UA does not support these new features does it need to parse the new block?
<nigel> eric_carlson: According to the current parsing algorithm it will not break anything
<nigel> jcraig: We'll work on a PR. Thank you.
<nigel> gkatsev: Thanks.
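
One way to read the conflict rule from the log above ("the track element should always win") is as a simple merge in which values from the HTML `<track>` element override values from the in-file ATTRIBUTES block. The sketch below assumes both sources have already been reduced to plain dictionaries; the function name is hypothetical and no processing rules have been specified yet.

```python
def effective_track_attributes(track_element_attrs, vtt_block_attrs):
    """Merge attributes, letting the <track> element override the file."""
    merged = dict(vtt_block_attrs)
    for key, value in track_element_attrs.items():
        # An attribute absent from the element (None) does not override.
        if value is not None:
            merged[key] = value
    return merged


print(effective_track_attributes(
    {"kind": "captions", "srclang": None},
    {"kind": "subtitles", "srclang": "es-mx"},
))
```

Outside HTML there is no track element, so the first argument would be empty and the in-file values would apply as written.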

cookiecrook commented 11 months ago

@silviapfeiffer Making sure you saw this since we missed you at TPAC. If you have any feedback, please share. Also, I plan to work with @eric-carlson on a VTT PR soon, unless you'd prefer to author it.

cookiecrook commented 2 months ago

PR is ready for review.

#523
