Add Support for Querying HDR Decode and Render Capabilities

vi-dot-cpp commented 5 years ago

This is part 1, which covers decoding and rendering, of the HDR two-part series. Part 2 (#119) covers display.

Modern day scenarios, based on data & partner asks we have analyzed, are increasingly requiring the need of HDR capability detection in v1. We let the following design considerations guide this proposal:

Separate decoding & rendering capabilities (MediaCapabilities) and display capabilities (Screen). Relevant threads/comments: [1][2][3][4][5][6]
Bucket vs. granularity for HDR Capabilities. See: $todo in explainer.md#HDR
Distinguish graphics and video capabilities #25
Limit finger printing

We propose the following changes to MediaCapabilities. These changes will be complemented by changes to Screen in the aforementioned linked issue.

Add a bucketized HdrCapability enum to VideoConfiguration in similar fashion to Android’s HdrCapabilities.
- A bucketed approach solves nuanced granular properties like EOTF, color depth, and color gamut [#10][ #110-comment]
- A bucketed approach also addresses the limitation that granular aspects like frame metadata are not standardized
- There are environments that currently support playing HDR content on certain SDR hardware for a pseudo-HDR experience
  1. Define HdrCapability Enum
  
  Shared in Screen and MediaCapabilities:
```
enum HdrCapability {
“HDR10”,  
“HDR10Plus”,  
“DolbyVision”,  
“HLG”; 
}; 
```
  2. Add HdrCapability Enum to VideoConfiguration
```
dictionary VideoConfiguration { 
… 
HdrCapability hdrCapability; 
}; 
```
  Team: @scottlow @gurpreetv @isuru-c-p @vi-dot-cpp from Microsoft

chcunningham commented 5 years ago

I dig it! I'm not a color expert, but I think this seems sane and does a great job of addressing the pitfalls of previous proposals. Kudos to you all for diligence.

Re: color expertise, we'll want some other folks to weigh in.

Users of this API (@mwatson2, @joeyparrish, ... ppl I don't have github handles for): does this meet your needs?
Fellow UAs (@jernoble @eric-carlson @hober @jyavenard @padenot): is this sufficiently unambiguous and implementable?

Give me a bit to collect additional Chrome feedback from folks who know the color stack.

chcunningham commented 5 years ago

Aside: If we do spec these values we may need to go the path of having a registry (similar to MSE byte-streams). Nudge, @mounirlamouri who's familiar with those reqs.

mwatson2 commented 5 years ago

Great proposal, thanks!

We have a small problem with the coupling of decoding and rendering. A video codec has no knowledge about the pixel data encoding, except the bit depth and spacial aspects. The four buckets correspond to:

PQ, BT.2020, 10bits, SMPTE ST 2086 static metadata
PQ, BT.2020, 10bits, SMPTE ST 2094 dynamic metadata (I forget which part)
PQ, BT.2020, 10bits, SMPTE ST 2094 dynamic metadata (a different part ;-)
HLG, BT.2020, 10bits

All of these could potentially work with AV-1 as well as HEVC or any other 10-bit codec (e.g. VP9). But at least the term HDR10 implies HEVC.

Now, the codecs string for some codecs (including VP9 and AV-1 [1]) can include information about Transfer Function, Color Space as well as bit depth, chroma sub-sampling, video range flag and matrix coefficients but does not include HDR dynamic metadata information. And for other codecs (e.g. HEVC) this information is not in the codec string.

I think the bucketing is fine, but the buckets should be constrained to the rendering capabilities and precisely defined in terms of Transfer Function, Color Space and metadata specification. The buckets should not carry an implication about bit depth, chroma sub-sampling, video range flag and matrix coefficients.

And then, finally, we should describe the error case where the HDR capability bucket is incompatible with the codec string.

[1] https://aomediacodec.github.io/av1-isobmff/#codecsparam

jernoble commented 5 years ago

I don't like adding vendor-specific names to specifications, so I'm hesitant to enshrine "DolbyVision" into Media Capabilities. I proposed a something similar in #110, but using transfer function, color space, and bit depth.

jernoble commented 5 years ago

I'm also concerned about conflating whether a decoder supports these HdrCapability settings and whether the display is capable of rendering the output of the decoder, hence a separate DisplayCapabilities API in #110.

chcunningham commented 5 years ago

(mwatson2) The buckets should not carry an implication about bit depth, chroma sub-sampling, video range flag and matrix coefficients.

Just to confirm, this leaves 3 parts: eotf, color gamut, and metadata? For the screen interface (#119) this would be just the first 2 (metadata handled in software)?

And then, finally, we should describe the error case where the HDR capability bucket is incompatible with the codec string.

2 routes we could go:

Today we could have a codec string provide "level" info that is technically incompatible with provided framerate, bitrate, resolution info. The spec ignores this. In Chrome we check that the codec string is valid, but we use the explicitly described fields when checking for performance/power efficiency. This is more precise (levels are sometimes large buckets). We don't cross validate (would be high effort for low return).
For EME, we explicitly validate parts of the input against other parts. See this section (ex: "If keySystemConfiguration.audioRobustness is present, audio MUST also be present."). Failure to validate triggers a TypeError.

(jernoble) I'm also concerned about conflating whether a decoder supports these HdrCapability settings and whether the display is capable of rendering the output of the decoder, hence a separate DisplayCapabilities API in #110.

I see you've found #119 ;)

vi-dot-cpp commented 5 years ago

(@mwatson2) I think the bucketing is fine, but the buckets should be constrained to the rendering capabilities and precisely defined in terms of Transfer Function, Color Space and metadata specification. The buckets should not carry an implication about bit depth, chroma sub-sampling, video range flag and matrix coefficients.

Good point. Additionally, HDR profiles like Dolby Vision could theoretically support 12-bit & 10-bit color depth [1]. We are for constraining HDR capabilities to transfer function, color gamut, and metadata. What if the HdrCapability buckets explicitly reflected these properties?

enum HdrCapability {
    “Pq_Rec2020_SmpteSt2086Static”,  // HDR10
    “Pq_Rec2020_SmpteSt2094Dynamic-40”,  // HDR10Plus
    “Pq_Rec2020_SmpteSt2094Dynamic-10”,  // DolbyVision
    “Hlg_Rec2020”;  // HLG
};

(@jernoble ) I don't like adding vendor-specific names to specifications, so I'm hesitant to enshrine "DolbyVision" into Media Capabilities.

That makes sense. Would this edited enum also address the reservation against proprietary names?

(@chcunningham ) Just to confirm, this leaves 3 parts: eotf, color gamut, and metadata? For the screen interface (#119) this would be just the first 2 (metadata handled in software)?

The display side does not technically need metadata; what do you think, though, about MediaCapabilities and Screen sharing the same HdrCapability enum for consistency?

Today we could have a codec string provide "level" info that is technically incompatible with provided framerate, bitrate, resolution info. The spec ignores this. In Chrome we check that the codec string is valid, but we use the explicitly described fields when checking for performance/power efficiency. This is more precise (levels are sometimes large buckets). We don't cross validate (would be high effort for low return).

Thanks for suggesting this route. We would like to strive for consistency.

[1] https://www.dolby.com/us/en/technologies/dolby-vision/dolby-vision-profiles-levels.pdf

jyavenard commented 5 years ago

I gather you meant Pq for DolbyVision transfer function.

There are other proposed HDR formats, in particular is SL-HDR1 and SL-HDR2.

In any case, I think splitting the capabilities between what the user-agent can handle and what can be displayed properly is the way to go.

For example, a UA using an SDR display may handle HDR content well, doing proper tone mapping etc. Preferring HDR content over SDR may still be preferred here, even if the display isn't HDR

vi-dot-cpp commented 5 years ago

(@jyavenard) There are other proposed HDR formats, in particular is SL-HDR1 and SL-HDR2.

These formats can be added to HdrCapability. Given the community's feedback, they shall be added in the following format -- [TransferFunction_ColorGamut_MetaData]. What do you think this approach?

In any case, I think splitting the capabilities between what the user-agent can handle and what can be displayed properly is the way to go.

Agreed -- #119 complements this discussion by covering the display aspect.

scottlow commented 5 years ago

@vi-dot-cpp and I chatted a bit more offline. Another approach we could take here is one similar to @jernoble's recommendation in #110:

dictionary hdrCapability {
    required ColorGamut colorGamut;
    required TransferFunction transferFunction;
    MetadataDescriptor metadata;
}

Where ColorGamut is an defined enum as follows:

enum ColorGamut {
    "srgb",
    "p3",
    "rec2020"
}

TransferFunction is an enum defined as follows:

enum TransferFunction {
    "srgb",
    "pq",
    "hlg"
}

And MetadataDescriptor is an enum defined as follows:

enum MetadataDescriptor {
    "smpteSt2086",
    "smpteSt2094-10",
    "smpteSt2094-40"
}

The MediaCapabilities spec could then define which combinations of the above enum values are valid "buckets" and we could throw a NotSupportedError exception for the rest.

chcunningham commented 5 years ago

I'd vote for the de-bucketing (separate enums for gamut, transfer, and metadata). To me its more elegant and forward looking.

It may also solve the issue of what to do for screen (no need to include metadata). This may make a case for doing away with the wrapper HdrCapability enum, flattening these new fields into VideoConfiguration directly. Then you can pick a handful for the screen API without needing a new wrapper (or a wrapper with parts that don't apply).

On a related note, these are all optional inputs (HDR is new), so we'll want to choose some sane defaults for these fields. I think srgb works for ColorGamut and TransferFunction. We probably need a "none" for the MetadataDescriptor.

Nit: consider renaming MetadataDescriptor to HdrMetadata?

vi-dot-cpp commented 5 years ago

@mwatson2 @chcunningham @jernoble @jyavenard I made a PR (#124) that reflects points brought up in this thread. I would appreciate it if you all could review it -- many thanks.

jernoble commented 5 years ago

Is this an actually useful addition to VideoConfiguration? I.e., are there any decoders that can otherwise decode the underlying frames, but are unable to meaningfully read the HDR metadata? I was under the impression that the HDR information was a container-level concept, and not a codec one. Decoders are happy to decode encoded media data, and don't really care about the interpretation of the color values emitted by the decoder; that's left to the renderer and the display hardware.

isuru-c-p commented 5 years ago

I was under the impression that the HDR information was a container-level concept, and not a codec one.

Dynamic HDR metadata is typically inserted into the compressed bitstream (e.g in HEVC, the metadata is inserted into the bitstream via SEI messages).

Decoders are happy to decode encoded media data, and don't really care about the interpretation of the color values emitted by the decoder; that's left to the renderer and the display hardware.

While this is correct, in the case of dynamic HDR metadata (and also static HDR metadata in many cases), the decoder needs to be able to parse the metadata from the compressed bitstream (in order to pass the metadata through to the renderer / display hardware).

jernoble commented 5 years ago

In that case, do we need to specify each subtype of metadata to query whether the decoder supports each individually, or would it be sufficient to add a "HDR" boolean to VideoConfiguration, and mandate that decoders which advertise support for "HDR" must be able to provide all the necessary metadata to the renderer and display hardware? In other words, could we make this an 'all-or-nothing' check?

jyavenard commented 5 years ago

Having thought further about it, I'm concerned that querying the display capabilities give too much fingerprinting abilities. After all, all that really matter as far as content is concerned is that the decoding side of things is handled properly. After all, there are still benefits to receive HDR content even with a SDR screen.

Regardless of what we add to VideoConfiguration, it appears to me that we'll never cover all cases anyway. So I kind of like a HDR bool that is all or nothing and only in VideoConfiguration.

@jernoble av1 has all the information you typically found in the container, in the frame header: colorspace, range, primaries, coefficient, transfer characteristics etc (the way all codecs should have been :))

jpiesing commented 5 years ago

I'm confused about this proposal to have an HDR boolean. How does an app provider know which content to offer the user if they don't know whether the user can consume HLG10, vanilla PQ10, PQ10 with one of the 3 variations of dynamic mapping metadata? HLG10 is the only one of these that is backwards compatible with SDR.

I'm also confused about the idea of fingerprinting using this data. If all recent Apple products support one particular set of technologies, how much fingerprinting data does that provide? If all 2019 TV sets from Samsung support one particular set of technologies and all 2019 TV sets from LG support a different set, how much fingerprinting data does this provide?

jernoble commented 5 years ago

How does an app provider know which content to offer the user if they don't know whether the user can consume HLG10, vanilla PQ10, PQ10 with one of the 3 variations of dynamic mapping metadata?

My comment was only about VideoCapabilities, which is a proxy for the decoding system, which in turn doesn’t care about PQ vs. HLG.

I'm also confused about the idea of fingerprinting using this data. If all recent Apple products support one particular set of technologies, how much fingerprinting data does that provide?

The danger with HDR comes from being able to query the abilities of the display. Even for devices with built-in screens, they can be plugged into external monitors with different capabilities. Those combinations of capabilities can be extremely unique.

GurpreetV commented 5 years ago

We want to alleviate fingerprinting concerns.

A proposal to just add a boolean for HdrCapability to both MediaCapabilities and Screen would satisfy the fingerprinting concerns without compromising on the scenarios as they appear.

To give more info, adding the same boolean to Screen would be important to give flexibility to website developers to decide if they should serve HDR content based on whether the display supports it or not. Having the ability to make this decision is important especially because there can be power (for the user agent) & network (for the content provider) implications if the content provider chose to serve HDR content even if the Screen did not support it. So as long as we give them an ability to make this decision consciously, it is good enough.

We would still want to keep the HdrMetadata for the reasons @isuru-c-p mentioned above

While this is correct, in the case of dynamic HDR metadata (and also static HDR metadata in many cases), the decoder needs to be able to parse the metadata from the compressed bitstream (in order to pass the metadata through to the renderer / display hardware).

jpiesing commented 5 years ago

The danger with HDR comes from being able to query the abilities of the display. Even for devices with built-in screens, they can be plugged into external monitors with different capabilities. Those combinations of capabilities can be extremely unique.

We want to alleviate fingerprinting concerns.

I'm sorry but I don't understand these. My employer is a large manufacturer of monitors and TVs. I asked a colleague about information carried in HDMI and what might be vulnerable. What he said was the following;

Regarding HDMI and HDR specifically, displays can expose the following: • Display primaries (RGBW x,y coordinates) • Supported colorimetry (e.g. BT.2020 RGB) • Supported HDR transfer functions (e.g. ST2084, HLG) • Display luminance (expressed as desired max, desired min, desired max frame-average for optimal rendering by the display). • Supported dynamic HDR metadata (Dolby Vision, HDR10+, SL-HDR2). • Detailed private data for Dolby Vision if supported.

The primaries used are often not the precise display panel primaries, but just those of BT.709 or DCI-P3.

TVs don’t expose luminance information to my knowledge. Some monitors do expose the luminance information, as I think it is required for the VESA DisplayHDR logo program.

Using the closest typical/generic values (e.g. min 0.05, max 1000cd/m2) for luminance instead of precise device-specific values could hamper fingerprinting.

Over HDMI, the HDR-related information in the EDID that most facilitates fingerprinting would be the Dolby Vision data. As it has to be display-specific, it is very precise.

I don't believe it was ever to expose the detailed data specific to the technology he mentions.

What am I missing?

jernoble commented 5 years ago

Generally speaking, it only requires 33 bits of entropy in order to uniquely identify a user by fingerprinting, and these bits of entropy are cumulative. So the concern is not that exposing detailed device information alone will be able to uniquely identify a user, but that in combination with all the other sources of entropy available pages will be able to do so. "Does this display support HDR or not?" is one bit of entropy [1] (out of 33 total). "Does this display support ST2084 but not HLG?" is another two. "Does this display support Dolby Vision, but not HDR10+ and SL-HDR2" is another three. "What is the display luminance?", if expressed as a floating point number, could be as many as 32 bits of entropy.

[1] This is a theoretical maximum amount of entropy. If everyone in the world gave the same answer to that question, it wouldn't really add fingerprinting risk. So it's not as useful to be able to determine that "this user is on an iPhone", which isn't very unique, as it is "this user is attached to a LG model 27UK650 external display and their brightness setting is 55".

jernoble commented 5 years ago

There's a lot more information here, Privacy WG's Fingerprinting Guidance document.

GurpreetV commented 5 years ago

Given the concerns of finger printing and given that we really don't need more than a Boolean to represent HDR support for the major scenarios, I think we no longer need to discuss having more granular data representing the device and can keep it simple. I have updated the pull request with the new proposal. @jernoble , @chcunningham , @jpiesing , @jyavenard, @mwatson2 can you see if the latest pull request seems aligned with you all. We will add a similar Boolean for HDR support to screen.

chcunningham commented 5 years ago

Help me clarify the meaning of the boolean. The latest PR update defines it as "hasHdrMetadata". I think we want to avoid having to say "we support all forms of HDR metadata". Can we reliably infer the type of metadata from the VideoConfiguration contentType? DV has its own contentType strings (e.g. codecs="dvhe.05.07"), so infer SMPTE ST2094-10. But for the other codecs they're not tightly bound to either SMPTE ST 2086 nor SMTPE ST2094-40 nor whatever future metadata spec arises. What does the boolean mean for non DV codecs?

Giving it a clear meaning is good to avoid ambiguity about future metadata formats. But also, even for known formats, UAs are likely to support just a subset. I can't predict whether Chrome will ever support DV. I also expect support for ST2094-40 to be spotty for many UAs for some time.

Re: fingerprinting, the Chrome security team's position is nuanced. Please have a read here. In short, I'm happy to consider alternatives to the buckets above, but I'm not personally worried that these APIs are meaningful additions to the required 33 bits.

jyavenard commented 5 years ago

In short, I'm happy to consider alternatives to the buckets above, but I'm not personally worried that these APIs are meaningful additions to the required 33 bits.

I think you should be, however, having said that I don't think that the decision to okay or not as far as fingerprinting goes for the inclusion of such feature in the spec should be left to a single person.

Maybe this is something we can put on the agenda for when the new media WG meet at the next TPAC.

jpiesing commented 5 years ago

Given the concerns of finger printing and given that we really don't need more than a Boolean to represent HDR support for the major scenarios,

Again I apologise if I'm missing something but please can you point me to where the major scenarios are documented and where this analysis is recorded? Thanks.

chcunningham commented 5 years ago

Maybe this is something we can put on the agenda for when the new media WG meet at the next TPAC.

Definitely (see you there). Meanwhile, lets keep discussing how a boolean would work. See my questions above; its not clear to me that its viable yet.

chcunningham commented 5 years ago

Again I apologise if I'm missing something but please can you point me to where the major scenarios are documented and where this analysis is recorded? Thanks.

I don't think we've walked through the scenarios yet ;). Could someone advocating for the boolean walk us through a few use cases? Example to get get us started: say a site has HDR10 and HLG options for a given stream and wants to select the option supported by the UA+Screen. See also my questions above.

jpiesing commented 5 years ago

Example to get get us started: say a site has HDR10 and HLG options for a given stream and wants to select the option supported by the UA+Screen.

That's an excellent starting point.

A more advanced variation on the same theme is where the site has HDR10, HLG, HDR10+ and Dolby Vision options for a given stream where the HDR10+ / Dolby Vision versions cost more than the HDR10/HLG versions.

vi-dot-cpp commented 5 years ago

For sure those are valid scenarios. What do you all think about exposing the hdrMetaData enum in VideoConfiguration for the purpose of acknowledging user agents' varying support for different metadata types? I am not an expert on fingerprinting, but if I understand correctly, the concern comes from display capabilities as opposed to decode capabilities -- correct me if I misunderstand @jernoble @jyavenard .

Regarding (video) rendering & display capabilities, are there reservations against a single bool representing all-or-nothing support for relevant transfer functions and color gamuts? Combining them into a single binary property hopefully mitigates fingerprinting concerns while still providing desired functionality to user agents and sites -- thoughts?

jyavenard commented 5 years ago

That's right, the concern is getting info related to a specific user. If it's decoding capabilities, then it can be assumed to be the same for all users of that user agent, which you can already get.

I still believe that there's a clear advantage in playing HDR content even over an SDR display.

So while a generic boolean is preferred for the display, being able to distinguish for decoding capabilities is probably okay.

I'll side with whatever Jer will say here as he's proven to better express what I intend to write :)

jernoble commented 5 years ago

Yes the question I had was whether there were valid scenarios where a UA might support, for example, decoding AV1 but only with HDR10+ metadata, and at the same time HEVC, but only with HLG (a totally contrived scenario, granted) while at the same time advertising display support for both HDR10+ and HLG. If this kind of scenario were possible, and we wanted to support it, then the hdrMetadata boolean would be insufficient, as pages who tried to decode AV1-with-HLG would inexplicably fail.

So, other implementors, is the above scenario plausible?

mounirlamouri commented 5 years ago

I will let the area experts talk about the technical details but I wanted to point out that protecting fingerprinting doesn't have to be a all or nothing scenario. We can have an API that offers ways for a UA to either get user consent (decodingInfo() is async for example) or has recommendations to anonymise the info if the UA wants to do so. I would recommend to explore what the best API is then figure out what the fingerprinting story is and how it can be mitigated.

vi-dot-cpp commented 5 years ago

@jernoble Yes the question I had was whether there were valid scenarios where a UA might support, for example, decoding AV1 but only with HDR10+ metadata, and at the same time HEVC, but only with HLG (a totally contrived scenario, granted) while at the same time advertising display support for both HDR10+ and HLG. If this kind of scenario were possible, and we wanted to support it, then the hdrMetadata boolean would be insufficient, as pages who tried to decode AV1-with-HLG would inexplicably fail.

In my last message, I suggested adding the HdrMetaData enum to VideoConfiguration, which should guard against this scenario. Sites can query decodingInfo with codec and HdrMetadataType simultaneously, and user agents are empowered to return supported/not-supported depending on the combination. Additionally, doing so should not increase fingerprinting risk as noted above.

It also hopefully address scenarios posed by @chcunningham and @jpiesing. User agents' decoders are not required to support every single HdrMetadataType, while sites have more granular information.

As the proposals may have gotten convoluted, here is the summary with updated definitions below. Pull request #124 currently reflects the updated proposal to MediaCapabilities; it will also note proposals to Screen momentarily.

Add HdrMetadataType enum to VideoConfiguration in MediaCapabilities to be queried by navigator.mediaCapabilities.decodingInfo().
- Sites can query support for different combinations of codecs and HDR metadata types.
- User agents can provide decoding support for different combinations of codecs and HDR metadata types. Note that rendering support is excluded.
[Amends #119] Add hasHdrCapabilities bool to VideoOutputConfiguration in Screen to be queried by window.screen.video.displayInfo().
- hasHdrCapabilities bool represents all-or-nothing HDR display & rendering eotf & color gamut.
- Note that hasHdrCapabilities has no implications for HDR metadata types.

As long as navigator.mediaCapabilities.decodingInfo() and window.screen.video.displayInfo() both return supported, sites should provide HDR content and user agents should play HDR content. I acknowledge that not all HDR display and rendering capabilities are being checked; in these corner cases, HDR content should be played anyway with lower fidelity. @content providers and @other implementors, please comment.

MediaCapabilities (decoding capabilities)

HdrMetadataType enum

enum HdrMetadataType {
    "smpteSt2086",
    "smpteSt2094-10",
    "smpteSt2094-40"
};

VideoConfiguration extension

dictionary VideoConfiguration {
    ...
    HdrMetadataType hdrMetadataType;
};

Screen (display & rendering capabilities) [amends #119 Section 3.1]

3.1 VideoOutputConfiguration dictionary

dictionary VideoOutputConfiguration { 
    required unsigned long width; 
    required unsigned long height; 
    required bool hasHdrCapabilities;  // represents all-or-nothing HDR eotf & color gamut
};

chcunningham commented 5 years ago

@jernoble wrote:

So, other implementors, is the above scenario plausible?

I understand the question to be: is it possible a UA will only support certain metadata+codec combinations. I think the answer from chrome is "yes", but I'd also raise a broader question: Is it possible a UA will only support certain metadata (irrespective of codec). Hopefullly metadata will continue to follow the trend @jyavenard pointed out of being container-based, so it becomes less of a codec question. Still, chrome currently only does hdr with VP9, optionally using 2086 static metadata. We'll eventually support hdr AV1 (may already, I need to double check) and perhaps additional forms of metadata, but I think the history here shows we wont want to assume UAs implicitly support all forms of HDR metadata that might pair with a supported codecs.

Let me also quote this bit from hubbe's Chrome HDR doc:

A group of companies got together and came up with an alternate standard for metadata called HDR10+ which is standardized in SMTPE ST2094-40. However, ST2094 is a very complicated standard, which seems difficult to implement efficiently. Since the standard is also relatively new, little or no content actually has this metadata available yet. The complexity of the metadata also means that it’s likely that support for it will be spotty and unpredictable.

@vi-dot-cpp, thanks for the updates. I'm still reading these. Will reply by EOW.

jpiesing commented 5 years ago

@jernoble wrote:

Yes the question I had was whether there were valid scenarios where a UA might support, for example, decoding AV1 but only with HDR10+ metadata, and at the same time HEVC, but only with HLG (a totally contrived scenario, granted) while at the same time advertising display support for both HDR10+ and HLG. If this kind of scenario were possible, and we wanted to support it, then the hdrMetadata boolean would be insufficient, as pages who tried to decode AV1-with-HLG would inexplicably fail.

So, other implementors, is the above scenario plausible?

Certainly yes in the case of an implementation where the UA delegates video decoding to hardware and the HDR decoding is integrated into each video decoder. The AV1 decoder and the HEVC decoder would likely be separate hardware blocks and could easily have different HDR capabilities depending on what's found in the real world. Monitors are more likely to support a union of all commercially relevant HDR solutions although different organisations may have different understandings of what is commercially relevant :)

jpiesing commented 5 years ago

That's right, the concern is getting info related to a specific user. If it's decoding capabilities, then it can be assumed to be the same for all users of that user agent, which you can already get.

See my previous comment - what about UAs that delegate video decoding to hardware?

I still believe that there's a clear advantage in playing HDR content even over an SDR display.

Have you discussed this with content people? Based on previous experience, I expect they would prefer to deliver SDR content to SDR displays rather than trust a random system component to translate HDR into SDR.

So while a generic boolean is preferred for the display, being able to distinguish for decoding capabilities is probably okay.

Why? In theory, an HDMI monitor may support HLG, PQ10 without any of the ST2094 variations and PQ10 with any of the ST2094 variations.

gregwfreedman commented 5 years ago

I still believe that there's a clear advantage in playing HDR content even over an SDR display.

at Netflix, almost all HDR content is manually trimmed by content creators to generate SDR streams. and for some content this is done for every scene, so we definitely don't want to stream HDR content to an SDR display and let the UA do it.

chcunningham commented 5 years ago

In reply to @vi-dot-cpp above, I think this does a good job of accounting for the discussion so far. I also think the discussion lost track of eotf and color gamut. Let me try to bring those back by comparing two UAs: Chrome and Chromecast (warning, my knowledge is approximate ;)).

Chrome: For now Chrome only supports HDR (VP9 profile 2) on Windows. Chrome's color management code runs the transfer function and ultimately delivers the frame to Windows represented as linear half-floats (some discussion in this doc). This transformation leverages the GPU for some heavy lifting. Chrome's color management code supports most transfer functions and color gamuts. Windows relies on the display drivers to convert the half-floats to something the display can use (doc above mentions PQ supported, but not HLG - doc is old, so not sure where things stand today).

Cast: The hardware inside the cast dongle is very limited, meaning it cannot perform the transformations described above. It can decode the content, but must then deliver it to the TV. If the content is HLG and the TV is PQ, things don't go so well. So for cast, it makes sense to advertise support for only the eotfs available on the TV.

So Chrome can get away with not asking about eotf, but Cast definitely can't. Reading comments from @jpiesing , I understand limited eotf support is common for TV UA implementations in general.

What you can also see from the above is that the line between display vs rendering capability can be fuzzy (or at least scenario specific).

Here's a strawman for something that might work:

MC.decodingInfo(): add the buckets from scott's earlier comment (ColorGamut, TransferFunction, Metadata)
- Screen: add the single hasHdrCapabilities bool you mentioned above.

In Chrome: decodingInfo() will say "supported=true" for any combination of ColorGamut and TransferFunction. We'll say supported=false if you throw in a metadata. But Chrome users seldom have an HDR display, so it remains important that sites check the screen bool to determine what stream to send. We'll have to figure out what criteria make the bool true/false... another day.

In Cast: decodingInfo() will say "supported=true" only for the ColorGamut, TransferFunction, and Metadata supported by the attached screen. The hasHdrCapabilities capabilities boolean will return true for hdr screens. The boolean is redundant here... screens that support PQ are HDR screens, but that's OK (the bool proves more useful for the Chrome case).

This continues to limit what we expose about the screen (the primary fingerprinting concern above), and is hopefully more complete (please scrutinize thoroughly).

Also, some revisions to things I said earlier...

Still, chrome currently only does hdr with VP9, optionally using 2086 static metadata.

Correction: For now Chrome ignores all metadata. I was confusing Chrome vs YouTube, which appears to require metadata on its HDR uploads.

We'll eventually support hdr AV1 (may already, I need to double check)...

In theory the code is all there, but I don't have a stream/monitor to test it. We still aren't doing anything with the metadata here.

vi-dot-cpp commented 5 years ago

@chcunningham Thanks for the thorough analysis and proposal. I am looking this and will reply EOW.

vi-dot-cpp commented 5 years ago

LGTM. I will update the PR in a few days; maybe folks will have had a chance to review and chime in by then.

Just a thought -- would it make sense for MC.decodingInfo() to display capabilities even for non-Cast scenarios? In which case, Screen's hasHdrCapabilities boolean is no longer necessary.

Regarding finger-printing, I want to echo the TAG's guidance in https://www.w3.org/2001/tag/doc/unsanctioned-tracking/ :

[the TAG b]elieves that, because combatting fingerprinting is difficult, new Web specifications should take reasonable measures to avoid adding unneeded fingerprinting surface area. However, added surface area should not be a primary factor in determining whether to add a new feature.

And @mounirlamouri's recommendation:

I will let the area experts talk about the technical details but I wanted to point out that protecting fingerprinting doesn't have to be a all or nothing scenario. We can have an API that offers ways for a UA to either get user consent (decodingInfo() is async for example) or has recommendations to anonymise the info if the UA wants to do so. I would recommend to explore what the best API is then figure out what the fingerprinting story is and how it can be mitigated.

If the current proposal is technically optimal but needs additional protection against fingerprinting, How about we couple the API with user consent (in similar fashion to getUserMedia() or xr.requestSession()) and/or anti-fingerprinting browsing mode (#48).

jpiesing commented 5 years ago

With apologies for the length, here is a very simplified description of how we see this complex problem space.

Short version

We believe the key problem to be solved is giving web video content providers the information needed to identify which version of a particular piece of content will give the best results for the particular combination of UA + hardware platform + screen someone is using. In practice this means enabling the web video content provider to choose between some subset of the following versions;

BT.709 SDR
BT.2020 SDR
BT.2020 HDR HLG (the same content can also cover BT.2020 SDR)
BT.2020 HDR PQ10 without any of the ST 2094 technologies
BT.2020 HDR PQ10 with ST 2094-10
BT.2020 HDR PQ10 with ST 2094-40 (the same content may also cover PQ10 without any ST 2094)

Not all of these will be commercially relevant for all content providers . Some combinations of UA + hardware platform + screen will give good results for more than one of these & some choices may be subjective.

Allocating ST-2094 to either the decoder/renderer or to the screen seems a little artificial as ST2094-10 and ST2094-40 are somewhat divided between the two. The real issue is which of these can the UA accept regardless of where the processing actually happens.

Where the capabilities of the screen do matter is when compositing graphics with HDR video. That is an even larger problem.

Long version

An illustration of the primary screen characteristics can be found in the requirements of the UHD Alliance "UHD Premium" logo (see https://www.experienceuhd.com/uhd-premium-features). Note that the requirements it contains are aimed at high end consumer displays and mass market consumer displays will not meet the specific values. 3 key properties are identified.

1) Colour space - screens can be BT.709 or BT.2020. P3 is essentially a subset of 2020 - UHD Premium requires "BT2020 color representation with display reproduction of more than 90% of P3 colors." 2) Peak brightness and black level (measured in nits) - UHD Premium requires from 0.05 to at least 1,000 ‘nits’ (LCD panels) or from 0.0005 to at least 540 ‘nits’ (OLED panels). Professional displays can go as high as 4000 nits. 3) Bit depth - UHD Premium requires at least 10 bits.

A fourth property not mentioned by UHD Premium is the transfer function.

Colour Space

For colour space, it's possible for the UA+hardware platform to convert from BT.2020 to BT.709 and vice versa. 2020 to 709 conversion is believed to need a lot more care & has a reputation for being done badly. There are multiple methods of mapping 2020 to 709 with different trade-offs (such as those defined in ITU-R BT.2407). We believe many content providers want to deliver content that avoids the UA+hardware doing 709 to 2020 or 2020 to 709 mapping. In the case of HDMI, the UA+hardware may not know what the native colour space of the display is. HDMI displays typically report all colour spaces they can support without any indication of priority, preference or quality. A UHD Premium display (see above) and a very cheap display (supporting a much smaller subset of the whole BT.2020 colour space) may appear the same to devices connected via HMDI.

Transfer Function

For BT.709 there is in practice only one transfer function – often called “traditional gamma”. For BT.2020 there are 3 possible transfer functions – “traditional gamma” (i.e. SDR), HLG and PQ10.

HLG is backwards compatible to “traditional gamma” – that is its main reason for existing. PQ10 is not.

Brightness

For brightness, mapping between brightness in the content and the light output by the panel is normally done in the screen. Screen manufacturers see this as something that they can use to differentiate their products and the maximum brightness of the panel & colour gamut would not normally be available to the UA+hw let alone the app. We have not seen any evidence of content providers offering different versions of HDR content via the same distribution channel with 500 nit and 1000 nit. HDR content is typically produced optimised for 1000 nit or 4000 nit.

The transfer function is one part of this mapping. There are however additional technologies to optimise this – the various parts of ST 2094.

In theory it is possible to have a single piece of PQ10 content that works OK on a UA+hardware+display supporting PQ10 (without any ST2094) and just works better if either 2094-10 or 2094-40 is supported. Unfortunately it seems the real world doesn't work that way. There are variations of 2094-10 that don't work OK on UA+hardware+display that don't support 2094-10. Content supporting both 2094-10 and 2094-40 in the same bitstream seems to be very unusual. It's not clear to us if this is something that will change over time or the consequences of some practical issues in content production.

There are also other parts of ST 2094 but these have less market adoption than -10 and -40.

Bit depth

Screens will handle adapting bit depth to the depth of the screen. For a 10-bit panel, the screen will scale 8-bit content and (for HDMI) 12-bit content to 10 bits. We have not seen any evidence of content providers offering different versions of HDR content via the same distribution channel with different bit depths.

Conclusion

There are a number of possible shapes for the API.

3 properties
- Colour space – BT.709 and/or BT.2020/P3 - devices with an integrated display may be able to report accurately one of these. HDMI devices will have to report more than one.
- Transfer function – “traditional gamma” vs HLG vs PQ10 - UA+hardware+display may support more than one of these.
- Dynamic metadata – none vs ST 2094-10 (aka Dolby Vision) vs ST 2094-40 (aka HDR10+) - UA+hardware+display may support more than one of these.

These would appear to result in a sparse matrix of 2x3x3 = 18 combinations of which 6 exist in the real world.

A single property which returns an array of enums where each enum represents a single combination of native colour space, transfer function and dynamic metadata.

We have no strong preference between these except to note that the single property seems simpler and gives the appearance of less providing less data for fingerprinting.

What does seem strange to us is separating rendering from display. It doesn’t make a difference to web video content providers. In the case of HDMI or cast, dynamic metadata support is distributed between renderer or display in implementation-specific ways. The division between renderer and display that may make some sense in the case of something connected to a display over HDMI makes less sense in the case of something with an integrated display.

mwatson2 commented 5 years ago

One possible source of confusion here is that there are three components, the capabilities of which we are interested in, whereas we have been focussing on two discovery mechanisms (decodingInfo() and screen). decodingInfo() suggests actual decoding capabilities (codec), which does not include things like Color Gamut, Transfer Function and Metadata. But these do not belong on screen capabilities either.

Between the codec and the screen, as @chcunningham describes, is a component which may or may not be able to map between a given codec output format and a format supported by the display. We have cases where this component is fully functioned (e.g. a desktop OS that runs everything through a canonical half-float linear conversion space) and where is it limited (e.g. TV SoC with limited conversion capabilities).

I would call this component a "renderer" - and will do in the rest of this post - but there might not be consensus that is a good term to use in the spec.

What the site needs to know is two things:

whether it is possible for the decoder, renderer and current display to decode, render and display a given format
if several formats are supported by the decoder, renderer and current display, which one to use ?

I do think a boolean is sufficient for the second of these, as I can think of only one presently unsupported case where the answer to this question is not the 'highest quality' (in the opinion of the site) format choice. This is the case of an SDR display. In that case, if I have an SDR stream available, I would prefer to deliver that, since it has been explicitly graded for SDR displays.

So, this is a long way of saying that the proposal for a boolean on screen.video, with all the other properties included in the VideoConfiguration seems reasonable, except that decodingInfo is not the best name for the method any more: the scope is wider than decoding, encompassing the combination of decoding, rendering and display capabilities.

jernoble commented 5 years ago

I mentioned in a telecon the Privacy Interest Group's Mitigating Browser Fingerprinting in Web Specifications guidance document which lays out best practices for evaluating the severity of fingerprinting, as well as discusses possible mitigations for fingerprinting. With that in mind:

Fingerprinting type: Active. Requires executing JavaScript on the local client.
Fingerprinting surface:
- Entropy: adding at least three new properties to VideoDecodingInfo has the potential to add a great number of bits of entropy. However, for UAs whose decoders are implemented in software and therefore whose capabilities are fixed across all users and devices using that UA, this adds no effective additional bits of entropy beyond the UA version found in the User-Agent string. Similarly, for UAs whose decoders are implemented by the platform, but run on a platform whose decoding capabilities are fixed, this adds no effective additional bits of entropy beyond the platform also found in the User-Agent string. For those UAs whose decoding capabilities are tied to a large number of diverse platforms, this could add a large amount of effective bits of entropy.
- Detectability: Because this feature requires directly calling a Web API, it is very detectable by UAs, and does not represent an increased risk of passive fingerprinting.
- Availability: This feature is not currently behind any permission prompt, implicit or explicit. As such, it is available to all web pages.
- Scope: This feature is available across all scopes and origins, and returns consistent information across all origins.

jernoble commented 5 years ago

In the Mitigations section of that document, the Privacy Interest Group suggests three best practices:

Best Practice 1: Avoid unnecessary or severe increases to fingerprinting surface, especially for passive fingerprinting.
- Since the API already returns only a single boolean per set of input, this best practice seems to have already been considered.
Best Practice 2: Best Practice 2: Narrow the scope and availability of a feature with fingerprinting surface to what is functionally necessary.
- One possible mitigation for an increased fingerprinting surface would be for UAs to limit the use of this API to top-level browsing contexts.
- Another possibility is to limit the entropy by exposing the same capabilities for all devices and installations of the UA; this would be relatively easy for UAs whose capabilities are entirely within their control, as above, but relatively difficult for UAs whose capabilities are outside their control.
Best Practice 3: Mark features that contribute to fingerprintability.
- This is entirely a spec author mitigation: the spec should indicate that there is a potential contribution to the fingerprinting surface for decodingInfo().

vi-dot-cpp commented 5 years ago

@jernoble thanks for helping with the fingerprinting analysis (this will be a great example going forward as well).

Regarding next steps:

Best practice 2a (limit to top-level browsing contexts) and 3a (spec indicates potential fingerprintability) are straightforward to add to the PR.
What do you recommend in terms of addressing 2b? Could this be a piece of guidance in the spec, and the implementation would be up to the UA?

gregwhitworth commented 4 years ago

Another possibility is to limit the entropy by exposing the same capabilities for all devices and installations of the UA; this would be relatively easy for UAs whose capabilities are entirely within their control, as above, but relatively difficult for UAs whose capabilities are outside their control.

@vi-dot-cpp and I had a chance to sit down and discuss this. We don't understand how this could work because if you're always returning the same results you're removing the point of the API, unless we're misunderstanding what you mean here.

With regards to the others, we agree that limiting to the same origin will help avoid advertisements or trackers being able to obtain this information.

@jernoble @chcunningham @mwatson2 @gregwfreedman @jyavenard What are your thoughts, to summarize HDR detection will be:

It needs to be query based in that it only returns the single bool per input set
Restrict it to same origin scripts only
Normative spec prose regarding fingerprint impact

Any objections to us adding in those spec changes and resolving on this issue?

jernoble commented 4 years ago

@gregwhitworth said:

@vi-dot-cpp and I had a chance to sit down and discuss this. We don't understand how this could work because if you're always returning the same results you're removing the point of the API, unless we're misunderstanding what you mean here.

Sorry if this was confusing! What I meant by this mitigation was that a given UA could always return the same values given the same set of inputs across all users & platforms, not that a given UA will always return the same values across disparate set of inputs.

mwatson2 commented 4 years ago

I'm not sure what is meant by limiting to "same origin" as distinct from limiting to "top level browsing context" which is what @jernoble proposed. I'm also not sure if that is necessary, or, rather, shouldn't there be a consistent policy for all kinds of feature detection as to whether it can be done in non top-level contexts or not ?

As a general rule, it's obvious that if the user wants to have their experience tailored to their device capabilities then they are going to need to expose this information to the site. For users who prefer not to expose capability information they will necessarily get a least-common-denominator experience which uses only a standard set of capabilities available (almost) everywhere. I thought that the way users express this choice is through choosing a "private" browsing mode in which the browser exposes only that common set of capabilities.

So, then, we should not be concerned about providing capability discovery outside this mode, except to be sure we don't expose more information than strictly necessary (per the guidelines).

It would seem to me valuable for different browsers' private browsing modes to expose the same sets of capability values, except that there are still UA strings ...

jernoble commented 4 years ago

Slightly OT: @mwatson2, it's a (common) misconception that Private Browsing Mode is meant to limit the amount of information websites store about the user; at least in the case of Safari, Private Browsing mode is meant to limit the amount of information the device itself stores about the user. It sometimes has the side effect of hiding information from websites by (e.g.) not reading or writing cookies from disk, but that's not (again, in Safari's case) the 1st order intent.

That said, I believe Firefox has a "Reduced Fingerprinting" feature which may expose–as you say–only a common set of capabilities.

w3c / media-capabilities