Add Support for Querying HDR Decode and Render Capabilities

vi-dot-cpp commented 5 years ago

This is part 1, which covers decoding and rendering, of the HDR two-part series. Part 2 (#119) covers display.

Modern day scenarios, based on data & partner asks we have analyzed, are increasingly requiring the need of HDR capability detection in v1. We let the following design considerations guide this proposal:

Separate decoding & rendering capabilities (MediaCapabilities) and display capabilities (Screen). Relevant threads/comments: [1][2][3][4][5][6]
Bucket vs. granularity for HDR Capabilities. See: $todo in explainer.md#HDR
Distinguish graphics and video capabilities #25
Limit finger printing

We propose the following changes to MediaCapabilities. These changes will be complemented by changes to Screen in the aforementioned linked issue.

Add a bucketized HdrCapability enum to VideoConfiguration in similar fashion to Android’s HdrCapabilities.
- A bucketed approach solves nuanced granular properties like EOTF, color depth, and color gamut [#10][ #110-comment]
- A bucketed approach also addresses the limitation that granular aspects like frame metadata are not standardized
- There are environments that currently support playing HDR content on certain SDR hardware for a pseudo-HDR experience
  1. Define HdrCapability Enum
  
  Shared in Screen and MediaCapabilities:
```
enum HdrCapability {
“HDR10”,  
“HDR10Plus”,  
“DolbyVision”,  
“HLG”; 
}; 
```
  2. Add HdrCapability Enum to VideoConfiguration
```
dictionary VideoConfiguration { 
… 
HdrCapability hdrCapability; 
}; 
```
  Team: @scottlow @gurpreetv @isuru-c-p @vi-dot-cpp from Microsoft

gregwhitworth commented 5 years ago

@mwatson2 it wasn't meant to be different, I was agreeing - sorry for the confusion.

Based on the discussion however, do we have agreement that if we limit the feature to:

It needs to be query based in that it only returns the single bool per input set
Restrict it to the top level browsing context
Normative spec prose regarding fingerprint impact

We've allowed both the sites to provide a richer experience and are doing best efforts to avoid passive finger printing? If that is the case, we'll make a PR as such to adjust the spec where necessary to close on this issue.

With regards to the 'in-private' being a limitation on finger-printing I like that idea.

mwatson2 commented 5 years ago

@jernoble Thanks for the clarification. Let me put my point another way: whether or not to hide capability information - by exposing a common lowest-common-denominator set of capabilities or disallowing some class of capability discovery - is a browser decision, not a spec decision. This is because hiding the capability information changes the user experience as well as limiting fingerprinting.

How the browser makes this decision is browser-specific, depending on the browsing modes it offers and what those modes promise to users. A browser that was intended to offer privacy above all else might never expose capability information.

So, this is to ask whether the question of exposing capability information to non-top-level browsing contexts is really one we should make in the spec or should be left to browsers ?

jernoble commented 5 years ago

@mwatson2 said:

So, this is to ask whether the question of exposing capability information to non-top-level browsing contexts is really one we should make in the spec or should be left to browsers ?

I believe the intent of at least some of these mitigations is to allow UAs to do the mitigating. For example, see Best Practice 7 in the mitigation document: "Enable graceful degradation for privacy-conscious users or implementers." Degrading the output of decodeInfo() in non-top-level browsing contexts could be a UA mitigation.

That said, I think it's worth noting these as potential mitigations in the spec itself, but as a non-normative note, rather than a normative section.

chcunningham commented 5 years ago

That said, I think it's worth noting these as potential mitigations in the spec itself, but as a non-normative note, rather than a normative section.

Agree with @mwatson2 and @jernoble - I prefer not to formally require a particular mitigation. New/improved mitigations will arise and each UA will do it differently. For ex, the latest thinking in Chrome-land is to use a "privacy budget" that throttles/blocks calls to the API above a certain threshold (distinguish fingerprinting from legitimate use).

It needs to be query based in that it only returns the single bool per input set

Normative spec prose regarding fingerprint impact

Do these remaining points imply a change to the spec/PR (vs just forming points of agreement)? IIUC, #1 is already true. We have a nod to #2 here - @jernoble do you think this should be amended (e.g. more complete description of the fingerprinting surface)?

Switching gears for a sec, I want to return to some discussion of the colorGamut property that came up near the end of our recent meeting. Quick summary:

colorGamut's values are borrowed from CSS media queries,
CSS is describing the attached screen, whereas we mean to describe what colors the UA's videostack understands (can "render")
VP9 and AV1 codec strings describe colors using color primaries, color matrix, and eotf. They do so using the values from ISO/IEC 23001-8:2016 (free).

Picking back up with new info/questions

@mwatson2 was part of the vp9 codec string discussion (recommended reading) and supported using those "code points"
on the call there was some discussion about whether support for a matrix could be inferred by support for primaries. I'm not savvy enough to say what the gotchas might be. I note @mwatson2 advocated to include the matrix in the vp9 string. Please discuss :)
big picture: I'm getting the sense that ISO_IEC_23001-8_2016 is a preferred language for describing video color. Would folks be supportive of a reference on that spec for eotf, color primary, and perhaps matrix coefficient enums?

gregwhitworth commented 5 years ago

Do these remaining points imply a change to the spec/PR (vs just forming points of agreement)? IIUC, #1 is already true. We have a nod to #2 here - @jernoble do you think this should be amended (e.g. more complete description of the fingerprinting surface)?

Yep, was just trying to get a clear resolution on it all so we can put a wrap on this issue. Let's add number 2 to the PR. Regarding colorGamut, I think it would be best to keep this discussion solely to the HDR fingerprinting issue and since it seems like we're gaining consensus on adding the spec prose/top level browsing context; let's resolve and close on this issue. I've opened a separate issue to flush out the issues/questions you've outlined for colorGamut in #130

@vi-dot-cpp can you add the following to the PR:

spec text regarding the fingerprinting surface of this. And based on the feedback from @chcunningham @jernoble and @mwatson2 avoid outlining any mitigations.
Restrict it to the top level browsing context

Thanks for the quick responses and feedback.

chcunningham commented 5 years ago

Restrict it to the top level browsing context

This is also a mitigation. Please don't add this to the PR.

I think it would be best to keep this discussion solely to the HDR fingerprinting issue and since it seems like we're gaining consensus on adding the spec prose/top level browsing context;

@gregwhitworth can we keep it here? This issue is as much about the interface (including enum values) as it is fingerprinting concerns. As-is, the PR would add a colorGamut property to MediaCapabilities that does yet exist. A handful of folks were concerned this is not quite right, so we should get consensus on that before landing a PR to add it.

chcunningham commented 5 years ago

Greg closed the separate issue (thanks). @mwatson2 @jernoble @jpiesing interested to continue the discussion re: colorGamut vs ISO_IEC_23001-8_2016. See my earlier comment.

gregwhitworth commented 5 years ago

can we keep it here?

@chcunningham that's fine, this thread has already numerous issues so let's keep it here.

With regards to your feedback on colorGamut, let's tackle the CSS one first, the CSS spec states:

The color-gamut media feature describes the approximate range of colors that are supported by the UA and output device. That is, if the UA receives content with colors in the specified space it can cause the output device to render the appropriate color, or something appropriately close enough.

This implies that they're overloading color-gamut for both the rendering capabilities and the display capabilities. That said, a bit further down when defining the color spaces it says:

The output device can support approximately the sRGB gamut or more.

So this seems to contradict the first item as you stated and is only about the display, not the rendering capabilities & the display. I can file an issue and follow up with the CSSWG on a call following TPAC to see which direction they intended for this and we can either amend our spec to build on top of theirs. Or we can see if they'll have the color spec adjusted to align the color space definitions with the earlier paragraph as it doesn't make sense to go down a code path for a color space that the display can support but the UA can't adequately render. I personally think that we want to adjust the spec to the following (for all of the color space definitions):

The output device and the UA can support approximately the sRGB gamut or more.

Would that be sufficient?

gregwhitworth commented 5 years ago

@chcunningham @mwatson2 @jernoble @jpiesing I presume I should move forward with opening an issue on the CSSWG to fix the contradictions between their propdef of color-gamut and that of the color space definitions; correct?

mwatson2 commented 5 years ago

Regarding whether we need to separately specific matrix coefficients, to completely make sense of decoded pixel data you need to know full range flag, eotf, matrix coefficients and color primaries:

the full range flag and eotf specify the mapping between integer values and linear light
the matrix coefficients specific the mapping between YCbCr and RGB
the color primaries tell you exactly what colors R, G and B are

When labelling a video stream, the values of all of these things are known and there is little reason not to declare them all in the codec string. This is just accurate labeling of a stream.

For capability discovery we can get away with a smaller set when it is known that all devices support all relevant values of one of these. Many of the values for color primaries and matrix coefficients in the codec-independent code points document are not relevant in a web context. Specifically, we only care about SDR (709) and BT.2020 for color primaries and there is only one matrix coefficients value used with 709.

I am actually not sure whether it is the case that only one value of the full range flag is used in practice or whether devices universally support both values, but I infer from the lack of problems related to this flag that one of these is true ;-) Same for the two values of matrix coefficients associated with BT.2020, though I do know here that the 'constant luminance' one is not widely supported if at all.

So, for capability discovery we are probably fine with TF and color primaries. Matrix coefficients could be added later on if someone has support for BT.2020 constant luminance and wants that to be discoverable. But this is not so likely to happen as I doubt people will want to double up their streams for the small benefit this option provides.

jernoble commented 5 years ago

@mwatson2 said:

When labelling a video stream, the values of all of these things are known and there is little reason not to declare them all in the codec string.

We’ve been down this road before with EME. Existing codec strings don’t carry this information, and bodies that standardize them are very resistant to putting stream characteristics into the codec string. So not only will this not work for existing codecs and containers, it’s unlikely to work universally for future codecs and containers as well. I don’t think we’re going to be able to get away with putting all this information into the content type.

vi-dot-cpp commented 5 years ago

Thanks everyone for the feedback. Based on our discussion, I have updated #124 to include the following:

transferFunction and colorGamut (as defined by 'color-gamut' in CSS Media Queries) in VideoConfiguration.
updated fingerprinting information in the nonnormative fingerprinting section.

The update is based on:

@jernoble's fingerprinting analysis, and our agreement that best practices have been met.
The discussion regarding colorGamut. Namely that CSS Media Queries defn covers display and UA (but may need update for individual color space defns), and color primaries are sufficient for web capabilities detection.

gregwfreedman commented 5 years ago

i think we're conflating the color-gamut media-query and the ColorGamut enum.

the color-gamut media-query takes a ColorGamut enum as input and tests support by the UA and the output device. the ColorGamut enum values only represent a color space, nothing more. it is the color-gamut media-query which is returning device information for a given color space.

the proposal here is to add the ColorGamut enum to represent a color space, without the color-gamut semantics.

gregwhitworth commented 5 years ago

@gregwfreedman valid point that it's an enum and not necessary what's doing the evaluation of support. That said, I went ahead and filed an issue with the CSSWG spec and they'll be fixing it to reflect rendering & display. https://github.com/w3c/csswg-drafts/issues/4281

@vi-dot-cpp you should be able to either change your PR for this to be a note or remove the description altogether because the CSS spec will be the definition you're expecting.

chcunningham commented 5 years ago

FYI, I'll be largely out of office next week as I head to Japan and squeeze in some tourism before TPAC. Looking forward to a f2f chat!

the proposal here is to add the ColorGamut enum to represent a color space, without the color-gamut semantics.

This is how I understood the proposal. Just want to make sure it has everything we need. Interested to hear @mwatson2 come back on @jernoble's last comment.

@vi-dot-cpp - the PR presently says "The ColorGamut represents the color gamut supported by the UA and output device." I follow that this is the CSS wording, but we should somehow call out that calls to decodingInfo() actually aren't checking the output device. IIUC the plan has been to leave output device queries to the Screen API, meaning color gamut for decodingInfo() is purely a question of what the UA supports.

vi-dot-cpp commented 5 years ago

(@chcunningham) "...but we should somehow call out that calls to decodingInfo() actually aren't checking the output device. IIUC the plan has been to leave output device queries to the Screen API, meaning color gamut for decodingInfo() is purely a question of what the UA supports."

Correct me if I misunderstand -- will there not be UAs for whom decodingInfo() checks the attached screen, e.g., Cast?

Looking forward to a f2f chat!

Some of us will regrettably miss this opportunity; Is calling in an option?

mwatson2 commented 5 years ago

@jernoble wrote:

We’ve been down this road before with EME. Existing codec strings don’t carry this information, and bodies that standardize them are very resistant to putting stream characteristics into the codec string. So not only will this not work for existing codecs and containers, it’s unlikely to work universally for future codecs and containers as well. I don’t think we’re going to be able to get away with putting all this information into the content type.

The VP9 and AV1 codec strings carry this information, but I understand others don't. Let me clarify my point though: I was not proposing we use codec strings for capability discovery past the identification of the codec that is common. I was pointing out the difference between describing stream properties and discovering capabilities, since someone had mentioned that I had argued for matrix coefficients as an item in the VP9 codec string, but in this discussion I think we don't need it.

If you are describing stream properties, then these are just descriptive values and you might as well include everything to be fully descriptive. When discovering capabilities the task may be simplified by known facts of the form "all implementations that support X also support Y" or "no implementation exists that supports P with Q". We don't need to separately specify matrix coefficients for discovery since there is only one relevant value for each color gamut.

Also, in future, if necessary, new capability discovery fields can be added when new capabilities are added to an implementation but it would be much harder to add a field to the codec string since that has no forwards compatibility mechanism and is embedded in many implementations.

poolec commented 5 years ago

Just reviewing the PR and trying to understand what is now being proposed, this text seems ambiguous:

The hasHdrCapabilities member represents all HDR-relevant color gamuts (sRGB, p3, rec2020) and transfer function (sRGB, pq, hlg).

Does hasHdrCapabilities mean all of sRGB, p3 and rec2020 need to be supported and all of sRGB, pq and hlg need to be supported as the current text implies? Or is it intended to be a query covering all capabilities that is considered supported if at least one HDR-relevant color gamut and transfer function is supported (in which case, why list sRGB)?

If we're aiming to have just one boolean then I can see pros and cons with either interpretation and which is best rather depends on how likely it is that a device will support some but not all of the capabilities listed.

At the very least, the wording needs tightening to be clear what is being described.

chcunningham commented 5 years ago

Correct me if I misunderstand -- will there not be UAs for whom decodingInfo() checks the attached screen, e.g., Cast?

This is true, but I think we have to be careful about when we explicitly mention the screen to avoid confusing the reader. The current language makes it sound as if we will only return support for rendering a specific color gamut if the attached screen also supports outputting this gamut. We want to avoid that coupling (having screen output capabilities addressed by the Screen API).

When I mentioned the Cast example earlier this was to motivate the inclusion of eotf. In these cases, the screen line between the display and UA are blurred. There will also be cases where the UA software runs entirely within the display (Smart TVs). But we don't need to bring attention to this fact in the spec because it isn't important for sites to know and it implies the coupling I mention above. IMO the way to draw the line is to continue separate Screen vs Decoding+Rendering such that we only put things on Screen that were traditionally Screen properties (before screens starting building in computers) - things like dimensions, color gamut, hdr support. SmartTVs that act as a UA + Display can continue to answer the non-Screen decodingInfo() questions in the same way we would for a traditional desktop + display.

vi-dot-cpp commented 5 years ago

It was nice to speak with everyone at the TPAC face-to-face and get agreement on this issue. #124 has been updated to reflect suggestions surfaced here and at TPAC.

rdoherty0 commented 5 years ago

I don't like adding vendor-specific names to specifications, so I'm hesitant to enshrine "DolbyVision" into Media Capabilities. I proposed a something similar in #110, but using transfer function, color space, and bit depth.

I realize this is a comment from some time ago, but it may be important to note that Dolby Vision is a superset of SMPTE 2094-10, particularly when it comes to OTT video distribution. See https://www.dolby.com/us/en/technologies/dolby-vision/dolby-vision-profiles-levels_v1.3.2.pdf

I believe this is why the vendor strings were chosen for Android: https://developer.android.com/reference/android/view/Display.HdrCapabilities.html

jernoble commented 5 years ago

@rdoherty0, could you clarify: I don’t see any reference to SMPTE 2094-10 in that document, only SMPTE 2086.

When you say “superset”, do you mean that the bitstream carries multiple metadata formats at the same time? Or that the bitstream is capable of carrying one out of a defined set of metadata formats? The “BL signal cross-compatibility ID” section seems to indicate the latter.

rdoherty0 commented 5 years ago

@rdoherty0, could you clarify: I don’t see any reference to SMPTE 2094-10 in that document, only SMPTE 2086.

When you say “superset”, do you mean that the bitstream carries multiple metadata formats at the same time? Or that the bitstream is capable of carrying one out of a defined set of metadata formats? The “BL signal cross-compatibility ID” section seems to indicate the latter.

There is a lot to unpack here, unfortunately. Your second statement is closer to the truth: there is one complete metadata set per stream. There is more documentation from Dolby here which documents the inclusion of Dolby Vision streams into various formats (DASH, for example): https://www.dolby.com/us/en/technologies/dolby-vision/dolby-vision-for-creative-professionals.html#5

The 2094-10 metadata is used in several standards' based efforts, including ATSC and DVB, and specified in DASH-IF IOP spec. But most Dolby Vision profiles extend this metadata, including the composing metadata specified in the ETSI specification (https://www.etsi.org/deliver/etsi_gs/CCM/001_099/001/01.01.01_60/gs_CCM001v010101p.pdf), which does reference SMPTE 2094-10.

Most online distribution is using Dolby Vision profiles 5 or 8.1.

I would suggest none of this complexity needs to be exposed at this API layer, the simple existence bit as proposed is ok, but it would be not accurate to label the Dolby Vision "family" of HDR metadata as SMPTE 2094-10.

chcunningham commented 5 years ago

Celebrate!!! PR #124 is merged! This includes all the bits we agreed to in this discussion and tpac. It does not include the Screen API changes that are still under discussion.

I'm going to close the out and file a separate issue to see if we should make any revision for the points raised by @rdoherty0.

Thanks everyone!

w3c / media-capabilities

Add Support for Querying HDR Decode and Render Capabilities #118

1. Define HdrCapability Enum

2. Add HdrCapability Enum to VideoConfiguration