[Identity][Enhancement] Expose contentHint

eladalon1983 commented 2 years ago

Content hints allow an application to tell the encoder what type of content to expect, and therefore which type of encoding is likely to work best. It is up to the capturing application to deliver the content hint to the encoder, but it is the captured application that has this information. It would be good if there were a standard way for the capturee to suggest a content hint to the capturer. If the capturer wishes, it can then use that suggestion.

That is:

Capturee calls setCaptureHandleConfig with a config that includes two new fields, suggestedAudioContentHint and suggestedVideoContentHint.
Capturer can ignore or use. If it chooses to use, then it sets mst.contentHint based on this. (Probably to the exact value suggested, but not necessarily.)

Suggested API:

dictionary CaptureHandleConfig {
  boolean exposeOrigin = false;  // Existing
  DOMString handle = "";  // Existing
  sequence<DOMString> permittedOrigins = [];  // Existing
  DOMString suggestedAudioContentHint = "";  // NEW
  DOMString suggestedVideoContentHint = "";  // NEW
};

And the algorithm for setCaptureHandleConfig can validate that the suggested values are valid content hints. (It doesn't have to; this is open for discussion.)
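As a rough sketch of what such validation could look like: the helper name `validateSuggestedHints` and the `SuggestedHints` type are made up for illustration and are not part of any spec. The allowed values are the ones listed in the MediaStreamTrack Content Hints draft.

```typescript
// Hint values defined by the MediaStreamTrack Content Hints draft.
const AUDIO_HINTS: readonly string[] = ["", "speech", "speech-recognition", "music"];
const VIDEO_HINTS: readonly string[] = ["", "motion", "detail", "text"];

// Hypothetical shape holding only the two proposed config fields.
interface SuggestedHints {
  suggestedAudioContentHint?: string;
  suggestedVideoContentHint?: string;
}

// Returns a list of problems; an empty list means the config would pass.
// Whether a real implementation throws or silently ignores is the open question.
function validateSuggestedHints(config: SuggestedHints): string[] {
  const problems: string[] = [];
  const audio = config.suggestedAudioContentHint ?? "";
  const video = config.suggestedVideoContentHint ?? "";
  if (AUDIO_HINTS.indexOf(audio) === -1) {
    problems.push(`invalid audio hint: ${audio}`);
  }
  if (VIDEO_HINTS.indexOf(video) === -1) {
    problems.push(`invalid video hint: ${video}`);
  }
  return problems;
}
```

Note that "text" is a valid video hint but not a valid audio hint, which is one reason to keep the two fields separate.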

Then:

dictionary CaptureHandle {
  DOMString origin;  // Existing
  DOMString handle;  // Existing
  DOMString suggestedContentHint;  // NEW
};

One thing we'll be adding here is that we'll expose captureHandle on all tracks returned by getDisplayMedia. They'll be identical in some fields (origin, handle) and distinct in others (suggestedContentHint).
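On the capturer side, this could look roughly like the sketch below. The `TrackSketch` and `HandleSketch` interfaces are minimal stand-ins I made up so the snippet is self-contained; `suggestedContentHint` is the proposed field and does not exist today.

```typescript
// Minimal stand-in for the proposed CaptureHandle dictionary.
interface HandleSketch {
  origin?: string;
  handle?: string;
  suggestedContentHint?: string; // proposed, not shipped
}

// Minimal stand-in for the parts of MediaStreamTrack used here.
interface TrackSketch {
  kind: "audio" | "video";
  contentHint: string;
  getCaptureHandle(): HandleSketch | null;
}

// Copies each track's suggested hint into its contentHint, if one was suggested.
// Tracks without a capture handle or without a suggestion are left untouched.
function applySuggestedHints(tracks: TrackSketch[]): void {
  for (const track of tracks) {
    const suggestion = track.getCaptureHandle()?.suggestedContentHint;
    if (suggestion) {
      track.contentHint = suggestion;
    }
  }
}
```

With real tracks, the capturer would call something like this on the result of `stream.getTracks()` right after getDisplayMedia resolves.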

aboba commented 2 years ago

I think this could be useful, because the Content Hint might change over time. As a result, just knowing the captured application might not be sufficient. For example, let's say you are doing a slide presentation. Most of the presentation is slides with text, so the "text" content-hint is appropriate for those slides. However, in the middle of the presentation you include a slide with an image on it (e.g. a picture of a bird). Now the "detail" content-hint would be more appropriate. Or perhaps your slide presentation has an embedded video. Once you start to play the video, the "motion" content-hint would be appropriate. oncapturehandlechange would allow the capturer to obtain the Content-Hint as it changes.
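A minimal sketch of following the hint as it changes, assuming the proposed `suggestedContentHint` field exists. `ChangingTrackSketch` is a made-up stand-in for the relevant parts of MediaStreamTrack, and resetting to "" when there is no suggestion is my own choice here, not something the proposal specifies.

```typescript
// Made-up stand-in for the parts of a captured track used in this sketch.
interface ChangingTrackSketch {
  contentHint: string;
  getCaptureHandle(): { suggestedContentHint?: string } | null;
  oncapturehandlechange: (() => void) | null;
}

// Keeps track.contentHint in sync with the capturee's latest suggestion.
function followSuggestedHint(track: ChangingTrackSketch): void {
  const sync = (): void => {
    // "" means no hint, which leaves the decision to the user agent.
    track.contentHint = track.getCaptureHandle()?.suggestedContentHint ?? "";
  };
  track.oncapturehandlechange = sync; // re-run whenever the handle changes
  sync(); // apply the current suggestion immediately
}
```

In the slide-deck example, the capturee would update its config as the presenter moves from a text slide to an embedded video, and the handler would pick up "text" and then "motion".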

eladalon1983 commented 1 year ago

Thank you, Bernard. @youennf and @jan-ivar, any opinions before I send a PR?

youennf commented 1 year ago

IIRC, this was discussed at an interim and there was feedback questioning the actual usefulness. I do not remember the conclusion of this discussion though.

If using WebRTC, there is no need for content hints to be exposed to capturer, User Agent is smart enough to optimise things. From the issue's description, I am not sure what exactly you are trying to solve, can you clarify this?

Also, I am wondering whether this API shape is future proof. For instance, you might require different content hints if starting to crop capture. Given the main goal of capture handle is to allow the creation of a server-based communication channel between capturer and capturee, it seems best to simply use this channel to convey that information.

eladalon1983 commented 1 year ago

[Reordered some of the responses in the interest of readability; the first one hopefully makes it clear why.]

From the issue's description, I am not sure what exactly you are trying to solve, can you clarify this?

Yes, I would love to clarify:

Networks are imperfect, so encoded video has to make sacrifices.
Sometimes it is better to sacrifice frame-rate; sometimes resolution. It depends on the encoded content.
Capturing applications can make better decisions if they know what type of content they're capturing.
By amending Capture Handle with the proposal in the current issue, a captured application can help the capturing application make better decisions.
The user agent could sometimes help, but not always, because auto-detection of the captured content is imperfect; see below.

IIRC, this was discussed at an interim and there was feedback questioning the actual usefulness. I do not remember the conclusion of this discussion though.

I don't remember anyone proving that this is not useful.

We have multiple teams inside of Google who are interested in using this.
Microsoft has expressed interest in this. (See @aboba's comment.)

If someone thinks this is NOT useful, the onus is on them to prove as much.

I can tell you that internally inside Google, some have questioned why auto-detection could not be used instead. My answer is that auto-detection is imperfect and can misfire (more below). The correct algorithm for a capturer-encoder should be:

If a suggestedContentHint has been set, use it. (It's possible to disregard if untrusted, but I'd just use it myself; malicious apps would just self-sabotage.)
Otherwise, if no suggestedContentHint has been set, let the UA use auto-detection.
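The two steps above, plus the optional trust check, can be sketched as follows. The names `decideContentHint` and `HintDecisionInput` are hypothetical, and returning "" is used here to stand for "leave it to the UA's auto-detection".

```typescript
// Hypothetical input: the fields a capturer would read from the capture handle.
interface HintDecisionInput {
  suggestedContentHint?: string; // proposed field; may be absent or ""
  captureeOrigin?: string;       // CaptureHandle.origin, if the capturee exposed it
}

// Returns the value the capturer would assign to track.contentHint.
// trustedOrigins === null means "trust every capturee".
function decideContentHint(
  input: HintDecisionInput,
  trustedOrigins: string[] | null
): string {
  const suggestion = input.suggestedContentHint ?? "";
  if (suggestion === "") {
    return ""; // no suggestion: let the UA auto-detect
  }
  if (trustedOrigins !== null) {
    const captureeOrigin = input.captureeOrigin;
    if (captureeOrigin === undefined || trustedOrigins.indexOf(captureeOrigin) === -1) {
      return ""; // untrusted or unknown capturee: disregard the suggestion
    }
  }
  return suggestion; // a suggestion was set: use it
}
```

Passing `null` for `trustedOrigins` corresponds to what I'd do myself, which is to just use whatever the capturee suggests.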

User Agent is smart enough to optimise things

Optimizations can misfire. Consider:

Mixed content - text and video. Can the UA decide which is more important?
Transitions - can the UA guess what content is coming next? Can it understand that this frame will soon be replaced by more video?

Also, I am wondering whether this API shape is future proof. For instance, you might require different content hints if starting to crop capture.

I aim to make incremental progress. If you can propose a larger increment, I am happy to adopt it. Barring that, let's proceed with the best we can think of.

Given the main goal of capture handle is to allow the creation of a server-based communication channel between capturer and capturee

Citation needed.

it seems best to simply use this channel to convey that information.

Why incur the network delay?
Why force tight-coupling between capturer and capturee? With my proposal, Meet/Teams/Jitsi can all work equally well with Docs/Office/Wikipedia. Is that not a Good Thing TM?

dontcallmedom commented 1 year ago

This was discussed in the April 2022 meeting

eladalon1983 commented 1 year ago

This was discussed in the April 2022 meeting

Thanks. I see the following line in the minutes:

jib: I see agreement on the need, not yet on the API shape

So there was mostly agreement on usefulness, @youennf.

youennf commented 1 year ago

To be clear, the idea of capturee trying to help capturer or UA with encoding seems fine. My questions are more related to whether/how this info gets exposed to/used by capturer.

Some thoughts:

If there is tight coupling between capturer and capturee, this API is not needed, or more precisely this is just a small optimization, so low in priority.
In the short term, content hint can already be provided as part of the handle value. This might not be perfect in terms of separation of concerns, but early adopters can use this approach today in Chrome. Let's take the time to do the best design we can.
In another short term, the UA could use that content hint automatically (at least in RTCPeerConnection).
If there is no tight coupling between capturer and capturee, how is capturer supposed to interpret capturee content hint? Should it trust it or not? Maybe capturee input is only valid in a given context (say encoder is VP8) but is not good for other contexts (say encoder is H264).
In a world of VideoFrames, it seems this hint could be exposed as a VideoFrame metadata.
This API is not scalable as it is. As I said, just providing one content hint might not be enough once region capture is there. For instance, maybe capturee will only provide a content hint that is meaningful after cropping is done but some capturers may not do cropping.
I wonder whether handle should be an object (structure clonable or something like that) instead of a string. This way, the handle could contain some structured information (including CropTarget, content hints and so on).

eladalon1983 commented 1 year ago

If there is tight coupling between capturer and capturee, this API is not needed, or more precisely this is just a small optimization, so low in priority.

I'd phrase it differently.

If there is NO tight coupling, then this API is the only way to accomplish hinting.
If there IS tight coupling, then this API is still VERY helpful, because:
- Instantaneous hinting with no bandwidth overhead.
- No need for each capturer/capturee pair to reinvent a message type for the hint.

In the short term, content hint can already be provided as part of the handle value.

Only between tightly coupled apps, since the handle is not a structured field, so it won't be clear where the hint lies and where other information is stored. For example, one capturee could set it as "session: 142, hint: HINT" while another capturee sets it as "HINT", and the capturer might know neither of them. My proposed API solves this.
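To illustrate the ambiguity, here is a sketch of a capturer that assumes one particular ad-hoc encoding of the handle string. The parser and the "key: value" convention are made up for this example; nothing in Capture Handle defines them.

```typescript
// Hypothetical parser for handles shaped like "session: 142, hint: text".
// Returns undefined when no "hint" key is found.
function hintFromKeyValueHandle(handle: string): string | undefined {
  for (const part of handle.split(",")) {
    const [key, value] = part.split(":").map((piece) => piece.trim());
    if (key === "hint" && value) {
      return value;
    }
  }
  return undefined;
}
```

A capturee that sets the handle to "session: 142, hint: text" is understood, but one that sets it to just "text" is not, even though both meant the same thing. A dedicated field avoids each pair of apps having to agree on a format.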

Let's take the time to do the best design we can.

Let's always do our best. Time-wise, how long should this take?

In another short term, the UA could use that content hint automatically (at least in RTCPeerConnection).

Whether UAs use such an optimization seems out of scope for our discussion, as Web-devs with a stake have already agreed that they need more than what automatic optimization can offer.

If there is no tight coupling between capturer and capturee, how is capturer supposed to interpret capturee content hint?

No coupling necessary because it's a structured field that can be passed directly into the track's contentHint field. (Debatable what to do if a capturee tries to set a value that's not a legal contentHint - throw, ignore-and-allow or ignore-and-no-op.) See slide 50.

Should it trust it or not?

Up to the capturer to decide if it should apply hints from trusted sources only. My proposal is to trust the capturee, because there is no incentive for the capturee to lie - they'd only be annoying their own users, which is not a good business model or attack vector. (I can foresee discussions of "Docs could use misleading hints that only Meet knows to ignore" and I just don't find them convincing. But if someone has such a concern, then let them ignore untrusted hints, and that's that.)

Maybe capturee input is only valid in a given context (say encoder is VP8) but is not good for other contexts (say encoder is H264).

I don't think such hints currently exist. Or do you want to file a bug against the MST Content Hint working draft? I see @alvestrand is an editor.
Suppose these exist, or will exist in the future - then we'll find a structured way to provide hint-per-codec.

In a world of VideoFrames, it seems this hint could be exposed as a VideoFrame metadata.

Capture Handle specifies events already. These are absolutely necessary, because the captured tab can be navigated. Let's take advantage of that mechanism rather than reinvent it in a new context.

This API is not scalable as it is.

Do you have a better suggestion?

For instance, maybe capturee will only provide a content hint that is meaningful after cropping is done but some capturers may not do cropping.

APIs can be misused. If you suggest a fool-proof API, I'll be happy to adopt it. Otherwise, I don't think "this is not 120% perfect" is a reason to avoid progress.

I wonder whether handle should be an object (structure clonable or something like that) instead of a string. This way, the handle could contain some structured information (including CropTarget, content hints and so on).

Adding structure is precisely what this proposal is all about.

w3c / mediacapture-handle

[Identity][Enhancement] Expose contentHint #35