Open eladalon1983 opened 2 years ago
I think this could be useful, because the Content Hint might change over time. As a result, just knowing the captured application might not be sufficient. For example, let's say you are doing a slide presentation. Most of the presentation is slides with text, so the "text" content-hint is appropriate for those slides. However, in the middle of the presentation you include a slide with an image on it (e.g. a picture of a bird). Now the "detail" content-hint would be more appropriate. Or perhaps your slide presentation has an embedded video. Once you start to play the video, the "motion" content-hint would be appropriate. oncapturehandlechange would allow the capturer to obtain the Content-Hint as it changes.
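The slide-deck scenario above can be sketched as follows. This is a minimal illustration under stated assumptions, not spec text: `suggestedContentHint` is the field name proposed in this issue and is not part of any shipped API, and `HintedHandle`, `HintableTrack` and `followSuggestedHint` are hypothetical stand-ins so the snippet runs outside a browser.

```typescript
// Hypothetical shape of a capture handle carrying the proposed hint.
interface HintedHandle {
  origin?: string;
  handle?: string;
  suggestedContentHint?: string; // assumed field name, not a shipped API
}

// Minimal stand-in for a MediaStreamTrack; only contentHint matters here.
interface HintableTrack {
  contentHint: string;
}

// Copy the capturee's latest suggestion onto the track, if there is one.
function followSuggestedHint(track: HintableTrack, handle: HintedHandle | null): void {
  const hint = handle?.suggestedContentHint;
  if (hint !== undefined) {
    track.contentHint = hint;
  }
}

// Simulate the presentation: text slides, then a photo, then an embedded video.
const capturedTrack: HintableTrack = { contentHint: "" };
const observedHints: string[] = [];
for (const hint of ["text", "detail", "motion"]) {
  followSuggestedHint(capturedTrack, {
    origin: "https://slides.example", // hypothetical capturee origin
    handle: "deck-1",
    suggestedContentHint: hint,
  });
  observedHints.push(capturedTrack.contentHint);
}
console.log(observedHints.join(" -> ")); // text -> detail -> motion
```

In a browser, a function like this would presumably be called from the track's `oncapturehandlechange` handler each time the capturee updates its config.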
Thank you, Bernard. @youennf and @jan-ivar, any opinions before I send a PR?
IIRC, this was discussed at an interim and there was feedback questioning the actual usefulness. I do not remember the conclusion of this discussion though.
If using WebRTC, there is no need for content hints to be exposed to the capturer; the User Agent is smart enough to optimise things. From the issue's description, I am not sure what exactly you are trying to solve, can you clarify this?
Also, I am wondering whether this API shape is future proof. For instance, you might require different content hints if starting to crop capture. Given the main goal of capture handle is to allow the creation of a server-based communication channel between capturer and capturee, it seems best to simply use this channel to convey that information.
[Reordered some of the responses in the interest of readability; the first one hopefully makes it clear why.]
> From the issue's description, I am not sure what exactly you are trying to solve, can you clarify this?
Yes, I would love to clarify:
> IIRC, this was discussed at an interim and there was feedback questioning the actual usefulness. I do not remember the conclusion of this discussion though.
I don't remember anyone proving that this is not useful.
If someone thinks this is NOT useful, the onus is on them to prove as much.
I can tell you that internally inside Google, some have questioned why auto-detection could not be used instead. My answer is that auto-detection is imperfect and can misfire (more below). The correct algorithm for a capturer-encoder should be:
> User Agent is smart enough to optimise things
Optimizations can misfire. Consider:
> Also, I am wondering whether this API shape is future proof. For instance, you might require different content hints if starting to crop capture.
I aim to make incremental progress. If you can propose a larger increment, I am happy to adopt it. Barring that, let's proceed with the best we can think of.
> Given the main goal of capture handle is to allow the creation of a server-based communication channel between capturer and capturee

> it seems best to simply use this channel to convey that information.
This was discussed in the April 2022 meeting
> This was discussed in the April 2022 meeting
Thanks. I see the following line in the minutes:
> jib: I see agreement on the need, not yet on the API shape
So there was mostly agreement on usefulness, @youennf.
To be clear, the idea of capturee trying to help capturer or UA with encoding seems fine. My questions are more related to whether/how this info gets exposed to/used by capturer.
Some thoughts:
- I wonder whether `handle` should be an object (structure clonable or something like that) instead of a string. This way, the `handle` could contain some structured information (including CropTarget, content hints and so on).
> - If there is tight coupling between capturer and capturee, this API is not needed, or more precisely this is just a small optimization, so low in priority.
I'd phrase it differently.
> - In the short term, content hint can already be provided as part of the handle value.
Only between tightly coupled apps. Since the `handle` is not a structured field, it won't be clear where the hint lies and where other information is stored. For example, one capturee could set it as "session: 142, hint: HINT" while another capturee sets it as "
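To make the coupling problem concrete, here is a small sketch. It assumes a capturer that only understands the `key: value, key: value` convention from the example above. The parser `parseKeyValueHandle` is hypothetical, and the JSON-style handle is an invented stand-in for "some other capturee's convention", not a format taken from this thread.

```typescript
// Parse the ad-hoc "key: value, key: value" convention used by one capturee.
function parseKeyValueHandle(handle: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const part of handle.split(",")) {
    const separator = part.indexOf(":");
    if (separator === -1) {
      continue; // not a key/value pair, skip it
    }
    fields.set(part.slice(0, separator).trim(), part.slice(separator + 1).trim());
  }
  return fields;
}

// Convention from the example above.
const keyValueHandle = "session: 142, hint: text";
// Invented second convention (JSON), purely for illustration.
const jsonHandle = '{"session":142,"hint":"text"}';

const hintFromKeyValue = parseKeyValueHandle(keyValueHandle).get("hint");
const hintFromJson = parseKeyValueHandle(jsonHandle).get("hint");

console.log(hintFromKeyValue); // text
console.log(hintFromJson);     // undefined: the keys come out as {"session" and "hint" (with quotes)
```

A dedicated structured field avoids this: every capturer reads the hint from the same place, whatever else the capturee chooses to put in `handle`.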
> Let's take the time to do the best design we can.
Let's always do our best. Time-wise, how long should this take?
> - In another short term, the UA could use that content hint automatically (at least in RTCPeerConnection).
Whether UAs use such an optimization seems out of scope for our discussion, as Web-devs with a stake have already agreed that they need more than what automatic optimization can offer.
> - If there is no tight coupling between capturer and capturee, how is capturer supposed to interpret capturee content hint?
No coupling is necessary, because it's a structured field that can be passed directly into the track's `contentHint` field. (Debatable what to do if a capturee tries to set a value that's not a legal contentHint: throw, ignore-and-allow or ignore-and-no-op.) See slide 50.
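The three options for an illegal value can be compared side by side in a sketch. The reading of each option here is an assumption on my part: "throw" rejects the call, "ignore-and-allow" drops only the bad hint and accepts the rest of the config, and "ignore-and-no-op" drops the whole call and keeps the previous config. `SketchConfig`, `resolveConfig` and the `suggestedContentHint` field are hypothetical names; only the video hint values are checked.

```typescript
// Video values accepted by MediaStreamTrack.contentHint (audio values omitted).
const VIDEO_HINTS: readonly string[] = ["", "motion", "detail", "text"];

type InvalidHintPolicy = "throw" | "ignore-and-allow" | "ignore-and-no-op";

interface SketchConfig {
  handle?: string;
  suggestedContentHint?: string; // assumed field name
}

// Return the config that would be in effect after the call.
function resolveConfig(
  current: SketchConfig,
  requested: SketchConfig,
  policy: InvalidHintPolicy,
): SketchConfig {
  const hint = requested.suggestedContentHint;
  if (hint === undefined || VIDEO_HINTS.indexOf(hint) !== -1) {
    return requested; // nothing to reject
  }
  if (policy === "throw") {
    throw new TypeError(`"${hint}" is not a valid content hint`);
  }
  if (policy === "ignore-and-allow") {
    // Drop only the bad hint; the rest of the config still takes effect.
    const accepted: SketchConfig = { ...requested };
    delete accepted.suggestedContentHint;
    return accepted;
  }
  // "ignore-and-no-op": the whole call is dropped and the old config stays.
  return current;
}

const previousConfig: SketchConfig = { handle: "deck-1", suggestedContentHint: "text" };
const badConfig: SketchConfig = { handle: "deck-2", suggestedContentHint: "birds" };

const allowedResult = resolveConfig(previousConfig, badConfig, "ignore-and-allow");
const noOpResult = resolveConfig(previousConfig, badConfig, "ignore-and-no-op");
console.log(allowedResult.handle, allowedResult.suggestedContentHint); // deck-2 undefined
console.log(noOpResult.handle, noOpResult.suggestedContentHint);       // deck-1 text
```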
> Should it trust it or not?
Up to the capturer to decide if it should apply hints from trusted sources only. My proposal is to trust the capturee, because there is no incentive for the capturee to lie - they'd only be annoying their own users, which is not a good business model or attack vector. (I can foresee discussions of "Docs could use misleading hints that only Meet knows to ignore" and I just don't find them convincing. But if someone has such a concern, then let them ignore untrusted hints, and that's that.)
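A capturer that prefers to honor hints from trusted sources only could gate on the handle's `origin`. The sketch below is one possible policy, not a recommendation from the proposal: `hintToApply`, the `HintedHandle` shape and the example origins are all hypothetical, and passing `null` for the allowlist models the "trust every capturee" default argued for above.

```typescript
// Hypothetical shape of a capture handle carrying the proposed hint.
interface HintedHandle {
  origin?: string;
  handle?: string;
  suggestedContentHint?: string; // assumed field name
}

// Return the hint the capturer should apply, or undefined to keep its own choice.
// trustedOrigins === null means "trust every capturee".
function hintToApply(
  handle: HintedHandle,
  trustedOrigins: ReadonlySet<string> | null,
): string | undefined {
  if (handle.suggestedContentHint === undefined) {
    return undefined; // capturee made no suggestion
  }
  if (trustedOrigins === null) {
    return handle.suggestedContentHint;
  }
  if (handle.origin !== undefined && trustedOrigins.has(handle.origin)) {
    return handle.suggestedContentHint;
  }
  return undefined; // suggestion from an untrusted or unknown origin
}

const allowlist = new Set(["https://docs.example"]);
const fromDocs: HintedHandle = {
  origin: "https://docs.example",
  handle: "abc",
  suggestedContentHint: "text",
};
const fromUnknown: HintedHandle = {
  origin: "https://other.example",
  handle: "xyz",
  suggestedContentHint: "motion",
};

console.log(hintToApply(fromDocs, allowlist));    // text
console.log(hintToApply(fromUnknown, allowlist)); // undefined
console.log(hintToApply(fromUnknown, null));      // motion
```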
> Maybe capturee input is only valid in a given context (say encoder is VP8) but is not good for other contexts (say encoder is H264).

> - In a world of VideoFrames, it seems this hint could be exposed as a VideoFrame metadata.
Capture Handle specifies events already. These are absolutely necessary, because the captured tab can be navigated. Let's take advantage of that mechanism rather than reinvent it in a new context.
> - This API is not scalable as it is.
Do you have a better suggestion?
> For instance, maybe capturee will only provide a content hint that is meaningful after cropping is done but some capturers may not do cropping.
APIs can be misused. If you suggest a fool-proof API, I'll be happy to adopt it. Otherwise, I don't think "this is not 120% perfect" is a reason to avoid progress.
> - I wonder whether handle should be an object (structure clonable or something like that) instead of a string. This way, the handle could contain some structured information (including CropTarget, content hints and so on).
Adding structure is precisely what this proposal is all about.
Content Hints allow an application to tell the encoder what type of content to expect, and therefore which type of encoding might be best. It is up to the capturing application to deliver the content hint to the encoder. But it is the captured application which has this information. It would be good if there were a standard way for the capturee to suggest a content-hint to the capturer. If the capturer wishes, it can then use that suggestion.
That is:
- The capturee calls `setCaptureHandleConfig` with a `config` that includes two fields, `suggestedContentHint`.
- The capturer sets `mst.contentHint` based on this. (Probably to the exact value suggested, but not necessarily.)

Suggested API:

And the algorithm for `setCaptureHandleConfig` can validate that the hints must be valid hints. (Doesn't have to - open for discussion.)

Then:

One thing we'll be adding here is that we'll expose `captureHandle` on all tracks returned by `getDisplayMedia`. They'll be identical in some fields (origin, handle) and distinct in others (suggestedContentHint).
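The per-track behavior described in the last paragraph could look roughly like the sketch below. Everything in it is an assumption made for illustration: the issue does not spell out the two config fields here, so `suggestedVideoContentHint` and `suggestedAudioContentHint` are guessed names, and `SketchCaptureHandleConfig`, `PerTrackHandle` and `handleForTrack` are hypothetical helpers rather than proposed API.

```typescript
type TrackKind = "audio" | "video";

// Guessed config shape; the two hint field names are NOT from the issue text.
interface SketchCaptureHandleConfig {
  handle: string;
  suggestedVideoContentHint?: string;
  suggestedAudioContentHint?: string;
}

// What each captured track would expose to the capturer.
interface PerTrackHandle {
  origin: string;
  handle: string;
  suggestedContentHint?: string;
}

// Derive the capture handle seen on one track: origin and handle are shared,
// while the suggested hint depends on the track's kind.
function handleForTrack(
  captureeOrigin: string,
  config: SketchCaptureHandleConfig,
  kind: TrackKind,
): PerTrackHandle {
  const hint =
    kind === "video" ? config.suggestedVideoContentHint : config.suggestedAudioContentHint;
  const result: PerTrackHandle = { origin: captureeOrigin, handle: config.handle };
  if (hint !== undefined) {
    result.suggestedContentHint = hint;
  }
  return result;
}

const slidesConfig: SketchCaptureHandleConfig = {
  handle: "deck-1",
  suggestedVideoContentHint: "motion",
  suggestedAudioContentHint: "music",
};
const videoHandle = handleForTrack("https://slides.example", slidesConfig, "video");
const audioHandle = handleForTrack("https://slides.example", slidesConfig, "audio");

console.log(videoHandle.origin === audioHandle.origin); // true
console.log(videoHandle.handle === audioHandle.handle); // true
console.log(videoHandle.suggestedContentHint, audioHandle.suggestedContentHint); // motion music
```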