Why expose produceCropTarget at MediaDevices level?

youennf commented 2 years ago

Looking at the algorithm, it seems the same CropTarget object will be created if produceCropTarget is called several times on the same element. That would probably be specified by adding a CropTarget slot on HTMLElement directly. This would also solve probably the cloning element cropTarget issue.

The second question is whether below API might not be better suited:

partial interface HTMLElement {
  readonly attribute Promise<CropTarget> cropTarget;
}

I am also wondering why we even need a promise there. It seems implementations could produce a cropTarget synchronously without any IPC, which would end up with:

partial interface HTMLElement {
  readonly attribute CropTarget cropTarget;
}

eladalon1983 commented 2 years ago

That would probably be specified by adding a CropTarget slot on HTMLElement directly.

Exposing the interface on HTMLElement is possible, but I think exposing on something more media-related or RTC-related gives better encapsulation. The vast majority of HTMLElements do not end up serving as a crop-target. It's best of both the controls as well as their documentation are more localized to where they're relevant.

This would also solve probably the cloning element cropTarget issue.

I don't think so, but possibly I am missing your point. As I see it, there are two "cloning issues" at play:

Cloning of the target. A clone of an HTMLElement is a distinct HTMLElement, and should have a new CropTarget. This is equally easy to express with either approach (MediaSession.produceCropTarget vs. HTMLElement.cropTarget). In terms of communicating the selected approach, I think it's neater to express with produceCropTarget than with the alternative, but possibly this is subjective.
Cloning of the track. I believe this is orthogonal.

I am also wondering why we even need a promise there.

The intention is:

Allow produceCropTarget() to execute quickly for applications that don't immediately need the result. (E.g. cache it for later, in case gDM is called.) Allow JS execution to proceed immediately without blocking on IPC message-and-response.
Allow implementations flexibility (explained below).

I'm using the working assumption that we agree on no1 (let's discuss oherwise) and moving to discuss no2. Consider the example of Chrome. Returning a Promise gives Chrome the flexibility to produce the CropTargets' underlying ID in what we call the "browser process". This means that when cropTo(x) is called, we can validate x in the (trusted) browser process, check that it's a valid CropTarget (always necessary), and that the caller is allowed to cropTo it (relevant while we limit the spec to self-capture; orthogonal issue). JS execution proceeds immediately after produceCropTarget(), but an IPC message is sent to the browser process, and the Promise is resolved when the response IPC comes from the browser process.

Using a Promise also makes it easy to convince ourselves that Chrome's implementation suffers no hidden race conditions. Consider if a Promise were not employed in this scenario:

Cross-origin documents D1 and D2 cohabitate in a tab. Each documents has its own associated process.
D1 produces a CropTarget. D1 immediately gets a token before any IPC happens, but some IPC would have to complete before all relevant processes hear of this CropTarget.
(For the sake of argument, let's assume that CropTarget is both transferable as well as serializable.)
D1 sends the CropTarget to D2. Maybe using postMessage, maybe over the network.
D2 tries to use CropTarget.

In this scenario, it is required that the CropTarget D1 minted would be known to the cropTo() code-paths before D2 tries to call cropTo(), or that the cropTo() code-paths be robust to such reordering. This can be hard to reason about. But if we return a Promise, then we get no1 immediately, and we can resolve the Promise at our leisure, when it's patently obvious that the CropTarget has been fully propagated.

Note that the Promise gives flexibility. If Safari's implementation can convince itself there is no race even if the CropTarget is produced immediately, then Safari can return an already-resolved Promise.

Editorial: To make this thread more accessible, could you please edit the original post and add ```webidl before the code snippets before and ``` after them?

youennf commented 2 years ago

exposing on something more media-related or RTC-related gives better encapsulation.

Partial interfaces seem sufficient to me. Tieing produceCropTarget to MediaDevices creates some edge cases we should not have to deal with. For instance, what happens in the case of calling produceCropTarget on a detached iframe for a HTMLElement which is not part of the detached iframe?

Calling twice produceCropTarget on the same element is expected to produce the same CropTarget, which might be surprising given it is a method. On the other hand, this is by definition what is expected of attributes so it seems best to model this API as an attribute (basically as a slot in HTMLElement).

As of promise attribute vs. sync attribute, from what I read, using promises here is to accommodate a particular browser implementation. In general, we reserve promises to asynchronous operations, like querying hardware, querying centralised resources, querying user input. This is clearly not the case here, CropTarget is nothing more than a unique ID. It does not seem hard to make it work synchronously without any race condition.

Editorial: To make this thread more accessible, could you please edit the original post and add webidl before the code snippets before and after them?

Done

eladalon1983 commented 2 years ago

How common is it for specs to plug additional attributes directly onto HTMLElement? Checking the Chromium codebase just because that's the quickest way for me, it seems like it's exceedingly rare. I expect that the Chromium implementation is a good approximation of what W3C specifications do, but if my hack at getting this information was misguided, please let me know.

As for extending HTMLElement, that's of course quite common - but that's not what you're proposing. :-)

Btw, is it actually possible to expose a new field on all HTMLElements without incurring any cost? I'm not sufficiently familiar with how this is implemented in Chrome, let alone in other browsers, but I'd expect a non-zero cost for this exposure even if the field is unset until it's first read. (Possibly as small as increasing a lookup table, which might only happen when the Nth new field is added. I am not sure. But probably it's not zero-cost...?)

For instance, what happens in the case of calling produceCropTarget on a detached iframe for a HTMLElement which is not part of the detached iframe?

Could you please clarify how this edge-case would be problematic? The way I see it, if a given context can get a reference to the object in order to supply it as a parameter to produceCropTarget(), then that's all that matters.

Calling twice produceCropTarget on the same element is expected to produce the same CropTarget, which might be surprising given it is a method.

I don't think that's surprising. We can also reduce the (IMHO already low) surprise factor by renaming to getAsCropTarget or something similar. But I think "produce" is good enough.

using promises here is to accommodate a particular browser implementation

Politely disagree. It's done to afford all implementations the flexibility to implement efficiently and simply. If some implementations don't require this flexibility, I think it wouldn't harm them either - simply return the Promise pre-resolved.

youennf commented 2 years ago

How common is it for specs to plug additional attributes directly onto HTMLElement?

I don't really know, it probably depends how much HTMLElement is actually used in APIs. Infrastructure and implementations have prime support for that and some limited specs do that.

is it actually possible to expose a new field on all HTMLElements without incurring any cost?

I am not exactly sure which cost you are talking about. Is it that HTMLElement would get bigger in memory? If so, implementations have the flexibility to either use a new member in HTMLElement, a separate map<element, region> or any other strategy.

Could you please clarify how this edge-case would be problematic? The way I see it, if a given context can get a reference to the object in order to supply it as a parameter to produceCropTarget(), then that's all that matters.

It would need to be tested but if frame is detached, chances are the promise will reject. It does not make real sense to tie HTMLElement with an unrelated construct. For instance requestFullScreen is at Element level, which begs the additional question whether CropTarget should be Element or HTMLElement. Element might make more sense. I'll file a separate issue.

It's done to afford all implementations the flexibility to implement efficiently and simply.

There is a tradeoff with promises, we should use them when/if they are needed, which does not seem to apply here. Taking your previous example about potential race conditions, let's say we have A the capturee, B the capturer, C the entity doing the actual cropping, all living in 3 processes. Your plan seem to be:

A->C->A to register a surface identifier from a HTMLElement identifier (asynchronous)
A creates a region from surface identifier
A->B to send the capture region
B->C to ask for cropping (asynchronsous)
C->B to say cropping is enabled.

Step 1 & 2 can be replaced by:

A creates a region from HTMLElement identifier (synchronous) Then 5 will be replaced by 5a. C->A->C to identify surface identifier from HTMLELement identifier (B request identifies unambiguously A's process). 5b. C->B to say cropping is ongoing In practice, C->A->C can be optimised to not be needed.

eladalon1983 commented 2 years ago

There is a tradeoff with promises, we should use them when/if they are needed, which does not seem to apply here.

It makes it much easier for Chrome to deliver a more efficient implementation, without costing anything for other browser vendors (return a pre-resolved Promise) or Web developers (await cropTo). The tradeoff seems favorable to me.

let's say we have A the capturee, B the capturer, C the entity doing the actual cropping

Who is C in this scenario? If A in the document that holds the crop-target (div, iframe, etc.), then it will produce the CropTarget (the token) and send it to document B, who holds the track. B will call cropTo directly. I don't follow where C fits into the picture, except possibly as a conduit for messages which can be neglected¹. What am I missing?

-- [1] I'm using "neglect" here as one does in physics, e.g. when neglecting air resistance or friction.

youennf commented 2 years ago

without costing anything for other browser

A promise has a cost for web developers and for devices executing them. We should not overpromisify APIs.

Who is C in this scenario?

C is the process producing the screen frames and is the one that needs to identify the actual display layer the element refers to. My understanding is that this process is separated from A and B.

B will call cropTo directly.

cropTo is already an asynchronous operation and can handle the resolution of CropTarget.

eladalon1983 commented 2 years ago

(I am sure Youenn knows, but if others want to join the conversation, one line of background: In Chrome, applications run in a "render process", and there is a "browser process" coordinating them. Cross-origin applications run in different processes.)

Back to the produceCropTarget Promise. I think the implementation in Chrome, and its constraints, might have analogues in other browsers. Let's examine it.

A CropTarget is produced in one document (D1), then used by another document (D2) in the same browsing context.
There could be many more cross-origin documents in the browsing context, and many processes.
When D2 calls cropTo(X), the UA has to validate that X is a valid CropTarget.
It is undesirable to query all documents and check if any of them have produced X. Rather, the browser process holds a central repository of valid CropTargets and their associated browsing context. When cropTo() is called, this repository is consulted.
- If the CropTarget includes the origin, this can be simplified, but it still requires an RTT of IPCs, D1->BROWSER->D2->BROWSER->D1. An insufficient improvement.
That means that produceCropTarget inserts into that BP-based repository.
That also means that, if we don't impose a cap on number of CropTargets, an application can directly affect the memory consumption of the browser process. That is undesirable.
So there is a cap on the number of CropTargets a browsing context could have.
That means that when attempting to produce a CropTarget, there has to be IPC with the browser process before we know that we're in the happy case, and a CropTarget can be produced.
- Side-note: For security's sake, it's also more obvious that things are safe, if the underlying token for the CropTarget, is assigned by the trusted browser process, rather than communicated to it from the render process. For example, it becomes trivial to guarantee use a low-bit-count token behind the scenes, guarantee no collisions, and thus have a simple and efficient implementation in the rendering pipeline.
That means that produceCropTarget has to either return a Promise (which is resolved after IPC with the browser process), or it has to block on this IPC.
Blocking JS applications on IPC is something Chrome aims to avoid.

As mentioned, Chrome's implementation problems are Chrome's problems, but (a) I suspect other implementers will have similar issues and (b) the cost of resolving these issues is negligible. Returning a Promise gives implementers flexibility.

youennf commented 2 years ago

That means that when attempting to produce a CropTarget, there has to be IPC with the browser process

That is where we differ in the analysis.

Rather, the browser process holds a central repository of valid CropTargets and their associated browsing context.

The central repository is an implementation choice that is a potential source of issues (memory for instance as you pointed out), it seems it should be avoided if possible. If not possible, the spec should be clear about this. Maybe it should describe the central repository and how it is supposed to be used. WebLock is a good example of that and using promises there makes sense.

In our particular case, the spec does not define any such central repository. This central repository can be avoided by delaying the identification of the display surface to cropTo time. For that purpose, CropTarget needs to contain enough information, like uniquely identifying D1 (maybe using https://html.spec.whatwg.org/multipage/webappapis.html#concept-environment-id).

D2 will transmit this information to browser process at the time of cropping. Browser process may want to validate the cropping information by matching it with what it knows from the MediaStreamTrack source (like the list of the documents the source is capturing). Browser process might instead (or in addition to the previous validation) want to further validate or get additional information directly from D1's process, using asynchronous IPC. This is fine since this is part of cropTo which is an asynchronous operation.

The spec is light in how/when cropTo is supposed to validate crop targets, I'll file a separate issue. Also this issue was mostly about MediaDevices or Element and it is heavily going in Promise or not. Maybe we should have a specific issue for discussing whether promising or not.

eladalon1983 commented 2 years ago

A central repository is an implementation detail. The spec should not accidentally disallow reasonable implementations (hence the Promise). The spec is not concerned with such details beyond that point.

It's undesirable to put in the token more information than strictly necessary. By ensuring that CropTargets do not encode information about their source, we can provide better guarantees about the safety of the entire mechanism. (Please note - information not exposed to JS, but potentially available to a malicious application which has taken over the process, is also undesirable. Information not encoded in the render process cannot leaked to a malicious application.)

Also this issue was mostly about MediaDevices or Element and it is heavily going in Promise or not. Maybe we should have a specific issue for discussing whether promising or not.

If my current comment has not been sufficient to convince you, you're welcome to continue the discussion in a spin-off issue.

I wonder if anyone else (@jan-ivar, @aboba, @alvestrand) has an opinion to voice about exposing as MediaDevices.produceCropTarget() vs. exposing as Element.cropTarget().

youennf commented 2 years ago

you're welcome to continue the discussion in a spin-off issue.

I have done that in https://github.com/w3c/mediacapture-region/issues/17. My understanding is that the issues you are mentioning have been solved for other objects but there may well be something specific to CropTarget, let's find out in the new issue.

eladalon1983 commented 2 years ago

Thanks for spinning off.

Would the following be a good summary of the discussion that is remaining in this thread?

@eladalon1983 prefers exposing MediaDevices.produceCropTarget(Element), citing:
- Better encapsulation, as CropTarget production is media/RTC-related, as is MediaDevices; Element is very general.
- In contexts where MediaDevices is missing (eg HTTP), this API would not be exposed, which is preferable.
- If we agree (other thread) that we need a Promise, then a method is more natural than an attribute.
@youennf prefers exposing Element.cropTarget().
- (I don't want to write "citing X" as I do not feel I can do justice to your position this time. Please help me out here.)

youennf commented 2 years ago

Better encapsulation, as CropTarget production is media/RTC-related, as is MediaDevices; Element is very general.

requestFullScreen API attached to Element directly is well encapsulated through partial interface.

In contexts where MediaDevices is missing (eg HTTP), this API would not be exposed, which is preferable.

SecureContext can be applied to methods as well as interfaces so this seems orthogonal. With mixed context allowed, one could think of having a non secure context generate CropTarget for its SecureContext parent.

If we agree (other thread) that we need a Promise, then a method is more natural than an attribute.

Using promises is largely orthogonal in that case. The question is more whether for a given element, you get the same CropTarget object (current spec behavior) or not. With attributes, you get the same CropTarget calling get after get, this is an expected behaviour. With methods, you would get different promises but the same fulfilled JS object. Existing APIs that have this model tend to my knowledge to prefer attributes in that case (see FetchEvent, WHATWG Streams, WebAnimation as example).

eladalon1983 commented 2 years ago

Unless we've missed a good argument in either direction, it seems to me like the decision boils down to a matter of preference.

Wdyt of this characterization?
If that is correct, how do you suggest we resolve? Shall we ask the rest of the WG to voice their own preference?

youennf commented 2 years ago

Wdyt of this characterization?

I don't think it is solely a matter of preference. Using an attribute seems more consistent with existing web APIs, at least that is my current evaluation. We should try to reach consensus on validating/invalidating this evaluation. Also, using mediaDevices exhibits some suboptimal behaviour in at least one edge case (the detached iframe case as discussed before).

If that is correct, how do you suggest we resolve? Shall we ask the rest of the WG to voice their own preference?

Getting WG input is always useful, too late for next meeting though probably.

jan-ivar commented 2 years ago

@youennf prefers exposing Element.cropTarget().

@eladalon1983 AFAICT Youenn prefers Element.cropTarget, not Element.cropTarget().

I think he makes a good case for why an attribute is most idiomatic given the properties we want. However, there are a LOT of Elements compared to the rarity of needing a cropTarget from one — 0.005% of pageloads use getDisplayMedia and only a fraction of those will likely be interested cropping. So this gives me pause.

Presumably, the cropTarget would be instantiated in the getter if it doesn't exist, rather than at Element construction time, but this getter might be implicitly called by code libraries or even browser tools interrogating the Element (e.g. dumped to web console).

That said, there's precedent for rarely used features in Element.requestFullscreen() which is a method, so maybe Element.cropTarget() would be the right tradeoff?

cc @annevk Thoughts on this?

youennf commented 2 years ago

That said, there's precedent for rarely used features in Element.requestFullscreen() which is a method

Element.requestFullScreen() is rightfully a method, it returns a different promise every time it is called. Or are you thinking of some different APIs?

FWIW, a cropTarget when created should be nothing more than a weak reference to its element (when being transferred or when used in cropTo is when additional steps might be executed). Hence why I do not see accidental code libraries CropTarget creation as an issue.

eladalon1983 commented 2 years ago

FWIW, a cropTarget when created should be nothing more than a weak reference to its element (when being transferred or when used in cropTo is when additional steps might be executed). Hence why I do not see accidental code libraries CropTarget creation as an issue.

Triggering accidental lazy creation of CropTarget-s for all Elements in the DOM is patently undesirable. First, for the memory it consumes; that much should be self-evident. Second, for the side-effect; elaboration follows.

Let's define as "active" a CropTarget which is being used - that is, there is some track which was cropTo()-ed with that CropTarget. An "active" CropTarget has to be tracked along the rendering pipeline in every frame, which requires some work from the UA. Unless cropTo(X) has been called on some X, it is not strictly necessary to perform this work of tracking X's location in every frame. But avoiding the work until cropTo() is called, means a longer delay until the first cropped frame is delivered (and hence the cropTo Promise resolved). For that reason, it is reasonable (but not required) to implement CropTarget as immediately "active" - as soon as it's produced, before cropTo is called. This makes cropTo resolve its Promise faster, but it also means that minting the token has a non-zero cost. Even if we expose on Element, we should expose as a method, so as to discourage unintended production of a CropTarget for every single Element.

youennf commented 2 years ago

It would be interesting to understand whether libraries actually go through all elements and get all their attributes. I suspect that this would be a perf issue w/o CropTarget.

Thinking a bit more though, I don't see why the spec mandates to expose a single CropTarget per Element. It makes implementations a bit harder without any real benefit. Maybe this is a left over from the CropID initial version? Removing this constraint from the spec and making cropTarget a method (probably renaming it to getXYZ as well) seems like a good idea to me.

eladalon1983 commented 2 years ago

Thinking a bit more though, I don't see why the spec mandates to expose a single CropTarget per Element. It makes implementations a bit harder without any real benefit. Maybe this is a left over from the CropID initial version? Removing this constraint from the spec and making cropTarget a method (probably renaming it to getXYZ as well) seems like a good idea to me.

Could you please explain why this is a problem for implementers?

annevk commented 2 years ago

Generally the "iterate over all members" pattern affects Window, Navigator, and Document. I wouldn't expect it to matter for Element. (Note that the syntax you all are using is rather confusing as it seems you are talking about an instance getter/method whereas it very much appears like a static getter/method.)

alvestrand commented 2 years ago

I don't see a big advantage in exposing produceCropTarget() on a plethora of objects. Having the function be in one place makes much more sense to my impression of understandability.

eladalon1983 commented 2 years ago

Thinking a bit more though, I don't see why the spec mandates to expose a single CropTarget per Element. It makes implementations a bit harder without any real benefit. Maybe this is a left over from the CropID initial version? Removing this constraint from the spec and making cropTarget a method (probably renaming it to getXYZ as well) seems like a good idea to me.

Could you please explain why this is a problem for implementers?

To document some out-of-band discussions, I am OK with the spec mandating returning an equivalent CropTarget, which is a less stringent requirement than "the same" CropTarget.

youennf commented 2 years ago

I don't see a big advantage in exposing produceCropTarget() on a plethora of objects.

It is not on a plethora of objects, it is only either in MediaDevices prototype or HTMLElement prototype.

We discussed with @eladalon1983 some of the reasons why element was a better location during last editor's meeting. At that time, I felt there was some consensus towards element. Some reasons below:

MediaDevices is SecureContext, Element is not. It seems ok for a non secure document to be able to create a CropTarget. This might be handy in the future or in mixed contexts today. A context that does not need MediaDevices might still find it useful to create and transfer CropTargets.
MediaDevices is tied to its navigator hence to a specific Document. Elements can be transferred from one document to another. If we were to use MediaDevices, we would need to handle the case of creating CropTargets for elements which are not tied to the same document as the MediaDevices instances. There does not seem to be any reason to try going in those edge cases, that can get even weirder if using promises.
The MediaDevices instance is not bringing anything to the creation of CropTarget, contrary to getUserMedia et al. A static MediaDevices.produceCropTarget would work equally well. Why trying to tie the algorithm to an object that is unused. In programming language, we usually try to remove unused parameters.
Feature detection is more easy if tied to the element than with MediaDevices: 'cropTarget' in Element.prototype in one case. With MediaDevices, you would need to actually call the produceCropTarget API to get the same result.
There is no real difference between the two versions in terms of documentation/implementation, the separation of concerns is easy to do in both cases. Given we have partial interfaces, we would have Element+CropTarget.idl for instance.

eladalon1983 commented 2 years ago

It is not on a plethora of objects, it is only either in MediaDevices prototype or HTMLElement prototype.

I think what @alvestrand was referring to, is that (i) this would now be exposed on many different sub-types of HTMLElement, as well (ii) exposed on multiple instances. If that's what he meant, I agree.

MediaDevices is SecureContext, Element is not. It seems ok for a non secure document to be able to create a CropTarget. This might be handy in the future or in mixed contexts today. A context that does not need MediaDevices might still find it useful to create and transfer CropTargets.

That was an interesting case, but unclear to me how important it is to keep supporting non-secure contexts. Would they be able to postMessage the CropTarget, for instance? Would it be possible to trust the message in which they do so?

Elements can be transferred from one document to another.

What's the mechanism for that?

If we were to use MediaDevices, we would need to handle the case of creating CropTargets for elements which are not tied to the same document as the MediaDevices instances.

I don't think there has to be a connection between where the CropTarget-minting-API is exposed, to where the elements are, let alone where the CropTargets are.

A static MediaDevices.produceCropTarget would work equally well.

Could you clarify what you mean here? I couldn't parse this in a manner consistent with the rest of the message. I mean, if there is a way to define produceCropTarget as exposed on the class MediaDevices, but not tied to any object, then that's fine by me...

With MediaDevices, you would need to actually call the produceCropTarget API to get the same result.

Why would I need to call the function? Why can't I just check for its existence using !!navigator.mediaDevices.produceCropId?

lghall commented 2 years ago

I represent a team at Google that has been using the Region Capture API internally. We prefer mediaDevices.produceCropId (vs Element.produceCropId()) because it's easier to feature-detect whether the API is available without needing an Element to detect the method presence. Hope that helps!

youennf commented 2 years ago

Could you clarify what you mean here?

Which property of the mediaDevices instance are you using in the CropTarget production algorithm, except for its environment: none. You could easily change it to something like target = await MediaDevice.produceCropId(element).

Why would I need to call the function? Why can't I just check for its existence using !!navigator.mediaDevices.produceCropId?

I was not clear in my previous post: fine grained feature detection can be used to determine whether there is support for Element CropTarget or only HTMLElement CropTarget (say some browsers ship HTMLElement and add Element support as a follow-up). If we use an Element/HTMLElement method, you just have to do !!Element.prototype.XYZ/HTMLElement.prototype.XYZ, this is super easy. With mediaDevices.produceCropTarget, you need to call navigator.mediaDevices.produceCropTarget with a value that is an Element (and not an HTMLElement) to check whether it rejects and probably check the error is TypeError as well. This is not great.

We prefer mediaDevices.produceCropId (vs Element.produceCropId()) because it's easier to feature-detect whether the API is available without needing an Element to detect the method presence. Hope that helps!

In terms of feature detection, an Element/HTMLElement method is better and easier, as shown above. There is no need to create an element, you can just do Element.prototype.XYZ.

eladalon1983 commented 2 years ago

say some browsers ship HTMLElement and add Element support as a follow-up

Chrome has already agreed to Safari's suggestion of exposing on Element. So I don't think exposing on anything but Element is a likely outcome. Or have you changed your mind on that?

!!Element.prototype.XYZ

I think it's unlikely that any browser implementation would expose individually for various sub-types of Element, given how many of them there are. (Also, are we planning to avoid supporting/exposing for any sub-type? Chrome is not.)

you need to call navigator.mediaDevices.produceCropTarget with a value that is an Element (and not an HTMLElement) to check whether it rejects

I am not aware of a plan to reject this sub-type or that. Feature-detection is global, regardless of the Element type an application expects to use. Even if we do expose this API on Element, what I expect is !!Element.method, and not !!ElementVerySpecificSubtype.method.

--

To summarize

I think your arguments here are:
- If exposing on Element, then non-conformant implementations would be less problematic, because a sophisticated application could, if it is aware of non-conformance, check for exposure on the specific sub-type it intends to call produceCropTarget() on.
My arguments are:
- Robustness to non-conformance is an argument, but it's not always a very strong argument. Are you aware of upcoming non-conformant implementations? Ones that don't work for any given Element, that is.
- My original argument still stands - in favor of exposing an API adjacently to other APIs it interacts with, that is.

jan-ivar commented 2 years ago

We still need consensus on API shape for cropping capture of own viewport (tab) to an element, so let me try this thought experiment, and solicit broader API design input. cc @annevk , @domenic

What's the simplest API for web developers? We agree in https://github.com/w3c/mediacapture-region/issues/19 that when the element is in the same document, the answer is:

const element = document.getElementById("foo");
await track.cropTo(element);

So that's the ideal API. What's getting in the way of that: the element to crop to may be in a different-origin document, and we can't postMessage element, yet nothing short of the element will do (since the element may visually move):

// b.com/iframe.html
const element = document.getElementById("foo");
parent.postMessage(element, "*"); // won't work!

// a.com/main.html
window.onmessage = async ({data: element}) => await track.cropTo(element);

We've agreed the solution is to postMessage a weak reference to the element instead. I think the simplest API for that is:

// b.com/iframe.html
const element = document.getElementById("foo");
parent.postMessage(element.weakRef, "*");

// a.com/main.html
window.onmessage = async ({data: elementWeakRef}) => await track.cropTo(elementWeakRef);

...but instead we have:

// b.com/iframe.html
const element = document.getElementById("foo");
parent.postMessage(await navigator.mediaDevices.produceCropTarget(element), "*");

// a.com/main.html
window.onmessage = async ({data: cropTarget}) => await track.cropTo(cropTarget);

...which seems more complex than it needs to be:

An irrelevant object is involved (mediaDevices)
An asynchronous method is used where a synchronous attribute should suffice w3c/mediacapture-screen-share#17
Naming of a potentially useful weak reference concept is (too?) narrowly scoped to the use case.

Thoughts?

domenic commented 2 years ago

(Kudos to @jan-ivar for a very nice summary of the issue so that I didn't have to read the whole thread after being pinged. Or at least, his summary convinced me I don't; hopefully that was correct ^_^.)

So to be clear, by "weak reference" here you mean something that has the characteristics:

Uniquely refers to the element in question across process boundaries
Does not retain a reference to the element across process boundaries

Back when I was last involved in discussing this API, I think I suggested a weak reference-type solution specifically based on using the string element.id. This mostly meets the above properties, under the assumption that developers aren't creating multiple elements in a single page with the same ID. (This is a fair assumption IMO; if developers do the bad thing, then just pick the first such element.) It also has the great benefit of already existing.

Note that it obtains the second property via the tuple (event.source, element.id), not just via element.id alone, so I guess you'd have to pass that into cropTo().

alvestrand commented 2 years ago

Using element.id would requre the capturer and the capturee to share information about event.source, which it otherwise does not need to know (as far as I can tell).

It also exposes information enough to access the element using any API that can use element.id, including the normal element-finding functions, to any scope that can access the context by Javascript (extensions?)

That's what we get from reusing an existing ID - is this an advantage or a disadvantage?

youennf commented 2 years ago

creating multiple elements in a single page with the same ID.

What about a single page with multiple third party iframes?

Note that it obtains the second property via the tuple (event.source, element.id)

I am not sure how event.source would actually work in all cases, for instance if source is a MessagePort, or if there is no source (post message to a dedicated worker) or in case of double post messaging.

To have full support, we would need to introduce a serialisable context weak reference and pass it along the ID. This does not seem too bad to me, especially if the context weak reference might have some use in other contexts.

Another thing to consider is that the ID can refer to one element at a given point in time and to another one later on. This might make dynamic cropping easier for web developers but UAs would need to be prepared for that.

is this an advantage or a disadvantage?

The current API does not allow to go back from element weak reference to element, which is okayish (application can pass along application specific data for that purpose) but is a bit sad conceptually. Adding that ability would be nice.

domenic commented 2 years ago

Using element.id would requre the capturer and the capturee to share information about event.source, which it otherwise does not need to know (as far as I can tell).

I mean, the information already exists and is passed anyway. That seems OK.

What about a single page with multiple third party iframes?

The ID is scoped to a single frame, so iframes do not interfere.

I am not sure how event.source would actually work in all cases, for instance if source is a MessagePort, or if there is no source (post message to a dedicated worker) or in case of double post messaging.

Yes, this is a fair point. You could make MessagePort work if the other side is a Window, and just say that if you get a message from a DedicatedWorker, or if the message passes through multiple windows, it doesn't work. But this starts to make it all less elegant.

As you say, to solve the problem in full generality you need some sort of serializable weak reference (or ID) for the original window itself. So if that kind of full generality is the goal then using a (new window weak reference concept, element's ID) tuple is redundant, and you could just do (new element+window weak reference concept).

On the question of "ID" vs. "weak reference": I think the distinguishing factor is that an ID is a string whereas a weak reference is some sort of opaque object or handle. Probably weak reference is the best default in general. It's just nice that elements already have a string ID so it's worth reusing that if possible.

My summary:

(Window/MessagePort-from-Window object, element ID string): does not work in full generality, but is nice and simple and reuses existing primitives
element+window weak reference opaque object: works in full generality. (This is basically @jan-ivar's proposal IIUC.)
Other options, e.g. (window weak reference, element ID string): IMO this is extra complexity for little gain, but it sounds like other people might be interested in some of the things the ID string could enable.

youennf commented 2 years ago

I would vote for a solution that works in full generality. I am ok with both options (element weak reference or element ID + window weak reference). @domenic, in terms of API shape, taking @jan-ivar proposal, do you think it makes sense for the following things?

the API to get the weak reference be synchronous
the API to be exposed in Element (and not MediaDevices)

domenic commented 2 years ago

Not really. It seems like absent non-media use cases and consumers, giving it a media-specific entrypoint makes more sense. And it seems reasonable that the process of generating a cross-process-friendly weak element reference might involve async machinery.

jan-ivar commented 2 years ago

(Window/MessagePort-from-Window object, element ID string): does not work in full generality, but is nice and simple and reuses existing primitives

Thanks @domenic, this is my winner. Avoids a new API & gets the id from iframe B to iframe A, which is all we need. KISS

// a.com/main.html
const {port1, port2} = new MessageChannel();
iframeA.contentWindow.postMessage(port1, '*', [port1]);
iframeB.contentWindow.postMessage(port2, '*', [port2]);

// b.com/iframeB.html
const element = document.getElementById("foo");
const {data: port} = await new Promise(r => window.onmessage = r);
port.postMessage(element.id);

// a.com/iframeA.html
const {data: port} = await new Promise(r => window.onmessage = r);
port.onmessage = async e => await track.cropTo(e.data.id, {source: e.source});

postMessage here is a means to an end, so solving full generality feels like scope-creep.

Another thing to consider is that the ID can refer to one element at a given point in time and to another one later on. This might make dynamic cropping easier ...

@youennf a live id feels like feature creep. Having await track.cropTo(id) resolve to a specific element seems POLA.

alvestrand commented 2 years ago

As I've iterated before, I object to the excessive generality of posting element.id between contexts.

I feel that the security, privacy and implementation considerations of this approach are wide-ranging and unexplored, and that fully evaluating how such an API can be used (or abused) is far beyond the scope of the current proposal.

jan-ivar commented 2 years ago

As I've iterated before, I object to the excessive generality of posting element.id between contexts.

JS can already post element ids between contexts, and is not up for question here, so objecting to that seems out of scope.

security, privacy and implementation considerations of ... such an API ...

@alvestrand Element id is an existing API, so I assume you mean cropTo here? Video cropping (whether from screen-share or camera) is already possible in the platform. So what exactly are the security and privacy considerations of more reliable cropping to an element?

eladalon1983 commented 2 years ago

JS can already post element ids between contexts, and is not up for question here

First, Element IDs are guessable.

Second, developers do not necessarily wish to expose the Element ID. (And randomizing an ID goes against the spirit of simplifying things for them.)

Third, that Element IDs and other strings can be posted in a message is not the issue.

Introducing an API that operate on a cross-document X, compels Web developers to post X cross-document. This is done with the implicit assumption that X can only be used cross-document for a specific set of mechanisms. An X that is generic by design is inherently suspect. A Web developer cannot safely post that, for they do not know what they are effectively permitting one year down the line. Note that it is NOT necessary for the future API be abusive or defective in any way; see further below.

It does not matter if X is a "weak reference" or "element.id + source" or anything else. If it is generic, it acts as a permission-token to allow remote execution of unknown type.

As a concrete albeit absurd example, what if a future mechanism is introduced, that allows a document to set hidden = true on a weak-ref? Did the Web developer posting X intend to give other documents permission to toggle that state? No. And why should a developer posting X in the future, wishing to permit toggling hidden, also effectively allow cropping? Did they intent that? No.

One may protest that my example is too absurd. Good! I'd welcome that. I'd then remind that, despite my repeated requests, no hypothetical future mechanism was presented, that would benefit of this generic cross-document-identification mechanism. So let's please examine such a concrete example. Barring that, I call the wisdom of this discussion into question.

Fourth, this discussion sweeps under the rug our long-running disagreement over produceCropTarget() being asynchronous.

domenic commented 2 years ago

I'm extremely surprised by these objections to using element IDs. I think using them for this purpose makes perfect sense and is the kind of thing they were designed for. Randomizing IDs is not against their spirit. Concrete absurd future hypothetical examples don't have any bearing on the actual existing element.id primitive. There are no security and privacy considerations for element.id; it's just a string.

eladalon1983 commented 2 years ago

I think using them for this purpose makes perfect sense and is the kind of thing they were designed for.

I am not aware of any other API that accepts the ID of an element from another document. Could you name one? Is that a common pattern that proves that element IDs were indeed designed for this?

Concrete absurd future hypothetical examples don't have any bearing on the actual existing element.id primitive. There are no security and privacy considerations for element.id; it's just a string.

(To simplify things, let's call those who argue here for the use of a general-purpose token The Generics, and those who argue for a single-purpose tokens The Singletons.)

First, I have used an absurd example precisely because the Generics have not yet provided an example of a future API that would benefit from a generic token. The burden of producing such an API is on them.

Second, assuming the Generics can produce such a hypothetical API, they should then also explain why it's desirable to use a generic token. Assuming there are multiple APIs that are unblocked by that token, the application loses the ability to only unblock one when it posts such a token. How is that preferable? (Hard to answer without a concrete example of an API though... Which the Generics should produce.)

jan-ivar commented 2 years ago

My assessment is that a mechanism to identify an element in another document already exists, thus the onus is on those who don't want to use it, to explain what a new API would add above and beyond the existing mechanism.

youennf commented 2 years ago

Here are some pros/cons for IDs. Pros:

No new API needed for some scenarios: just send the element ID (or capturer might know the ID in advance).

Cons:

To identify the remote document, relying on WindowProxy is good, relying on MessagePort is brittle/subject to race conditions: it can be transferred or closed.
The general case is not supported.

Things to further evaluate:

Do we need to support the general case? If so a new API might be needed, either now or as a folllow-up.
An element ID can change. Either capture should follow ID changes or we are ok living with potential races and inconsistencies (cropTo with the same ID on two cloned tracks leading to different frames).
Security implications of allowing an application that captures an iframe to crop without the iframe help. It is unclear to me what the potential issues are.

eladalon1983 commented 2 years ago

My assessment is that a mechanism to identify an element in another document already exists

@jan-ivar, you have still not shown any existing API using this purported pre-existing mechanism, so it's not at all clear to me that it exists. You have many more years of experience than me, so it's quite likely that you know something I don't here. So please do share.

You have also not mentioned a hypothetical new API that would use it. I am waiting for that too.

youennf commented 2 years ago

The most important thing to evaluate for me is whether we want to support the general case, in particular double postMessage scenarios (say I want to crop the track in a worker). It can be done by passing a MessagePort in the same message as the element ID or some new API seems needed.

And it seems reasonable that the process of generating a cross-process-friendly weak element reference might involve async machinery.

@domenic, I am surprised here. MessagePort is a cross-process-friendly reference that is created/transferable synchronously. This can solve for instance the double postMessage scenario in a synchronous manner.

youennf commented 2 years ago

@jan-ivar, you have still not shown any existing API using this purported pre-existing mechanism, so it's not at all clear to me that it exists.

@eladalon1983, I am unclear what you are asking for. Element.id is a thing, postMessaging a string does exist. Web applications probably exchange Element.id strings (it should be easy to write a jsfiddle). Which API are you not clear about?

alvestrand commented 2 years ago

The "no new API" method suggested in https://github.com/w3c/mediacapture-region/issues/11#issuecomment-1123235239 (allowing the capturer to CropTo() to an ID that has not previously been sent to it) will allow a capturer to explore the inner structure of the captured page, without the cooperation of the page (by attempting CropTo() to IDs that are known to be in use by the captured "victim" application and seeing if it succeeds). This may reveal information that is otherwise not available to the capturer.

This is the kind of side effect without a security evaluation that I've been warning about when arguing in favor of a new ID scheme.

eladalon1983 commented 2 years ago

@jan-ivar, you have still not shown any existing API using this purported pre-existing mechanism, so it's not at all clear to me that it exists.

@eladalon1983, I am unclear what you are asking for. Element.id is a thing, postMessaging a string does exist. Web applications probably exchange Element.id strings (it should be easy to write a jsfiddle). Which API are you not clear about?

@youennf, note that the word "probably" appears in a key location in your message. And you're probably right - about APIs that applications expose and consume. But can anyone name an API that the browser exposes, that allows one document to affect action based on another document's Element IDs?

youennf commented 2 years ago

@eladalon1983, I am not sure what you are trying to point at. In any case, one scenario would be to use getElementById, something like: postMessage the ID from frame A to frame B. Then, when frame B wants to get info on the element (say its CSS class, its name...), frame B postMessages to frame A with the ID, frame A calls getElementById, gets the necessary bit of information and postMessages it back to frame B.

This is the kind of side effect without a security evaluation that I've been warning about when arguing in favor of a new ID scheme.

Right, I think this is the kind of things that would be good to validate. Note though that element.id is not sufficient, we need a reference to the document. In our case, the iframe was ok with being embedded into the page, maybe this is a good enough restriction? I guess we could enforce cropTo to take a combo Element.id + https://html.spec.whatwg.org/multipage/webappapis.html#concept-environment-id (already exposed in some APIs like service workers or web locks), so that guessing the Element.id is not sufficient, the iframe would have to explicitly transmit its own env ID. That would also solve the general case, as well as @eladalon1983 CropTarget stringification issue. That said, the first thing would be to identify what we are trying to protect from, which I am still not very clear about.

jan-ivar commented 2 years ago

... will allow a capturer to explore the inner structure of the captured page

Thanks @alvestrand for explaining. Being able to deduce information never intended for the screen seems at best surprising both to end-users and web developers. Since id strings in theory can containing anything, this information could be part-guessable yet valuable I suppose, though in practice I think they only reveal the "structure", as you say. A valid item for the Cons column, I think.

At the same time, perhaps we should not give pages whose entire rendering output is captured a false sense of security — when is its structure not betrayed by its visuals? You mentioned an attacker would start from some "known" information about the victim site, so if this exploit can be polyfilled ontop of existing APIs, I'm not sure we have a (new) problem.

Also, with getViewportMedia (the long-term API here), we're only talking about pages that have opted into capture with Document-Policy: viewport-capture. Id-based cropping seems fine here.

In our case, the iframe was ok with being embedded into the page, maybe this is a good enough restriction?

With getDisplayMedia, today's "self-capture" is unsafe and can hopefully be deprecated eventually https://github.com/w3c/mediacapture-screen-share-extensions/issues/9. End-users are already taking much greater security risks here, by allowing capture of cross-origin rendering to sites that are neither site-isolated nor restricted to opt-in-only iframes. Users make faulty assumptions about those risks today: They don't know attackers can coax victim-iframes to render new information they'd never expect. So expectations about what can and can't be exposed seem meaningless here.

I'd like to get to a place where previous mistakes no longer inform new APIs.

eladalon1983 commented 2 years ago

I am not sure what you are trying to point at. In any case, one scenario would be to use getElementById, something like: postMessage the ID from frame A to frame B. Then, when frame B wants to get info on the element (say its CSS class, its name...), frame B postMessages to frame A with the ID, frame A calls getElementById, gets the necessary bit of information and postMessages it back to frame B.

There's a big difference between these two scenarios:

A document D1 exposes an ID and receives back a message reading {PLEASE: "invokeFunc4", param: id}. At this point, D1 gets to decide whether it wants invokeFunc4() on the element.
A document D1 exposes an ID to D2. This ID might be further shared with D3. Various browser-level APIs may now be invoked on the corresponding element, initiated by either D2 or D3 or any other document. Furthermore, it's unclear if so much as an event would be fired for D1 when that happens. Furthermore x2, the set of browser-exposed APIs that can be invoked on that D2 and D3 could invoke on D1's element - it is unclear if that set of browser-exposed APIs could change.

So I ask:

Is there a precedent for the second?
If not, why are we so eager to set this precedent? Do we have a second API in mind? Who will benefit?

w3c / mediacapture-region

Why expose produceCropTarget at MediaDevices level? #11