w3c / mediacapture-screen-share-extensions

Other
1 stars 0 forks source link

Tab capture control #13

Open youennf opened 6 days ago

youennf commented 6 days ago

This issue is related to the https://screen-share.github.io/captured-surface-control/ proposal, based on my reading of the spec and experimenting with https://captured-surface-control.glitch.me/ to control google slides and google map.

First, the use case is fine. The current state of the prototype and spec do not seem sufficient for a consistent user experience though. This somehow casts doubts on the approach (or at least we should understand what needs to happen next to make the user experience good).

The spec/implementation is focusing on specific inputs (zoom and scrolling). Focusing on a more general principle may be beneficial. Some thoughts that I hope can help:

A few additional thoughts:

Based on this, I would look at an API along those lines:

partial interface HTMLVideoElement {
    attribute boolean enableGestureForwarding;
};

And maybe, a secondary optional API to allow web page to know what is going on:

partial interface HTMLVideoElement {
    readonly attribute boolean gestureForwarding;
    attribute EventHandler ongestureforwardingchange;
};

This kind of API shape adds some flexibility in how much UA wants to forward or not user gesture (say user enables forwarding and in the middle of the call disables it) and unties the API from permissions.

youennf commented 5 days ago

@eladalon1983 @jan-ivar, this above issue should complete the action item I took during the WG's meeting. I guess it is ok to split this issue or move some of the discussion elsewhere (I can see some relevant discussion happening in the screen share CG GitHub repo). I'd prefer the discussion to happen in this repo though if possible/useful.

eladalon1983 commented 5 days ago

Thank you for this in-depth feedback!

  • web pages want to opt-in to a mode where the captured preview could be interacted sort of like an embedded iframe but there are security principles that should restrict the level of interaction that is possible.

Just a quick mention that this interpretation might be correct for scroll-forwarding, but is incorrect for zoom-controls. More on this below.

  • One principle could be that UA is reponsible to forward some of the user gestures from the preview to the captured web page. The decision heuristics may be UA specific.

I am supportive of empowering user agents to employ heuristics if they wish; if you want explicit text in the spec, I can add it. But we should retain an explicit way for applications to request the behaviors introduced in this specification. I cannot imagine a perfect heuristic; they are always liable to occasionally (a) fire when undesired and (b) fail to fire when desired.

  • The current design seems to introduce one API per gesture. It is not clear to me that this level of flexibility is beneficial, this is worth discussing.

If you want to reshape the API as follows, for future-proofing...

dictionary ForwardedGestures {
  boolean wheel = false;
  // Future-proof for pinch etc.
};

partial interface CaptureController {
  Promise<undefined> forwardGestures(
      HTMLElement element,
      optional ForwardedGestures gestures = {});
};

...then I am supportive. Wdys?

(Spoiler alert - this shape also solves other issues you have pointed out, and which I address further below.)

  • I could see pinching (missing for google map)

Currently, not requested by Web developers, so not a priority for me. But if you want to add that, I won't oppose. Please see the comment I have left in the WebIDL above.

  • and keyboard zoom commands (missing for google slides)

I am not familiar with "keyboard zoom commands". I am, however, familiar with previous/next and page-down/page-up. These could be discussed as possible later extension; at the moment, I am hesitant about the security properties here. Let's start small.

  • maybe keyboard arrows as well

These make me very nervous. Let's not have this in the MVP.

  • Clicks and other events might be useful in the long run but are definitely risky security wise.

I am absolutely opposed to forwarding clicks. Such actions require a completely different model - one which I have presented in the past ("Video Portal"). But it is not in scope for us right now, and it's not mutually-exclusive with the current model. (That is, even if the Video Portal model were alive right now, it'd only serve some applications, and I'd argue for the introduction of Captured Surface Control APIs for other types of applications.)

  • Maybe some opt-in from the captured application could solve that issue

Any model that requires opt-in from the captured app, does not solve the problem, because the majority of web pages will not opt-in, and therefore the user will not be properly served.

  • It would be nice for capturer to state its interest to forward user gestures, UA can do this even without capturer asking for it.
  • If so, there should be a way for capturer to state that it is no longer interested to forward user gestures.

If you want the spec to also explicitly say that user agents MAY offer gesture-forwarding even when the application does not opt-in, I am happy to add it. Is this the case?

  • The feature is currently gated by a specific prompt/permission policy. The policy seems too much, and we should investigate whether a prompt/dedicated permission is what we want or whether UA heuristics could be good enough. This might influence API shape.

Chrome Security wanted a permission policy, and I agree with their reasoning. However, user agents that wish to impute permission without a prompt, may do so while remaining trivially compliant with the spec. If you'd like the Captured Surface Control spec to state as much explicitly, then I'd gladly add that. Should I?

  • There is most probably an intent that these events do not allow a web page to know that it is being captured. We should discuss this. This probably means there is something missing when capturing a tab. For instance a captured tab visibility should probably be visible even when it is hidden. I do not see that in the screen share spec, this might be worth filing an issue on the screen share spec.

Sorry, no, there is no such intent at the moment.

  • I would leave setZoomLevel on the side for now until we understand what it solves that forwarding user gesture approach cannot.

Web developers have asked for the ability to control zoom, and I think anyone who ever shared a tab during a video call, can immediately understand what this solves.

I am not aware of any gesture that supports this behavior. As I have shown here, pinch-controls are NOT an alternative.

Also, even if pinching were identical to zoom - which it is not - we'd still need to serve users without touchscreens, and applications that want to show zoom in/out buttons.

Further yet, the read-access provided by getZoomLevel() is useful for applications that want to show the current zoom level to the user - and that really helps the Web application communicate to the user what the zoom in/out buttons do. (See mock in the explainer, and implementation by Meet.)

Zoom-control is totally in-scope for me, and irreplaceable by pinch-controls.

  • captureWheel(HTMLMediaElement e) is probably not right. captureWheel(HTMLMediaElement? e) might be better to allow unsetting (seems like a MUST have). And probably captureWheel(HTMLVideoElement? v) might be even better from a type perspective.

~Although I disagree, and although I'd prefer to retain HTMLElement as the type here, I want us to reach a mutually-satisfactory compromise. Let's assume for now that I have changed forwardGestures() to deal with HTMLVideoElements; I will think about it for a bit and then make the change if it still seems fine.~ Edit, 21-10-2024: Web developers I have checked with say they need this flexibility to forward wheel events from an overlay draped on top of the video element. If we were to introduce this limitation, we would make their lives easier; however, we would not actually prevent anything, as they could always overlay yet another invisible video element on top of everything else and forward from there. Let's not introduce limitations if they don't serve a purpose.

  • The restriction to one video element is a bit artificial. I am not sure why we cannot allow a web page to show the same captured media in two elements (one cropped and the other one uncropped for instance) and allow both of them to forward user gestures.

Fine by me, although I am not aware of any Web developers requesting this. The API shape I have proposed earlier in this comment allows this neatly, because developers can do the following:

  // Start for e1 and e2.
  controller.forwardGestures(e1, {wheel: true});
  controller.forwardGestures(e2, {wheel: true});

  // Stop e1 only.
  controller.forwardGestures(e1, {wheel: false});

(You will note that I have made forwardGestures() accept HTMLElement rather than HTMLElement?. This is the reason.)

Based on this, I would look at an API along those lines: ... enableGestureForwarding ...

Thank you for this concrete suggestion. Humbly, I disagree with it. :-) I see a few issues, but I think it's enough to name just one - it does not allow for a permission policy, for those user agents that wish to include one.


Thanks again for this in-depth feedback. In summary, I believe that:

  1. The change to forwardGestures() addresses all issues raised and provides the desired proprties. (Future-proof for additional gestures, supports multiple target-elements, etc.)
  2. The spec allows different security and usability philosophies by different browsers in an interoperable manner. (Permission policy do not mandate a prompt, we make an allowance for UA-driven heuristics, etc.)
  3. It has been demonstrated that zoom-control is distinct from gesture-forwarding, and especially, that it cannot be sufficiently emulated using pinch-forwarding.
  4. Both read-access and write-access over the zoom-level of the captured surface is necessary.

What do you say?

guidou commented 5 days ago

captureWheel(HTMLMediaElement e) is probably not right. captureWheel(HTMLMediaElement? e) might be better to allow unsetting (seems like a MUST have).

That is an error in the IDL of the spec, it is indeed intended to be nullable and passing null stops the forwarding. The text of the spec assumes the parameter is nullable.

However, the spec allows HTMLElement?, not just HTMLMediaElement?. The rationale is that developers have expressed that supporting a div that internally contains the media element and other things is important. Due to the use of complex libraries being able to support divs has been a requirement. Perhaps we can require that the element we use for forwarding has a descendant media element.

youennf commented 5 days ago

I understand the urge to quickly go to API but I think it is a bit premature. There are some design decisions we should take consciously:

I would suggest splitting these in sub issues that the WG will discuss in this repo.

eladalon1983 commented 5 days ago

I understand the urge to quickly go to API but I think it is a bit premature.

Your previous comment raised specific concerns, and I engaged with them, providing what I believe to be real solutions to real concerns. It would be nice if you could indicate which of these issues were satisfactorily resolved for you.

There are some design decisions we should take consciously:

The length of the previous comment indicated to me that it was an exhaustive list. I now hear that it wasn't. Is the current list exhaustive, or do you anticipate additional issues?

Allow/disallow web pages to select specific types of events and not others

No Web developer has yet requested this. If they do, we can add this capability.

Understand how web applications will know whether forwarding is working or not

If the promise returned by forwardGestures() is resolved, forwarding is working, until the capture stops. I don't see any missing information...?

(my experience so far on Chrome Canari shows this is a real issue)

If you have any issues with Chrome, your bug reports would be most appreciated. If you have issues with the demo itself, I'd love to fix it - just let me know!

Whether this is a capture controller thing or a MediaStreamTrack thing (difference in transferability)

To me, it's obviously a CaptureController thing. Can you indicate requests from Web developers for this to be used in some manner that could be addressed with MediaStreamTrack but not with CaptureController?

Use permissions model or not (this is JS observable so we cannot have one UA do it and not another one)

I believe it is permissible for user agents to differ in their implementation of permissions policies, including prompts. On Chrome, sometimes we get one-time permissions, and that too is JS observable. If there is a problem here, please explain it to me.

Decide whether setting a specific zoom level should be part of the MVP

It is requested by Web developers. Why should it not be part of the MVP?

(AFAIK, apps like browsers tend to provide easier access to increment zoom in/out)

Such controls are in the captured app. The entire point of this API is to provide controls in the capturing app, without forcing the user to switch to the captured app.

Explore Command/+ and Command/- and pinch for zoom

Easily extensible at a later time. No need to delay the MVP on such extensions, which were NOT requested by Web developers. (But please feel free to send a PR if you are interested.)

Explore what could be translated to native surfaces and what could not be

Any reason for this to be in the MVP?

Decide whether we should guarantee or not care about captured tab knowing it is captured

Doesn't sound like a blocking issue to me. I'm fine either way.


I would really appreciate it if you could:

  1. Indicate which prior issues are resolved from your point of view, given the answers I have previously given, and given the changes I have made in response to your earlier feedback.
  2. Indicate which issues are consensus-blocking and which are not.
jan-ivar commented 5 days ago

Lots to respond to. But first, my impression of https://captured-surface-control.glitch.me/

  1. Scrolling is a mighty cool feature. Nicely done!
  2. The Click again to grant permission button seems confusing, brings up no prompt, and feels unnecessary, and it's never clarified what is given up (discussed in https://github.com/screen-share/captured-surface-control/issues/27)
  3. The zoom controls seem easily fixed: image
eladalon1983 commented 5 days ago
  1. Scrolling is a mighty cool feature. Nicely done!

Thanks!

  1. The Click again to grant permission button seems confusing, brings up no prompt, and feels unnecessary, and it's never clarified what is given up (discussed in https://github.com/screen-share/captured-surface-control/issues/27)

The demo is merely for illustration. For a real use example directly from a professional app, see this instead:

  1. The zoom controls seem easily fixed:

Thanks for trying, but I don't think the Meet or Chrome teams are eager to adopt this UX. Controls belong directly in the Web application. Our user studies show quite poor discoverability for anything outside of the viewport.

jan-ivar commented 5 days ago

... Controls belong directly in the Web application. Our user studies show quite poor discoverability for anything outside of the viewport.

Is that bad news for the other capture-related controls in the chrome bar, plus Captured Surface Switching?

For a real use example directly from a professional app, see this instead:

Thanks! I agree looking at real apps is helpful. But how is this placement superior?

image

It's obscuring what I'm sharing.

Isn't it better here?

image

Here it would work (the same) in every VC app instantly, not just Meet, without permission or opt-in, without scaring anyone.

Did the user studies consider whether the consistent placement across all VC apps aided discoverability?

eladalon1983 commented 5 days ago

Co-chair Jan-Ivar, I worry that it would not help facilitate reaching consensus, if we were to bring yet more topics into a thread that is so chokeful of them. Youenn is limited to this format because he's not yet sure he can comment on the original repo. Since you have no such limitation, please consider filing separate issues on the original repo. Thank you.

Lest I seem to be avoiding your questions, my responses follow. But if at all possible, please spin off further discussion into distinct issues.

Is that bad news for the other capture-related controls in the chrome bar, plus Captured Surface Switching?

Discoverability is always a challenge. The general way to address it is to place the controls in the most immediately relevant place. In the case of dynamic-switching, that means moving the controls closer to the would-be newly-captured surface. In contrast, in the case of mic/camera controls, this means in the viewport, thanks to PEPC.

Thanks! I agree looking at real apps is helpful. But how is this placement superior? ... Isn't it better here?

Product managers and UX experts from both Web developers as well as Chrome browser have expressed the desire to see these specific control in the capturing Web application. I have no reason to doubt their expertise in this matter, impressed though I am with your own UX-design insights.

In our estimation, the UX you suggest is NOT preferable. (Ample context provided earlier in this comment, in our previous WG meetings, and during editors meetings.)

I am looking forward to seeing Firefox's experimentation with these browser-level controls. You will note that the Captured Surface Control spec does not hinder your ability to bake such functionality into your browser. Please do share any results you can once you have them.

Did the user studies consider whether the consistent placement across all VC apps aided discoverability?

Uniformity across applications is definitely a plus. But so are:

jan-ivar commented 3 days ago

With my co-chair hat, I recommend we separate issues in this repo then where we all can engage. I've moved my permission issue to https://github.com/w3c/mediacapture-screen-share-extensions/issues/14 to start. My preference would be for the reporter or proposer to do this, but happy to help. I'm also open to engage here.

With my member hat, my feedback was if browsers can reasonably provide zoom through their own UX for now, this seems useful for triaging: it lets us increment, which perhaps means we can focus on forwarding gestures for now. I plan to comment on @youennf's suggestion next, either here on in whatever issue we prefer.

eladalon1983 commented 3 days ago

I recommend we separate issues in this repo then where we all can engage

Agreed. And I further recommend that those who bring up new issues like this one, do so in a new issue.

if browsers can reasonably provide zoom through their own UX for now

Chrome does not plan to provide zoom through its UX. Does Firefox? Does Safari? If no browser has such plans, then this is not a real possibility.

jan-ivar commented 2 days ago

Based on this, I would look at an API along those lines:

partial interface HTMLVideoElement {
  attribute boolean enableGestureForwarding;
};

... This ... unties the API from permissions.

I favor this API. Reusing the playback coupling through video.srcObject is elegant, intuitive, and foolproof. We should consider letting UAs limit forwarding to when the capture is visibly playing, so users see what they're interacting with.

 controller.forwardGestures(e1, {wheel: true});

I do not favor this API. It allows mistakes like forwarding to a capture other than the one playing. Also, dictionary default rules mean controller.forwardGestures(e1) turns forwarding off which seems surprising and unintuitive.

I'm also not convinced apps need to cherrypick behaviors in the first version. I'm not convinced about divs.

Chrome does not plan to provide zoom through its UX. Does Firefox? Does Safari?

Firefox might put zoom controls in the video element (next to its PiP button).

eladalon1983 commented 2 days ago

partial interface HTMLVideoElement { attribute boolean enableGestureForwarding; };

I favor this API.

As we have covered on multiple threads ([1], [2]):

  1. This would prevent the use of a permission policy (because it does not return a promise).
  2. Web developers have a strong need for types other than HTMLVideoElement. (Rationale reiterated further down.)
  3. Limiting to HTMLVideoElement does not block any attack vector, so the limitation has no justification.
  4. What happens if the stream plugged into the HTMLVideoElement first goes through processing? This API shape scopes things to the track instead of the capture; this is a mistake.

Additionally:

  1. This API shape would not lend itself well to extension of the list of element types.
  2. This is not a reasonable point of exposure, as Web developers are not liable to understand that it relates to capture.

Points 2 and 3 are arguably captured by the previous discussion here, where I am still waiting for your response.

 controller.forwardGestures(e1, {wheel: true});

I do not favor this API. It allows mistakes like forwarding to a capture other than the one playing. Also, dictionary default rules mean controller.forwardGestures(e1) turns forwarding off which seems surprising and unintuitive.

The change from captureWheel(element) to forwardGestures(element, gestures) was intended to address Youenn's concerns about:

  1. Extensibility to additional gestures (e.g. pinch) without an additional method.
  2. Ability to forward from multiple elements to the same captured surface.

I'd argue that:

It was only out of a desire to be flexible and accommodating of Youenn, that I offered to change the API this way. But since this is blocking consensus rather than helping it - because Mozilla objects to this new shape - let's go back to captureWheel(element), which is good enough for all Web developers we've heard from.

I'm also not convinced apps need to cherrypick behaviors in the first version.

I don't understand this part. Please clarify...?

I'm not convinced about divs.

We have Web developer feedback that this is necessary, and their rationale is clear and reasonable; namely, they want to overlay the video with an element on which they draw emoji reactions and announcements, and these must not block scrolling. If you have reasons to doubt these Web developers, please share your thoughts. But let's concentrate on our role of serving Web developers and users.

Chrome does not plan to provide zoom through its UX. Does Firefox? Does Safari?

Firefox might put zoom controls in the video element (next to its PiP button).

First, your alternative does not solve the problem - Web developers need the ability to overlay customized controls zoom-controls at their chosen position, alongside additional app-level controls, or overlaid on top of the video preview tile. Developers need to be able to do either of these, and not all be forced into the same design.

Second, you do not provide a real commitment by Firefox to your proposed alternative. You say Firefox "might" do it; this implies it also might NOT do it.

jan-ivar commented 1 day ago
  1. This would prevent the use of a permission policy (because it does not return a promise).

The UA can prompt instead of scroll (or prompt instead of zoom +/- button). No promise needed.

  1. ... they want to overlay the video with an element on which they draw emoji reactions and announcements, and these must not block scrolling.

This sounds more like a general input problem, not something that needs to affect API. Doesn't CSS already have:

.overlay {
  pointer-events: none; /* Allows pointer events to pass through */
}

The UA controls the horizontal and the vertical, so I'm sure we can come up with a rule where it must forward scrolling here. Worst case: a div.enableGestureForwarding to forward gestures to the video element underneath.

  1. Limiting to HTMLVideoElement does not block any attack vector, so the limitation has no justification.

The attack vector was explained in https://github.com/w3c/mediacapture-screen-share-extensions/issues/14#issuecomment-2426657705. Tying scrolling to playback seems a reasonable direction to help UAs mitigate this.

  1. What happens if the stream plugged into the HTMLVideoElement first goes through processing?

I would leave it up to the UA to make a determination.

This API shape scopes things to the track instead of the capture; this is a mistake.

I don't believe it does that. If we agree on "multiple target-elements", I assume we agree interaction is carried upstream so it affects all of them.

... But since this is blocking consensus rather than helping it - because Mozilla objects to this new shape - let's go back to captureWheel(element)

Both enableGestureForwarding and forwardGestures are better names than captureWheel. I see no reason to go backwards on naming.

With my co-chair hat, I ask that we not "assign intent or interpretations to other contributors' comments". Mozilla has not formally objected to anything yet.

youennf commented 1 day ago

This thread starts to be long. My hope was to provide feedback in one place and then break down in smaller pieces that we could try to get consensus on. And then dive into API shape and bike shedding.

How do people think of the list of items in https://github.com/w3c/mediacapture-screen-share-extensions/issues/13#issuecomment-2422980720. Are there more items we can discuss one after the other?

I see one item seems already covered by https://github.com/w3c/mediacapture-screen-share-extensions/issues/14. I'll comment there. Maybe this is the first question we can sort out?

jan-ivar commented 1 day ago

That sounds reasonable to me.

eladalon1983 commented 1 day ago

The UA can prompt instead of scroll (or prompt instead of zoom +/- button). No promise needed.

The established pattern is to use permission policies and return a promise through which the application can detect whether permission was granted or not. Your suggestion does not align with any precedents I am familiar with, nor does it sound desirable on its own merits. (For instance, how does the application discover the result of the prompt? Is it a blocking prompt?)

Further, this comment shows how Youenn made the perfect argument for permission policies. Namely, he raised the need for user agents to allow revocation. I completely agree, and here we benefit of the fact that permission policies have already been specified and implemented, saving us the need to roll our own - and the mistakes we would surely have made.

Doesn't CSS already have ... Worst case: a div.enableGestureForwarding to forward gestures to the video element underneath.

We have clear feedback from developers about what they need to serve their users. The current API shape provides them just that. We do not have any indication about how limiting to video elements would improve the lives of users or developers.

The attack vector was explained in #14 (comment). Tying scrolling to playback seems a reasonable direction to help UAs mitigate this.

It has been sufficiently demonstrated that limiting to specific elements would not mitigate that attack vector. It would only annoy developers, but it won't stop abuse. (Nor do we have reason to expect abuse, btw. So much so that you argue we shouldn't even have a permission prompt...)

I would leave it up to the UA to make a determination.

I can easily think of how these heuristics could backfire. I cannot imagine what it would gain us. (See previous comments about not preventing abuse.)

I don't believe it does that. If we agree on "multiple target-elements", I assume we agree interaction is carried upstream so it affects all of them.

I don't understand this comment.

Both enableGestureForwarding and forwardGestures are better names than captureWheel. I see no reason to go backwards on naming.

If you have a name that matches with the captureWheel() shape, I am fine with that. For example, forwardWheel(element). But since you do not like forwardGestures(element, gestures), that one is out.

With my co-chair hat, I ask that we not "assign intent or interpretations to other contributors' comments". Mozilla has not formally objected to anything yet.

You wrote: "I do not favor this API." Regardless of formality, it follows that a pivot to this API shape is not helping us reach consensus. But if you do prefer forwardGestures(element, gestures) over other alternatives that are currently on the table, that is still an option for me.


How do people think of the list of items in #13 (comment).

I would appreciate:

  1. An exhaustive list of open issues (or as close to exhaustive list as can be reasonably expected).
  2. An indication of which points have been satisfactorily address and may be closed.