Expose a MessagePort using Capture Handle

eladalon1983 commented 1 year ago

Problem Statement

When an application screen-captures another, it is often useful for the user if these two apps can then integrate more closely with each other. For example, a video conferencing application may allow the user to start remote-controlling a slides presentation.

Capture Handle introduced the ability for a capturee to declare its identity to a capturer. This identity can be used to kick off communication, either over shared cloud infrastructure, or locally, e.g. over a BroadcastChannel. Local communication is more efficient and robust, and is therefore much preferable. But what if the two apps are separated by Storage Partitioning? For that, it’s useful to set up a dedicated MessagePort between capturer and capturee.

Scoping

Note that a MessagePort cannot address all use cases we have in mind, and cannot replace Capture Handle, nor some of Capture Handle's future extensions.

Conditional Focus requires an immediate decision, or else the window of opportunity closes.
Loosely-coupled applications have no use for MessagePort, as the messages flowing over it will be in an unrecognized format.

The discussion is therefore scoped to the use case we can hope to address - improving things for tightly-coupled applications after capture has started and Conditional Focus decided, so as to allow a more ergonomic, efficient and robust communication.

Challenges

We note some challenges that a good solution must address:

A captured tab’s top-level document may be navigated at any time. When that happens, any MessagePort that the capturer holds from before becomes useless. The capturer should stop using it. An event is needed.
Similarly, if surfaceSwitching is specified, users may change the captured tab at any time.
The captured document may become ready to receive messages either before or after the capture starts. This again suggests that the capturer needs an event.
Multiple concurrent captures are possible. (The capturers may be distinct - or not.)
It is desirable that the capturee would only become alerted to the presence of a new capture-session, if the capturer chooses to take an action that reveals this.

Proposed Solution

Observe that Capture Handle already produces events that can be used on the capturing side to address the challenges specified above.

Extend CaptureHandleConfig with an event handler:

partial dictionary CaptureHandleConfig {
  EventHandler newCapturerEventHandler;
};

This allows the capturee to receive a dedicated event with a MessagePort whenever a capturer chooses to initiate contact.

interface NewCapturerEvent {
  attribute Type type;  // "started" or "stopped"
  attribute MessagePort port;
}

A channel is established for the capturee when it gets a NewCapturerEvent with type set to "started". When the session ends, the capturee gets a new event with the very same port, but with type now set to "stopped".
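As a rough illustration of the "started"/"stopped" lifecycle, the capturee could track live channels by port identity. This is only a sketch, assuming the NewCapturerEvent shape proposed above; createCapturerTracker and onChange are hypothetical names that are not part of the proposal.

```javascript
// Sketch only: NewCapturerEvent is the proposed (unshipped) event shape.
// createCapturerTracker and onChange are hypothetical names.
function createCapturerTracker(onChange) {
  const activePorts = new Set();
  return {
    activePorts,
    // Intended to be passed as newCapturerEventHandler.
    handleEvent(event) {
      if (event.type === "started") {
        activePorts.add(event.port);
      } else if (event.type === "stopped") {
        // "stopped" carries the very same port as the matching "started".
        activePorts.delete(event.port);
      }
      // Report how many channels are currently live.
      onChange(activePorts.size);
    },
  };
}
```

The onChange callback could, for instance, show or hide capture-related controls depending on whether any channel is still live.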

To trigger the "started" event on the capturee, a capturer calls the following API:

partial interface CaptureController {
  MessagePort getMessagePort();
}

To check if it makes sense to call getMessagePort(), the capturer must check CaptureHandle.supportsMessagePort.

partial dictionary CaptureHandle {
  boolean supportsMessagePort;
};

The value of CaptureHandle.supportsMessagePort is determined by whether the capturee has set a handler or not.

The capturee may change the CaptureHandleConfig without breaking off existing channels.

The channel is broken if:

The capture-session ends for whatever reason. (User-initiated or app-initiated.)
The capturee is navigated.
The user uses dynamic switching to change the captured surface.

We extend the capturehandlechange event to help the capturer distinguish non-channel-breaking events from channel-breaking events.

interface CaptureHandleChangeEvent {
  attribute boolean messagePortInvalidated;
}

Fine Details

getMessagePort() throws if !getCaptureHandle().supportsMessagePort.
getMessagePort() returns a port leading to the capturee indicated by the last capturehandlechange which was processed by the capturer. This MessagePort might already be useless, e.g. if the captured tab has been asynchronously navigated. This will be detected by the capturer when it processes the relevant event.
If the user uses dynamic switching to change away from a tab and back to it, the old channel remains disconnected. The capturer and capturee may establish a new connection if they still want to talk.
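To make the capturer-side bookkeeping concrete, here is a minimal sketch of the decision taken on each capturehandlechange. It assumes the proposed messagePortInvalidated and supportsMessagePort attributes; planPortUpdate is a hypothetical helper, and the event and handle arguments are plain stand-ins.

```javascript
// Sketch only: messagePortInvalidated and supportsMessagePort are the
// attributes proposed in this issue; planPortUpdate is a hypothetical helper.
function planPortUpdate(hasPort, event, handle) {
  // A channel-breaking change means the old port must be dropped.
  const closeOldPort = hasPort && event.messagePortInvalidated === true;
  const stillConnected = hasPort && !closeOldPort;
  // getMessagePort() would throw unless the capturee registered a handler.
  const canConnect = Boolean(handle && handle.supportsMessagePort);
  return { closeOldPort, requestNewPort: !stillConnected && canConnect };
}
```

A capturer would call this from its capturehandlechange listener, close the old port when closeOldPort is true, and call getMessagePort() only when requestNewPort is true.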

Security Considerations

Captured apps are encouraged to validate the origin of messages. As MessagePorts are transferable, it is imperative to check each individual message's origin.

Open Questions

Should the capturer be allowed to call getMessagePort() multiple times and establish multiple connections with the same capturee? That could potentially mislead the capturee as to how many capture-sessions there are. However, that seems like a niche concern, especially given that the apps are tightly-coupled.

Sample Usage

On the captured side:

function onPageLoaded() {
  navigator.mediaDevices.setCaptureHandleConfig({
    exposeOrigin: true,
    handle: "...",
    permittedOrigins: [...],
    newCapturerEventHandler: onNewCapturer,
  });
}

function onNewCapturer(event) {
  if (event.type == "started" &&
      IsTrustedOrigin(event.origin)) {
    StartCommunicationWithNewCapturer(event.port);
  }
}

On the capturing side:

const controller = new CaptureController();
const stream = await navigator.mediaDevices.getDisplayMedia({ controller });
const [track] = stream.getVideoTracks();
track.oncapturehandlechange = (event) => {
  const handle = track.getCaptureHandle();
  if (handle && IsTrustedOrigin(handle.origin) &&
      handle.supportsMessagePort) {
    // getMessagePort() is proposed on CaptureController, not on the handle.
    StartCommunicationWithCapturee(controller.getMessagePort());
  }
};

eladalon1983 commented 1 year ago

@jan-ivar, any thoughts here?

youennf commented 1 year ago

A few thoughts/suggestions:

It should be required for capturer to know the captured origin to do postMessage. The fact capture handle is exposed at the track but not at controller level is something we should think of.
We should try to reuse as much as possible existing web API patterns, at least this early in the design. The usual design is to expose postMessage()/onmessage and allow applications to transfer MessagePorts.
I would try to make the proposal as small as possible, we can always extend it later. Some of the APIs you are describing might not be strictly required, for instance supportsMessagePort or started/stopped.

How about the following:

partial interface CaptureController {// Should it be named DisplayCaptureController?
    undefined postMessage(...);
    attribute EventHandler onmessage;
}

partial interface MediaDevices {
    // This event has a source attribute of type DisplayCapturer
    attribute EventHandler oncapturermessage; 
}

interface DisplayCapturer {
    undefined postMessage(...);
}

The assumption is that CaptureController would have access to the latest CaptureHandle information, which is not the case right now. Another thing to consider is whether postMessage actually drops messages in case capture ends or the captured surface changes. We can probably add non-racy checks at postMessage call and event firing times, but I am unsure whether this is actually needed.

arnaudbud commented 1 year ago

Dialpad would benefit if a video-conferencing product were able to securely remote-control a presentation product, locally in the same browser, as well as remotely by another participant. I support this proposal.

jan-ivar commented 1 year ago

I like the minimal API in https://github.com/w3c/mediacapture-handle/issues/70#issuecomment-1406345920 on the controller, assuming postMessage is modeled on the one from Window:

partial interface CaptureController {
  undefined postMessage(any message, USVString targetOrigin, optional sequence<object> transfer = []);
  undefined postMessage(any message, optional WindowPostMessageOptions options = {});
};

That way, apps have targetOrigin to deal with navigation in the capturee.
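For illustration, if delivery were gated the way Window.postMessage gates it (a mismatched targetOrigin silently drops the message), the check might look roughly like this. It is a simplified sketch: wouldDeliver is a hypothetical helper, the "/" shorthand is not modeled, and nothing here is specified for CaptureController yet.

```javascript
// Sketch only: simplified origin matching in the style of Window.postMessage.
// wouldDeliver is a hypothetical helper; the "/" shorthand is not modeled.
function wouldDeliver(targetOrigin, capturedUrl) {
  // "*" matches whatever document is captured at delivery time.
  if (targetOrigin === "*") return true;
  // Otherwise the captured document's origin must match exactly.
  return new URL(targetOrigin).origin === new URL(capturedUrl).origin;
}
```

With an explicit targetOrigin, a message sent just before the capturee navigates cross-origin would be dropped rather than delivered to the new document; with "*" it would still be delivered.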

youennf commented 1 year ago

That way, apps have targetOrigin to deal with navigation in the capturee.

Adding targetOrigin would make it possible to decorrelate the two APIs, but would allow capturer to span captures with '*'.

There are a couple of questions in that area that would help drive the exact algorithms and API shapes:

If capturee did not opt-into CaptureHandle (but registered the event listener), should capturer be able to postMessage events?
If capturer and capturee relationship is fully broken (capture stopped say), what should be the behavior? Should postMessage continue to work? Should it silently fail? Should it fire to the new capturee if same origin?
If capturer and capturee relationship is paused (user changed capturee surface), what should be the behavior? Should postMessage continue to work? Should it silently fail? Should it fire to the new capturee if same origin?

I think we have ways to build whatever we want there. I would tend to be strict in an initial version, and think about relaxing the rules progressively.

eladalon1983 commented 1 year ago

It should be required for capturer to know the captured origin to do postMessage.

Code that cares about the target's origin would look like this:

if (track.getCaptureHandle().origin == myExpectedOrigin) {
  // postMessage() and so on.
}

The comparison fails for any "real" value of myExpectedOrigin if .origin is undefined. I don't understand what we'd gain from forcing the origin to be exposed, let alone why it's important. Could you please explain?
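To spell that out, here is a tiny sketch in which plain objects stand in for CaptureHandle; isExpectedCapturee is a hypothetical helper name.

```javascript
// Sketch only: handle objects here are plain stand-ins for CaptureHandle.
function isExpectedCapturee(handle, expectedOrigin) {
  // handle.origin is undefined when the capturee chose not to expose it,
  // so the comparison can only succeed for an exposed, matching origin.
  return Boolean(handle) && handle.origin === expectedOrigin;
}
```

An app that cares about the origin therefore already refuses to message a capturee that hides it, without the API having to force exposure.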

The fact capture handle is exposed at the track but not at controller level is something we should think of.

Maybe we could expose on the controller in addition to the track. But I think it's important to retain the API surface that's already on the track, because tracks are transferable, and CaptureControllers are not. A receiver of a transferred track might be interested in learning that the track represents a capture of a tab tuned to a specific origin. It would NOT be possible to learn that if exposure is only on the controller, because message passing is async and the information might be out of date by the time the controller's iframe responds (e.g. navigation).

// Should it be named DisplayCaptureController?

I prefer shorter names.
I think it produces potential for confusion - someone could mistake it as only controlling screens.
Renaming would break existing Web apps that already make use of this interface. What gains offset this cost?


partial interface CaptureController {
  ...
  undefined postMessage(...);
  ...
}

I think it's undesirable to expose postMessage on the controller, again because the track is transferable and the controller is not. If an iframe ORIGINAL_CAPTURER initiates the capture and then transfers the track to iframe IFRAME_X, why should IFRAME_X need to keep on bothering ORIGINAL_CAPTURER with requests to relay messages to the capturee on its behalf?

In fact, I now think I've not gone far enough to begin with. I think we should expose the port-getter on the track itself or on the capture handle. Even if we were to make the controller transferable, it would not be enough, because tracks are cloneable, and clones might be posted to different targets.

but would allow capturer to span captures with '*'

When the top-level is navigated, the new capturee needs to register a new listener, and it should only get messages sent expressly to it. My proposal ensures that, by killing off the old port and forcing the capturer to set up a new one.

If capturee did not opt-into CaptureHandle (but registered the event listener), should capturer be able to postMessage events?

No, it should error. Sending messages to someone that cannot receive them is an app-error, and the app should be made aware, so that its developers may fix the issue. I believe my proposal addresses that through supportsMessagePort.

If capturer and capturee relationship is fully broken (capture stopped say), what should be the behavior? Should postMessage continue to work? Should it silently fail? Should it fire to the new capturee if same origin?

It should stop loudly. I believe my proposal addresses that through NewCapturerEvent{type: "stopped"} and the killing off of the port.

If capturer and capturee relationship is paused (user changed capturee surface), what should be the behavior? Should postMessage continue to work? Should it silently fail? Should it fire to the new capturee if same origin?

That's a new issue. I think it's orthogonal to other design decisions facing us atm. If you agree (do you?), I propose tackling it after we settle other issues.

eladalon1983 commented 1 year ago

One lens to look at things through is - if a track is cloned, and the clones are transferred to two different iframes IF_A and IF_B, then:

Those two iframes should be allowed to communicate to the captured content independently.
Messages arriving from either should be clearly distinct in the receiver, even if IF_A and IF_B are same-origin.
Neither IF_A nor IF_B should need to ask the original initiator of the session (who owns the singular CaptureController instance) to do anything more on their behalf.
- And recall that the capturee could be navigated, which invalidates MessagePorts the original capturer might have produced on behalf of IF_A and IF_B. We don't want IF_A and IF_B to ask the initiator iframe to help each time this happens.
Anything IF_A sends to the captured content, should not be delivered if in the intervening time the tab was navigated, changed, etc.
If IF_A tries to send a message to a target that could not possibly receive it, that's an app-bug, and should result in an exception. (Timing issues notwithstanding.)

I believe my proposal addresses all of these, modulo that I need to change:

partial interface CaptureController {
  MessagePort getMessagePort();
}

To:

partial interface MediaStreamTrack {
  MessagePort getMessagePort();
}

(Or possibly make CaptureHandle an interface rather than a dictionary, and expose it there.)

eladalon1983 commented 1 year ago

@youennf and @jan-ivar, thank you for providing verbal feedback; could you please provide written feedback here, lest we misremember our discussions?

This proposal was briefly presented yesterday at the Screen Capture Community Group March 2023 meeting and there was Web developer interest. It would be good to settle on a shape soon; we intend to implement an origin trial of this API in Chrome soon.

youennf commented 1 year ago

I was not at yesterday's meeting so I am not sure which proposal was presented.

If the discussion is about postMessage vs. getMessagePort, my recollection of our past informal discussions is that there was agreement that the postMessage approach supports all use cases the getMessagePort approach would. The postMessage approach has the benefits of building on a proven pattern (we are on solid ground here) that is already widely in use (good for web developers).

eladalon1983 commented 1 year ago

I was not at yesterday's meeting so I am not sure which proposal was presented.

Sorry for being ambiguous; I meant "this proposed shape which I have presented in this thread."

If the discussion is about postMessage vs. getMessagePort, my recollection of our past informal discussions is that there was agreement that the postMessage approach supports all use cases the getMessagePort approach would.

I think that it's a feature that getMessagePort() gives you a port that will auto-cancel itself when the capture stops. It means the captured application can trust that it's only sending back messages to an entity that's still capturing it. And if it ever wants a port that survives this and persists for longer, then that's still possible using the post-a-port-over-a-port technique you had described.
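For reference, the post-a-port-over-a-port technique can be sketched as follows. The transientPort argument stands in for the port that the proposed getMessagePort() would return, and the message shape is a hypothetical convention between the two apps.

```javascript
// Sketch only: transientPort stands in for the port that the proposed
// getMessagePort() would return; the message shape is a made-up convention.
function upgradeToPermanentChannel(transientPort) {
  // A regular MessageChannel is not tied to the capture session.
  const { port1, port2 } = new MessageChannel();
  // Transfer one end over the transient channel; keep the other end locally.
  transientPort.postMessage({ type: "permanent-port", port: port2 }, [port2]);
  return port1;
}
```

The receiving side would pick the transferred port out of the message data and keep using it after the transient port has been invalidated.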

The postMessage approach has the benefits of building on a proven pattern (we are on solid ground here) that is already widely in use (good for web developers).

I think my proposed method is also on solid ground, as it uses MessagePort.

eladalon1983 commented 1 year ago

I'm going to jot down a list of the benefits and drawbacks of the two approaches soon and solicit some more feedback.

youennf commented 1 year ago

I think that it's a feature that getMessagePort() gives you a port that will auto-cancel itself when the capture stops.

postMessage can allow this naturally, if we decide so. Note also that, to implement this rule in MessagePort, we would need to create a new special Message flavour, which does not seem great.

I haven't made up my mind on whether we should enforce this rule or not, it would be worth digging into this (feedback provided earlier in this thread https://github.com/w3c/mediacapture-handle/issues/70#issuecomment-1415366828).

It means the captured application can trust that it's only sending back messages to an entity that's still capturing it.

MessagePorts are transferable so there is no guarantee that the message will be processed by the capturing application. MessagePort and capturing application may also live in different processes/different threads leading to unavoidable race conditions.

The postMessage approach gives us more flexibility here. If we want to, we can decide to enforce this rule without any race conditions.

I'm going to jot down a list of the benefits and drawbacks of the two approaches

Before diving into API shape, it would be good to nail down the exact behavior we want. Pros and cons are always good though, let's continue this discussion in a more structured way.

eladalon1983 commented 1 year ago

postMessage can allow this naturally, if we decide so.

That's the capturer->capturee direction. But we want bidirectional messaging, which requires a MessagePort be posted back. And since this will just be a normal, run-of-the-mill MessagePort (we don't atm have any other kind), it won't exhibit this special behavior.

But if we expose a new MessagePort through a getter, we can specify in the getter itself this new behavior. We don't need to modify MessagePort itself.

MessagePorts are transferable so there is no guarantee that the message will be processed by the capturing application.

Transferring the port is delegating; I see it as equivalent to relaying the messages themselves. What my proposal guarantees is that the messages will only be transmitted as long as the capture session is active.

The postMessage approach gives us more flexibility here. If we want to, we can decide to enforce this rule without any race conditions.

Could you help me understand why the approaches are different wrt races? Do you mean that if a task starts executing before the session-capture stopped, then postMessage(x) will deliver x even if the session-capture ends while the task is executing? If so, I don't see it as desirable, since tasks can run arbitrarily long.

Before diving into API shape, it would be good to nail down the exact behavior we want.

1. Bidirectional messaging.
2. The capturer initiates.
3. Distinct capture-sessions lead to distinct channels.
4. The channel is transient - it becomes invalidated if the capture session ends, or if capture-session is retargeted (e.g. share-this-tab-instead).
5. Both sides get events informing them of invalidation. (Apps can easily avoid doing work to put together messages that won't be delivered. Captured apps can hide away user-facing elements that are only relevant while the capture session is ongoing.)
6. [Known non-issue] Transient channels can be used to establish permanent ones (through posting a regular MessagePort).
7. Sender can limit the origin to which the message will go. (On either side.)
8. Receiver can detect which origin the message came from.
9. Capturing apps that comprise multiple iframes from multiple origins, can easily move the ownership of the tracks and their associated communication channels, and do not need to resort to cumbersome internal messaging. ("Please send this to the capturer if it has not changed since you have last informed me that it was ${origin}, which was ${notification_num} from my perspective.")

youennf commented 1 year ago

we want bidirectional messaging

postMessage handles this with MessageEvent.source. We talked about this in past meetings, though it was never clearly described here. Let me know if it would help to write down more about this in this thread.
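A rough sketch of the reply path, assuming the capturermessage event carries a source object with postMessage() as in the DisplayCapturer shape suggested earlier; replyToCapturer is a hypothetical helper and none of this is a shipped API.

```javascript
// Sketch only: assumes the capturermessage event exposes a source with
// postMessage(), as in the DisplayCapturer shape suggested earlier.
function replyToCapturer(event, reply) {
  // Without a source there is nobody to reply to.
  if (!event.source) return false;
  event.source.postMessage(reply);
  return true;
}
```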

we can specify in the getter itself this new behavior.

I am not clear about this. Either the getter is the place we check and then the MessagePort is live and will not be severed. Or the MessagePort might be severed if capture changes, which would be a change to how MessagePort works, so this will require changes/hooks to the MessagePort spec itself. Or it would

With regards to behavior, I think we agree on 1, 2, 3, 6, 7, 8. About 4 and 5, it would be good to get use cases to motivate this. In any case, 1 to 8 are achievable with the postMessage approach.

6 is interesting in that the MessagePort approach would use the same object (MessagePort) for both transient channels and permanent channels. The postMessage approach would only use MessagePort for permanent channels.

9 is not about behavior but about ergonomics.

eladalon1983 commented 1 year ago

we want bidirectional messaging

postMessage handles this with MessageEvent.source.

Both solutions provide bidirectional messaging, so we seem to agree on this being a requirement. Great!

we can specify in the getter itself this new behavior.

[...] so this will require changes/hooks to the MessagePort spec itself. Or it would

Please note that you have an unterminated thought there. I'd love to hear the rest of it.

Here is how I generally envision it happening without new hooks in the MessagePort spec:

Return a MessagePort `MP1`, which is entangled to the MessagePort `MP2` in the captured app.
[...]
Run the [sever connection algorithm] if any of the following happens:
* The capture session ends.
* If the user ever instructs the user agent to change the capture source.
* If the top-level document of the captured application is navigated cross-page.

Where the sever connection algorithm roughly consists of:

* Disentangle the ports.
* Queue events on both sides to inform the relevant apps that the ports are invalidated.
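As a single-process illustration of the connection-severing steps above, one could imagine something like the sketch below. TransientChannel, the "invalidated" event name and the reason field are all hypothetical, and events are dispatched synchronously here rather than queued, purely for brevity.

```javascript
// Sketch only: TransientChannel, "invalidated" and reason are hypothetical
// names. A real user agent would queue tasks, possibly across processes.
class TransientChannel {
  constructor() {
    const { port1, port2 } = new MessageChannel();
    this.capturerPort = port1;
    this.captureePort = port2;
    this.capturerEvents = new EventTarget();
    this.captureeEvents = new EventTarget();
    this.severed = false;
  }

  // Returns true the first time the channel is severed, false afterwards.
  sever(reason) {
    if (this.severed) return false;
    this.severed = true;
    // Closing a MessagePort disentangles it from its peer.
    this.capturerPort.close();
    this.captureePort.close();
    // Inform both sides that their ports are no longer usable.
    for (const target of [this.capturerEvents, this.captureeEvents]) {
      target.dispatchEvent(Object.assign(new Event("invalidated"), { reason }));
    }
    return true;
  }
}
```

Each of the three triggers listed above (session end, source change, cross-page navigation) would map to one sever() call with a different reason.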

About 4 and 5, it would be good to get use cases to motivate this.

The capturer knows when it's capturing X. The capturer knows if the capture session is stopped, either through the capturer's own action or the user's. Through Capture Handle's existing events, the capturer even knows when the capturee changes. But the capturee doesn't know any of this.

So to name just one use case to motivate 4 and 5 - once a channel is established, the capturee might expose user-facing controls to produce action in the capturer. ("Start recording; stop recording; save to disk; discard recording.") Such user-facing controls would have to be hidden away when they become inactionable, which is the case when the capture-session stops.

6 is interesting in that the MessagePort approach would use the same object (MessagePort) for both transient channels and permanent channels. The postMessage approach would only use MessagePort for permanent channels.

Same class, not same object. I don't see it as an issue. Do you?

9 is not about behavior but about ergonomics.

The level of complexity in the app code to handle navigation of the captured-tab would be staggering, and race-prone. This goes beyond mere ergonomics.

eladalon1983 commented 1 year ago

P.S:

postMessage handles this with MessageEvent.source.

Won't we need to modify the MessagePort spec in some way to ensure that CaptureController.onmessage, which is proposed in this comment, is the target of MessageEvent.source?

Do we really want CaptureController to expose postMessage() and onmessage? Should we not have it just expose a port? (I don't support this proposal of yours, but I'd like to not-support the best possible version of it... :-))

youennf commented 1 year ago

Won't we need to modify the MessagePort spec in some way to ensure that CaptureController.onmessage, which is proposed in this comment, is the target of MessageEvent.source?

No change to MessagePort spec needed. The only change outside of WebRTC land would be to update MessageEventSource WebIDL type definition, which is already a known extension point since it is a union type.

Here is how I generally envision it happening without new hooks in the MessagePort spec:

This proposed algorithm is very imprecise, it would be hard to implement it in an interoperable manner. It does not take into account that MessagePorts live in different processes for instance, or that capture lives in another process. I would tend to stick to how specs are currently written these days, something like:

Enqueue a task on the capturee queue that runs the following steps:
If capturee[capturedBy][capturerId] is undefined, abort these steps.
Fire an event blabla...

If we were to do that at MessagePort level, we would need to update https://html.spec.whatwg.org/multipage/web-messaging.html#message-port-post-message-steps, ditto for implementations which would break isolation of MessagePort code from capture code.
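The slot-based check in those steps could be pictured roughly as follows. This is only a sketch: capturedBy mirrors the capturee[capturedBy][capturerId] slot from the steps above, while createCapturee, deliverMessage and the received array are hypothetical stand-ins for user-agent internals.

```javascript
// Sketch only: capturedBy mirrors the capturee[capturedBy][capturerId] slot
// in the steps above; every other name here is hypothetical.
function createCapturee() {
  return { capturedBy: new Map(), received: [] };
}

// Body of the task that would be enqueued on the capturee's queue.
function deliverMessage(capturee, capturerId, message) {
  // If the capturer is no longer registered, abort these steps.
  if (!capturee.capturedBy.has(capturerId)) return false;
  // Stands in for firing the message event on the capturee.
  capturee.received.push({ capturerId, message });
  return true;
}
```

Because the check runs inside the task on the capturee side, a message posted just before capture stops is dropped if the slot has been cleared by the time the task runs.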

the capturee might expose user-facing controls to produce action in the capturer.

This seems reasonable and would call for exposing display capturer as its own object. Events in DisplayCapturer to expose change of capturer state would be a natural fit I think. It would integrate pretty well with the above slot based algorithms to provide precise and consistent state exposures to web pages.

Same class, not same object. I don't see it as an issue. Do you?

It is not great to use the same class to represent two things that have different behaviours.

9 is not about behavior but about ergonomics.

The level of complexity in the app code to handle navigation of the captured-tab would be staggering, and race-prone. This goes beyond mere ergonomics.

Our opinions differ here, but at this stage, this is nothing more than opinions. In terms of ergonomics, code examples or similar would help. In terms of race conditions, deeper analysis of what is racy would be needed.

w3c / mediacapture-handle