w3c / mediacapture-handle

https://w3c.github.io/mediacapture-handle/
Other
14 stars 10 forks source link

Should the handle be an object? #68

Open eladalon1983 opened 1 year ago

eladalon1983 commented 1 year ago

At the moment, the handle field is a string. This limits it to stringable objects rather than serializable objects. If we change to an object, applications will be able to expose CropTargets as well.

navigator.mediaDevices.setCaptureHandleConfig({
  exposeOrigin: true,
  permittedOrigins: ['*'],
  handle: {
    field1: "value1",
    field2: "value2",
    cropTargets: {
      label1: cropTarget1,
      label2: cropTarget2,
    },
  },
});

@youennf, I think you were amenable to this during TPAC 2022...? @jan-ivar, your opinion?

Note that I still intend to pursue orthogonal proposals for:

  1. Structured exposure of CropTargets.
  2. Structured exposure of a suggested contentHint.
  3. Exposing a MessagePort (transferable but not serializeable, btw) for bidirectional unstructured messaging. (The handle is one way and immediate, which is important [a] for Conditional Focus without exceeding the task-end limit, and [b] for capturers that don't wish to communicate with the capturee and alert it to the existence of a capture session.)
youennf commented 1 year ago

@youennf, I think you were amenable to this during TPAC 2022...?

I think this is a reasonable answer to the crop target on another tab use case. An alternative to consider is to enable MessagePort level communication.

One thing to consider is that the current 1024 bytes size might be harder to define with objects.

As a side note, and this also applies to structured exposure of CropTargets, if a web application starts to create CropTargets proactively, UAs should make the creation of such objects cheap and only start expensive operations when these objects are actually used to crop. This is another argument in favour of creating CropTarget synchronously.

eladalon1983 commented 1 year ago

One thing to consider is that the current 1024 bytes size might be harder to define with objects.

I intended that limitation to defang potential attacks where a captured site overwhelms a capturing site with extremely large messages. That's probably¹ not terribly concerning, though, as the captured site would have to pay as much for the attack as would the victim, and the cost would be paid without the attacker even knowing if there is a victim, or who it is. I think it's safe to drop this limitation².

An alternative to consider is to enable MessagePort level communication.

There is value in the capturer being able to read something about the capturee, without letting the capturee know that there is a capturer.

As a side note, and this also applies to structured exposure of CropTargets, if a web application starts to create CropTargets proactively, UAs should make the creation of such objects cheap and only start expensive operations when these objects are actually used to crop. This is another argument in favour of creating CropTarget synchronously.

Agreed. (But there are other arguments both ways.) For now, I am not concerned, because I expect a very limited number of CropTargets. But, yes, that is indeed a valid argument. If we later observe that sites often expose several CropTargets, this argument could increase in weight.

-- [1] Or wdyt? [2] I intend to ask Chrome's security experts.

eladalon1983 commented 1 year ago

To help drive collaboration between loosely-coupled apps, I have presented the following earlier today.

Assume:

dictionary ContentHints {
  DOMString audio;
  DOMString video;
};

dictionary CropTargetData {
  DOMString name;
  CropTarget cropTarget;
  ContentHints suggestedContentHint;
};

dictionary CaptureHandleConfig {
  ...  // Pre-existing.

  // Suggested Content-Hint for uncropped video.
  ContentHints suggestedContentHints;

  // CropTargets of interest.
  sequence<CropTargetData> cropTargets;
};

Then a VC app could do something like this:

const stream = navigator.mediaDevices.getDisplayMedia(constraints);
const [track] = stream.getVideoTracks();
const captureHandle = track.getCaptureHandle();

if (Object.keys(captureHandle?.cropTargets).length > 1) {
  // 1. Cycle through the crop-targets,
  // grabbing one frame with each.
  const thumbnails = GetThumbnails(track);

  // 2. Display an in-content picker.
  // The user chooses if they want to share
  // the entire tab, or only a specific part of it.
  const target = PromptUser(track, thumbnails);

  // 3. Apply user’s choice.
  track.cropTo(target);  // No-op if `target` is `undefined`.

  // 4. Transmit remotely.
  TransmitRemotely(track);
}

To put a cherry on top, in the TransmitRemotely() step, the VC app can - if it chooses - use the suggestedContentHint tied to the specific region.

This means that Teams, Zoom and Meet could all offer great functionality and performonace when capturing YouTube, Hulu, Slides, PowerPoint or anything else.

jan-ivar commented 1 year ago

TL;DR: I don't think handle should be an object, as this seems like a design break.

captureHandle.handle was billed as an "identity" string for "Bootstrapping" "a connection" [1]:

The means for transmitting these messages are outside the scope of this document. Some options are:

  • Shared cloud infrastructure.
  • Messaging via a worker. (Note: Storage Partitioning might disrupt this option.)
  • A rudimentary messaging API might be added expressly for this purpose.

Letting handle accept serializable objects doesn't fit with that design, and takes us instead another step closer to reinventing postMessage, which I warned about in https://github.com/w3c/mediacapture-handle/issues/11.

If storage partitioning has made out-of-band messaging nonviable, CropTargets aren't serializable, and we're trying to send objects through handle, then we seem overdue for the third option IMHO... a rudimentary messaging API.

A way forward on that might be:

  1. Address https://github.com/w3c/mediacapture-handle/issues/11 first
  2. Move captureHandle to captureController first to avoid message fan-out so we don't spam every track clone
  3. Specify a port on each captureController and a specialized port somewhere on the capturee side that caches the last message sent on it ahead of capture, to surface it on the captureController upon gDM success.

If the capturer never responds, then the capturee never learns it's being captured.

eladalon1983 commented 1 year ago

If the capturer never responds, then the capturee never learns it's being captured.

I disagree with most parts of the previous message, but I think we can zero in on one point first - the one I quoted. We fundamentally disagree on how MessagePort would look like and how it could be used. I don't think the model you suggest is even possible, let alone desirable.

First, feasibility. A tab can be captured by multiple capturers. A design centered around a MessagePort will require that the captured page register an event handler to receive a MessagePort after the capturer "dials in" by calling CaptureController.getMessagePort(). That means that captured pages cannot send any messages ahead of knowing if they're being captured.

Second, even if this model were possible, it's evident that you noticed the following issue and tried to sidestep it - apps shouldn't have to waste CPU to repeatedly send the same message until a capturer materializes, which is why you introduced the caching mechanism. That is, you reinvented the handle.

jan-ivar commented 1 year ago

Let's take a step back then. My main point was:

we seem overdue for the third option IMHO... a rudimentary messaging API.

Do we agree on this?

This issue already tries to reinvent handle, opening up for alternative ideas. Neither a string identifier nor a traditional message port seem to satisfy requirements, so some innovation seems unavoidable. Since iteration has failed us, I think we need to reexamine the requirements, and figure out what shape makes most sense for them.

With this issue and https://github.com/w3c/mediacapture-handle/issues/11, handle is veering toward reinventing broadcastChannel.postMessage except across origins IMHO. At the same time, permittedOrigins smells like a list of targetOrigins, a feature of window.postMessage. And lastly, a MessagePort can only be transferred over the latter (since broadcast naturally cannot transfer).

The previous paragraph mentions every object that implements postMessage() AFAIK, and while none of them fit, the idea that postMessage might at least be a shape we should consider hopefully seems reasonable.

So for sake of illustration only, we might imagine a broadcast-channel-like API with targetOrigin, something like:

// capturee
navigator.mediaDevices.capturer.postMessage({id, hint, cropTarget}, "*");
navigator.mediaDevices.capturer.onmessage = e => iRecognize(e.origin) && my1on1Port = e.port;

// capturer
captureController.onmessage = e => {
  if (iRecognize(e.origin)) {
    const {port1, port2} = new MessageChannel();
    my1on1Port = port1;
    captureController.postMessage(port2, {targetOrigin: e.origin, transfer: [port2]});
  }
};
eladalon1983 commented 1 year ago

First, this illustration assumes that the capturee can only be captured by a single capturer. That is false, as I have pointed out in my last message. I have also explained that this should not be abstracted away for the sake of illustration, because it has corollaries. Namely, the only model that will work for producing a MessagePort, is one in which the capturee registers to be notified when the capturer tries to establish a MessagePort to it, and this notification has ramifications on issues which you have expressed concern for - self-censorship.

Second, this does not work for loosely-coupled applications, as they would not be aligned on the protocol for the messages sent over the MessagePort. Again - a use-case you have previously cared much about. It is for those loosely-coupled apps that I have produced the proposal in this message.

jan-ivar commented 1 year ago

No, the illustration is of a new API where mediaDevices.capturer.postMessage() is broadcast (one-to-many) and cannot accept transferable objects, whereas the other direction captureController.postMessage can (since there is only one recipient).

Self-censorhip is already possible in the current spec once the capturer establishes a "connection".

I don't think we should solve more features for loosely-coupled applications atm.

jan-ivar commented 1 year ago

So to clarify the illustration in https://github.com/w3c/mediacapture-handle/issues/68#issuecomment-1284321553 further:

eladalon1983 commented 1 year ago

Self-censorhip is already possible in the current spec once the capturer establishes a "connection".

The current spec does NOT force an establishment of a connection. My proposal shows a simple mechanism for publishing actionable information without requiring a connection. That proposal was designed precisely to address the concerns you have previously raised, which were (1) self-censorship and (2) usefulness to loosely-coupled apps.

I don't think we should solve more features for loosely-coupled applications atm.

Could you please explain this 180-degrees change in your position?

jan-ivar commented 1 year ago

First, thank you for attempting to address concerns you thought I might have. I appreciate that, but it's not that simple.

It goes back to use cases. I found users having nextslide and previousslide on a loosely-coupled captured page compelling. But I'm fairly confident those may also end up becoming keyboard hardware hotkeys someday, which is a high bar.

But those also opened up a long tail of ideas that didn't make it ("firstslide", "lastslide", https://github.com/w3c/mediasession/issues/282) and maybe never should, because there's a risk and a cost to baking into the web platform forever what we think apps want today.

I don't find loosely-coupled pages advertising crop regions compelling enough to bake into the web platform now. I don't find loosely-coupled pages advertising content hints compelling enough to bake into the web platform now.

It's not clear to me we're ready to say how this should work forever. App-controls also encroach on what's a fully user-controlled feature today, so there are risks they may sometimes undermine user choice if we're not careful.

Since this spec already enables apps to experiment in this space, I'd rather wait an see how they solve this.

we seem overdue for the third option IMHO... a rudimentary messaging API.

Do we agree on this?

I remain curious about an answer to this question.

eladalon1983 commented 1 year ago

I don't find loosely-coupled pages advertising crop regions compelling enough to bake into the web platform now. I don't find loosely-coupled pages advertising content hints compelling enough to bake into the web platform now.

I might have been premature in moving Capture Handle from the WICG to the WebRTC Working Group. It's unclear to me that we are aligned on the vision here. I've also not heard of any concrete plans from Firefox or Safari to implement the spec, so the timelines we have in mind are going to be different. Shall we rectify this mistake by moving the spec back to the WICG?

we seem overdue for the third option IMHO... a rudimentary messaging API.

Do we agree on this?

I remain curious about an answer to this question.

If you mean Capture Actions - I have been supportive multiple times. But this is an orthogonal use-case. If you mean adding a MessagePort - I have proposed that in the very first message in the current thread.

jan-ivar commented 1 year ago

I mean adding a MessagePort (or rather a messaging scheme https://github.com/w3c/mediacapture-handle/issues/68#issuecomment-1284321553). It seems to me we can make progress there, which should give collaborating apps everything they need to experiment with handling these use cases.

eladalon1983 commented 1 year ago

We are not aligned on any topic, and no browser other than Chrome has shared concrete plans to implement Capture Handle. I don't think anyone benefits of Capture Handle remaining in the WebRTC WG. The current setup is counter-productive. Let's move this back to the WICG.

eladalon1983 commented 1 year ago

I don't think we should solve more features for loosely-coupled applications atm.

Could you please explain this 180-degrees change in your position?

It occurs to me that #69 attempts to deprecate Capture Handle Identity altogether, but only works for tightly-coupled apps. (An unsuccessful attempt imho, e.g. because of Conditional Focus' window of opportunity.)

jan-ivar commented 1 year ago

That is a misunderstanding.

jan-ivar commented 1 year ago

I sense generally good agreement on the use case here, with the only differences being over API shape. This seems like something we should be able to make progress on.

My concerns that I don't believe have been addressed here are:

  1. Calling setCaptureHandleConfig rapidly doesn't seem web compatible wrt what a capturer sees. From its algorithm, it looks like messages can get missed due to timing. postMessage solves this using a queue.
  2. There's no surfacing of deserialization errors, which seems necessary to handle arbitrary objects (e.g. FileReaderSync and several other objects cannot be instantiated in Window). postMessage has onmessageerror for this.

While (1) might be addressed by adding the message to the event, this would seem to go against § 7.7. Use plain Events for state, perhaps suggesting that recreating (post) message-passing behavior using events is not a desirable pattern.

eladalon1983 commented 1 year ago

I sense generally good agreement on the use case here, with the only differences being over API shape.

I think there are multiple use cases here, and we only have agreement on some. It may be that the use-cases where we disagree are appear uninteresting to you, but they're important to me, so I disagree with the characterization of "general agreement".

To clarify, we disagree on:

I intend to get back to this topic (future reshuffles of duties and priorities notwithstanding). When that happens, if you still find these uninteresting, to the point of objecting to tackling the problem, then I don't see a way forward for me other than to fork the document and proceed with other interested parties. If you can think of a nicer way forward, please do share it.

eladalon1983 commented 1 year ago

From its algorithm, it looks like messages can get missed due to timing.

How are the events insufficient for ensuring that nothing is missed?

There's no surfacing of deserialization errors,

this would seem to go against § 7.7. Use plain Events for state,

As mentioned, I don't see any problems with the current approach; in fact, you have yourself convinced me to stop exposing the message on the event, because we both agreed it was unnecessary - see here.

But assume for the sake of argument that adding the message back were necessary. The first words in § 7.7. Use plain Events are: "Where possible".