w3c / mediacapture-screen-share-extensions


Auto-pause capture when user switches captured content #4

Closed eladalon1983 closed 1 month ago

eladalon1983 commented 1 year ago

Both Chrome and Safari now allow the user to change what surface is captured.


That's obviously great stuff. Can we make it better still?

So I propose that we add two things:

  1. A control which lets an application instruct the browser: whenever the user changes what surface is shared, pause the track (by setting enabled to false).
  2. Fire an event whenever this happens. (With the general expectation that the application will apply new processing and then set enabled back to true.)

Possibly the two can be combined, by specifying that setting an event handler signals that the pausing behavior is desired (@alvestrand's idea).
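A minimal sketch of how these two pieces might look to an application. None of the names below are spec'd; autoPauseOnSurfaceSwitch and onsurfaceswitch are illustrations of the proposal only.

```javascript
// Hypothetical shape of the proposal; none of these names are spec'd.
// The UA pauses the track on a user-initiated switch, fires the event,
// and the app resumes once its new processing is in place.
function makeSwitchHandler(track, reconfigure) {
  return async () => {
    await reconfigure(track); // apply new processing while paused
    track.enabled = true;     // item 2: app resumes frame delivery
  };
}

// Usage sketch (browser-only, assumed API surface):
//   track.autoPauseOnSurfaceSwitch = true;            // item 1: opt in
//   track.onsurfaceswitch = makeSwitchHandler(track, reapplyCrop);
```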

Another natural extension of this idea is to also apply it when a captured tab undergoes cross-origin navigation of the top-level document. When that happens, some applications might wish to stop capture momentarily and (in-content) prompt the user - "do you still want to keep sharing?"

Relevant previous discussion here.

eladalon1983 commented 1 year ago

(Modulo that setting enabled to false does not stop frames being delivered, it delivers black frames instead. So possibly we need another pausing mechanism. But hopefully the general intention is clear, and we can discuss details after we align on the general vision.)

youennf commented 1 year ago

One question here is whether pausing happens at the track level or at the source level. Pausing at the source level makes more sense to me, since it is more similar to track.muted than to track.enabled.

eladalon1983 commented 1 year ago

Thank you for migrating the discussion here.

Repeating my last comment on the topic:

Is there a reason, other than implementation complexity[*], for auto-pause to happen on sources rather than tracks?

[*] Normally a valid concern, but here I suspect the complexity is quite low, as pausing could be implemented by dropping frames where applicable. Later, an optimization could be added to stop posting frames from the relevant source when all associated tracks are paused. Or wdyt?
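A sketch of what per-track pausing by frame-dropping could look like, assuming frames are routed through a transform (MediaStreamTrackProcessor-style plumbing is assumed; the shared pausing flag is illustrative):

```javascript
// Illustrative only: pause a single track by dropping its frames while a
// shared flag is set. Frames are assumed to have a close() method, as
// VideoFrame does. Each cloned track gets its own transform and flag,
// so clones can pause independently despite sharing a source.
function makePausableTransform(state) {
  return new TransformStream({
    transform(frame, controller) {
      if (state.paused) frame.close();  // drop (and release) while paused
      else controller.enqueue(frame);   // pass through otherwise
    },
  });
}
```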

I think the comparison to muted and enabled is a response to this. But I am not sure how it applies. From a Web developer's point of view, the ability to interact with tracks as though they were independent sounds useful. Imagine you wish to process one track but not the other; maybe you're applying background blur to a track posted remotely, but showing an unmodified local preview. Why would the latter track need to be auto-paused and unpaused?

jan-ivar commented 1 year ago

Both Chrome and Safari now allow the user to change what surface is captured.

I applaud user-agent-centered features like this! — I like how they tend to keep the end-user in charge.

But before exposing APIs for apps to potentially wrest control over such user-initiated changes, I'd like to understand the concrete problems that require app intervention, since gating output on the app seems directly counter to the user's wishes to inject different input.

What needs cannot be met without pausing?

  • What if an application intends to do some processing that depends on the captured content?

What processing would this be? What app would e.g. blur screen capture?

  • What if an application wants to set different constraints, for instance when capturing a window vs. when capturing a screen?

What constraints would this be? Since a window may be resized by the end-user, and be larger or smaller than a monitor, an app that only reacts to such changes upon source-switching would seem to have the wrong idea.

  • What if the application intends to save different surfaces to different files, and wants to start appending to a new file whenever the user changes the source?

That application can add buttons for the user to do exactly that.

What if the user intends to save different surfaces to the same file, using this user agent feature?

eladalon1983 commented 1 year ago

counter to the user's wishes to inject different input

The user's wish will NOT be subverted; this control will just give the app time to adjust to the change that the user sprung on it. For example, if the user has interacted with the app and asked for the captured media to be written to a file, then when the user clicks share-this-tab-instead, auto-pausing will give the app time to (1) close the old file, (2) open a new file, and (3) start piping new frames to the new file.
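The file-rotation flow described above might reduce to something like this under the proposed auto-pause. The switch-notification hook is assumed; MediaRecorder is the real API, abstracted here behind a startRecorder helper:

```javascript
// Illustrative rotation of recordings on a surface switch.
// 'startRecorder' is an assumed app helper that creates a MediaRecorder
// piping to a fresh file.
function rotateRecording(state, stream, startRecorder) {
  if (state.recorder) state.recorder.stop(); // (1) close the old file
  state.recorder = startRecorder(stream);    // (2) open a new file
  return state.recorder;                     // (3) new frames land there
}
```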

What needs cannot be met without pausing?

Please read my earlier messages, as well as the previous paragraph of this message.

What processing would this be?

Cropping using Region Capture. If the user starts sharing a new tab, you might want to pause the capture until you can communicate with the other tab, obtain a CropTarget and apply it.
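Concretely, the re-cropping flow on a switch might look like this. CropTarget and cropTo are the real Region Capture APIs; the postMessage handshake with the newly captured tab is assumed app logic:

```javascript
// Ask the newly captured tab for a CropTarget, then re-apply cropping.
// Keeping the track paused until cropTo() resolves avoids leaking
// un-cropped frames of the new surface.
async function recropAfterSwitch(track, capturedTabPort) {
  const cropTarget = await new Promise(resolve => {
    capturedTabPort.onmessage = ({ data }) => resolve(data.cropTarget);
    capturedTabPort.postMessage({ type: 'request-crop-target' });
  });
  await track.cropTo(cropTarget);
}
```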

What constraints would this be?

Meet uses a different frame-rate when capturing tabs, windows and screens.
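A per-surface frame-rate policy could be applied via applyConstraints(). The specific numbers below are illustrative, not what Meet actually uses:

```javascript
// Illustrative mapping from surface type to capture frame-rate.
function frameRateFor(displaySurface) {
  switch (displaySurface) {
    case 'monitor': return 15; // full screens are expensive to encode
    case 'window':  return 24;
    case 'browser': return 30; // tab capture is comparatively cheap
    default:        return 30;
  }
}

// On a surface switch, re-read settings and re-apply constraints.
async function retuneFrameRate(track) {
  const { displaySurface } = track.getSettings();
  await track.applyConstraints({ frameRate: frameRateFor(displaySurface) });
}
```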

What if the application intends to save different surfaces to different files, and wants to start appending to a new file whenever the user changes the source?

That application can add buttons for the user to do exactly that.

Why burden the user?

What if the user intends to save different surfaces to the same file, using this user agent feature?

Then that user will use an app that saves everything to the same file. I gave you examples of how this API could be used. Nowhere did I say that we'll be scouring the Internet and blocklisting any app that does not comport itself to my example.

eladalon1983 commented 1 year ago

Jan-Ivar, have you perhaps missed that this new API is opt-in, and that the default behavior will remain to NOT auto-pause?

eladalon1983 commented 1 year ago

I've rewatched the WebRTC WG January interim, Jan-Ivar, and I see you debated the shape with me. Now you seem to have forgotten the use-cases and are confused about the proposal. What has changed? Could you try rewatching my presentation there?

youennf commented 1 year ago

Following on today's call, here is some related feedback:

  1. It would be good to discuss the pros and cons of putting the API at the source or the track level. Having it at the source level is in general safer for the web app and simpler for the UA. Identifying use cases that benefit from cloned tracks, with one paused and the other not, would be useful.
  2. The proposed API shape could probably be reduced. For instance, events might not be well suited since they require an API to disable/enable firing them given their cost.

Assuming we want it at the track level, below API could fit the bill:

callback DisplaySurfaceChangeCallback = undefined(DisplaySurfaceChangeHandler handler);
interface DisplaySurfaceChangeHandler {
    // Call this method to suspend video frames for longer than the current callback task.
    undefined waitUntil(Promise<any> f);
};

partial interface MediaStreamTrack {
    // Callback should be executed after the corresponding onconfigurationchange event handlers.
    undefined setDisplaySurfaceChangeCallback(DisplaySurfaceChangeCallback? callback);
};

Additional note, we could even reduce the API shape with a callback of the form

callback DisplaySurfaceChangeCallback = Promise<any>();
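A usage sketch of the proposed shape above (nothing here is shipped; waitUntil would keep frames suspended until the app's async reconfiguration settles, and 'reconfigure' is an assumed app-specific function):

```javascript
// How an app might use the proposed setDisplaySurfaceChangeCallback.
function makeChangeCallback(track, reconfigure) {
  return handler => {
    // Suspend frame delivery beyond the current task until we're ready.
    handler.waitUntil(reconfigure(track));
  };
}

// Hypothetical registration:
//   track.setDisplaySurfaceChangeCallback(makeChangeCallback(track, reapplyCrop));
```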
jan-ivar commented 1 year ago

The use case as presented would seem solved by a simple event fired before a page is navigated in a captured tab. Why can't a user agent do that? Then pausing isn't necessary, and the API is super simple.

If we accept pausing is necessary, my preference would be to:

  1. Reuse the track mute and unmute UA events for this (relax the "black frame" requirement for video by some ms)
  2. Add some means for JS to distinguish the cause of the UA mute
  3. Add track.unmute() and align it with https://github.com/w3c/mediacapture-extensions/issues/39
  4. Put the opt-in on CaptureController.

This would accomplish several things:

youennf commented 1 year ago
  1. Reuse the track mute and unmute UA events for this

That makes sense if we go with a source level API, let's discuss this important design point first.


dontcallmedom-bot commented 1 year ago

This issue was mentioned in WEBRTCWG-2023-06-27 (Page 16)

eladalon1983 commented 1 year ago

As mentioned during the June meeting of the Screen Capture CG, the issue presented in this thread might be tackled along with a different one, using a mechanism other than auto-pause.

Other issue:

  1. Application calls getDisplayMedia() and obtains a MediaStream containing a video MediaStreamTrack vid.
  2. Application checks vid.getSettings().displaySurface and detects that the user chose to share a specific surface type (say 'monitor').
  3. User changes the capture to some other surface type (say 'window').
  4. Application attempts to call some method on vid, that behaves differently on the first surface type compared to the second. (Say throws an error.)

Reasonable applications deployed today are likely to perform the surface-type check only once, immediately after gDM resolves, and set cascading state based on it (modifying user-facing controls that can later trigger step 4). Such applications would not have been written to check again before step 4, because dynamic switching across surface types did not yet exist.
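The stale-state pitfall can be sketched as follows. getSettings() and displaySurface are real API surface; the monitor-only feature gate is illustrative:

```javascript
// Step 2's check goes stale once the user can switch surface types.
// Re-reading getSettings() at the point of use (step 4) avoids acting
// on an outdated surface type; caching the result at capture time is
// exactly the bug described above.
function canUseMonitorOnlyFeature(track) {
  return track.getSettings().displaySurface === 'monitor';
}
```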

The backwards-compatible solution I presented during the Screen Capture CG meeting is as follows:

const controller = new CaptureController();
controller.enableDynamicSwitching('switch', (event) => {
  const videoElement = document.getElementById('myVideoElement');
  videoElement.srcObject = event.mediaStream;
  // And/or outgoing peer connection; imagine whatever you find relevant.
});
const stream = await navigator.mediaDevices.getDisplayMedia({ controller });
...

The spec would then say that the user agent MAY constrain dynamic-switching options based on whether enableDynamicSwitching() was called (before gDM was called), thereby ensuring that cross-type surface-switching only happens to apps that can handle it gracefully.

Note how the solution provides auto-pause "for free" by virtue of creating a new stream (and new tracks).

eladalon1983 commented 1 year ago

During TPAC, in the joint SCCG / WebRTC WG meeting, I believe we started forming consensus on this topic (slides). Any more thoughts on the general API shape I presented, or shall we send a PR soon and iterate on specifics there? Tagging interested parties: @jan-ivar, @youennf

henbos commented 1 year ago

https://docs.google.com/presentation/d/1i0tZ1rRFh4Ibn3KxfEpHzEw6ixNKCgWjn1WgSEv01Dw/edit#slide=id.g20f65cf48b6_0_0

jan-ivar commented 1 year ago

I have some API nits (and agree we shouldn't overload addEventListener). I'd prefer:

const controller = new CaptureController();
controller.addEventListener("switch", ({stream}) => video.srcObject = stream);
controller.manualSwitching();
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});

This is modeled on messagePort.start(), which is "only needed when using EventTarget.addEventListener; it is implied when using onmessage", which makes this equivalent:

const controller = new CaptureController();
controller.onswitch = ({stream}) => video.srcObject = stream;
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});

I.e. manualSwitching() is implicit with onswitch.

While I'm coming around on the API, I don't like the idea of limiting this existing and useful end-user feature only to sites that opt-in. That seems like a loss for users. Hence the name manualSwitching() instead of enableDynamicSwitching().

I'd prefer if this new API came with language to encourage browsers to not gate the feature only on apps that opt-in to the event model.

I have reservations about abandoning the old injection model which had some unique advantages:

  1. Works with all websites without needing their support (user benefits)
  2. No need to replaceTrack downstream peer-connection sinks
  3. No risk of ending a MediaRecorder sink recording prematurely
  4. Difficult for apps to deny user switching (user benefits)

So I'm not convinced the event model is always better just because it wins solving the most advanced case (the tightly integrated crop-enabled VC+SLIDE app). For VC apps that DON'T have integrated SLIDE apps, it's not clear why the event API is needed (apart from audio which I propose a solution for below).

When it comes to audio (which I see as the lone flaw in the injection model for most VC sites), the following might work well enough in the injection model:

const stream = await navigator.mediaDevices.getDisplayMedia({audio: true});
video.srcObject = stream;
stream.onaddtrack = ({track}) => console.log(`audio added`);

The UA could subsequently mute the audio track as needed rather than remove it. Thoughts?

beaufortfrancois commented 1 year ago

Note that the 2 other occurrences I've found of doing something more in the setter onfoo are:

For the record, I'm not a huge fan of this: developers may read controller.onswitch = ... and think they can or should use controller.addEventListener('switch', ...), since they've been educated not to use onfoo anymore, and eventually find that their code doesn't work, without understanding why at first glance.

eladalon1983 commented 1 year ago

@jan-ivar, I think we have general alignment on the solution and are now discussing the specifics of the shape. That works for me. In the interest of clarity, I am splitting this response in two. This comment covers where we agree; the next comment covers an open issue for discussion.

Hence the name manualSwitching() instead of enableDynamicSwitching().

That name change works for me. Almost. I think enable... would make it clearer that it's a setter. So enableManualSwitching().

I'd prefer if this new API came with language to encourage browsers to not gate the feature only on apps that opt-in to the event model.

Works for me.

I have reservations about abandoning the old injection model which had some unique advantages:

  1. Works with all websites without needing their support (user benefits)
  2. No need to replaceTrack downstream peer-connection sinks
  3. No risk of ending a MediaRecorder sink recording prematurely

We are not currently planning to abandon the old model in Chrome's implementation. Thank you for documenting some of the reasons to retain it.

  1. Difficult for apps to deny user switching (user benefits)

In the interest of transparency, I'd like to remind you of surfaceSwitching. But please note that it's an orthogonal discussion. We are neither introducing surfaceSwitching nor deprecating it. The semantics of the new event listener would be that it has an effect if the user chooses to switch surfaces.

The UA could subsequently mute the audio track as needed rather than remove it. Thoughts?

I think that we have multiple challenges and a single API that solves them all, so I don't think we need an additional API that only solves a subset of the challenges (audio).

eladalon1983 commented 1 year ago

... it is implied when using onmessage"_, which makes this equivalent:

const controller = new CaptureController();
controller.onswitch = ({stream}) => video.srcObject = stream;
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});

I.e. manualSwitching() is implicit with onswitch.

This seems less than ideal to me. I think it's a footgun for developers when onevent and addEventListener behave differently. (See also François's comment.) But before we do a deep dive on this particular detail...

... I wonder if we should eschew both onswitch and addEventListener to begin with, because allowing the registration of multiple event listeners seems like a footgun in itself.

It seems better to shape the API as enableManualSwitching(handler). This ensures exactly one handler, neatly solving all of these issues.
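Illustrative semantics of enableManualSwitching(handler) as a single-handler setter. The name and behavior are the proposal under discussion, not a shipped API:

```javascript
// A toy model of the proposed single-handler registration: setting a new
// handler replaces the previous one, so at most one runs per switch.
class ControllerSketch {
  enableManualSwitching(handler) {
    this._onSwitch = handler; // exactly one handler at a time
  }
  // UA side: invoked when the user picks a new surface.
  _userSwitched(stream) {
    if (this._onSwitch) this._onSwitch(stream);
  }
}
```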

youennf commented 1 year ago
  1. Works with all websites without needing their support (user benefits)

Agreed, new tracks require extra work for the web developer, so from this point of view it is appealing to keep using the current track. This would require an additional API to allow delaying the generation of frames, typically as part of the onswitch event.

As for the source mutating, a configurationchange event may be good enough to tell the web application that, for instance, cropTo might no longer work. This would require deprecating BrowserCaptureMediaStreamTrack as well.

The new-track approach's main benefit is that it is future-proof however the model evolves. Another example that comes to mind is switching from one window to several windows (hence several video tracks, and potentially several audio tracks?). In that case, applications will need to deal with new video tracks anyway if they want to leverage the user's selection.

stream.onaddtrack

I am not a big fan, given that many websites tend to recreate MediaStreams on the fly or clone them. In the future, we might want to transfer MediaStream as well. The application would also need to hold on to both the CaptureController and the MediaStream to be notified of what is happening. Since we already have CaptureController, I tend to prefer having a single event telling the application what happened all at once.

eladalon1983 commented 1 year ago

Youenn, could you please clarify which approach you are favoring here? (Note that I am not proposing abandoning the old model. So any application that has a reason to stick to it, can still do so.)

jan-ivar commented 1 year ago

(answering multiple people)

I think enable... would make it clearer that it's a setter. So enableManualSwitching().

👍

... developers may read controller.onswitch = ... and think they can/should use controller.addEventListener('switch', ... ... and realize eventually their code doesn't work without understand why at first glance.

Yes, this literally happened to me with web workers! So I sympathize, but I did figure it out, and there IS precedent here, which may help devs... This might come down to one's view of onfoo vs addEventListener("foo", ) ...

as they've been educated to not use onfoo anymore

Where is this advice? I recall it years back, but worry it may be outdated. See § 7.4. Always add event handler attributes.

IMHO onfoo is fine in top-level app code ("top-level" = the level owning the target). E.g. with

const video = document.createElement('video');
video.srcObject = stream;
await new Promise(r => video.onloadedmetadata = r);

...there's no problem since video is not available outside this async function yet. Just avoid hiding it inside sub-functions and helper functions.

In this issue, I'm even hearing an argument for a single callback which (I don't agree with but) I suspect means controller.onswitch = f would likely satisfy 99% of cases.

... allowing the registration of multiple event listeners seems like a footgun in itself.

I think it's generally agreed that not allowing multiple listeners is the footgun. E.g. two areas (functions) of an app both set onfoo or setMy(callback), causing one to remove the other: a hard-to-debug action at a distance.

Lots of precedent says multiple listeners are OK. See §7.3. Don’t invent your own event listener-like infrastructure.

stream.onaddtrack

I am not a big fan, given many websites do tend to recreate MediaStreams on the fly or clone them. In the future, we might want to transfer MediaStream as well.

Sure, but the same is true of the MediaStream(s) managed by RTCRtpReceiver. But I find it interesting that the video.srcObject sink handles it seamlessly in all browsers. But I agree real-world sinks would likely need to take some action, e.g. with the track-based RTCPeerConnection:

stream.onaddtrack = ({track}) => pc.addTrack(track);

The application would also need to stick to both CaptureController and the MediaStream for being notified of what is happening.

This API solves the audio problem in the injection case, which doesn't require CaptureController. The question here probably comes down to whether we want to solve the audio injection case or not...

jan-ivar commented 1 year ago
  1. Difficult for apps to deny user switching (user benefits)

In the interest of transparency, I'd like to remind you of surfaceSwitching.

Yes, but that merely "signals whether the application would like the user agent to offer the user an option to dynamically switch". I.e. UAs are free to ignore it.

youennf commented 1 year ago

Topic was discussed yesterday and it seems there is some convergence. AIUI, the proposal is:

@eladalon1983, @jan-ivar, can you clarify whether this is a correct summary?

If this is correct, we should consider removing step 4 (closing the old tracks). The web application could decide to stick with the old tracks and stop the new tracks instead, if that is deemed more appropriate.

This would allow easy migration in the simple cases (just do nothing for video, check for new audio). We could have a simple CaptureController onswitch event, given it would not have any side effect. We could in the future extend the event with a waitUntil API if we think the web app needs more time to decide what to do before the UA starts generating media.

eladalon1983 commented 1 year ago

Thanks, Youenn; I believe this summary is mostly accurate. [*]

we should consider removing step 4 (closing the old tracks). The web application could decide to stick with the old tracks and stop the new tracks instead if it is thought more adequate.

This would allow easy migration in the simple cases (just do nothing for video, check for new audio).

I don't think this would work, because the old track would not get any new frames, even though it isn't ended. I'm also not sure what problem it's solving - if an existing application has already been updated to call enableManualSwitching(), and it already does work on the new audio track, why can't it also do work on the video track? (In contrast, for backwards compatibility with existing applications that don't update their code and never call enableManualSwitching(), it's preferable to ensure that these just keep working.)

We could have a simple CaptureController onswitch event, given it would not have any side effect.

Minor objection on my part here. I don't think I'd block on it, but I'd want a chance to argue against it, at a time that would not distract us away from the main discussion.


[*] With some nits. For example, I don't currently see why it's necessary to mandate that no new frames be emitted on the new track before the old one is ended. But I think that's a minor point that can be discussed later.

youennf commented 1 year ago

I don't think this would work, because the old track would not get any new frames, even though it isn't ended.

The point would be to delay sending frames to the old track until the event listeners are executed. This gives enough time for the web app to stop the old track or continue using it.

I see two main benefits:

  1. We could remove enableManualSwitching(). The old track would stay in the legacy mode, as if enableManualSwitching() had not been called. The new track would be created as if enableManualSwitching() had been called.
  2. This is more convenient to web developers as they might only have to deal with audio, video setup will stay the same.
eladalon1983 commented 1 year ago

I think we still want to go with a callback rather than an event, so as to avoid the issues with multiple tracks with the same source, and the interaction between multiple listeners.

With that, I don't really see how we could avoid enableManualSwitching(), or any method/attribute for setting the callback.

Once a callback is set - using whichever method or attribute - there is no way for the UA to determine whether the application wishes to receive a new audio track and retain the old video track. Not without additional API surfaces, at any rate, and those are past MVP in my opinion. We need to fire a new video track, and we need to stop delivering frames to the old one (or else they effectively share a source).

I think we have a clear and simple solution in:

This is future-proof:

youennf commented 1 year ago

we need to stop delivering frames to the old one (or else they effectively share a source).

I think this is the main remaining discussion point for me. Given that UAs need to support this as part of the old behaviour, I do not really see a problem with keeping it from a UA point of view. From an end-user point of view, both approaches are equal. Let's judge this from the point of view of web developers: is it helping, or is it confusing?

eladalon1983 commented 1 year ago

The difference is that in the legacy mode, you only have one track that receives frames. In the new model, if each switch keeps delivering frames to the old tracks too, you'd end up with an arbitrary number of "old" tracks still getting frames.

This complicates things for all constituencies.

jan-ivar commented 1 year ago

My concern remains that this API becomes a way for apps to intentionally or accidentally block a useful existing feature (source-switching through track injection) — I think the accidental part can be fixed however by adjusting the API shape.

I'd prefer an API that centers the old injection behavior as the norm. Something like this:

const controller = new CaptureController;
video.srcObject = await navigator.mediaDevices.getDisplayMedia({video: true, audio: true, controller});
controller.onsourceswitch = event => {
  video.srcObject = event.stream;
  event.preventDefault();
};

This is a cancelable event that notifies the webapp that the source of the capture is about to change.

  1. The default action is to switch all tracks from the old source to the new one, and stop the tracks in event.stream.
  2. The webapp can prevent this with preventDefault(), which instead stops all tracks from the old source.

This absolves the webapp from stopping tracks in the common cases, avoiding issues with multiple listeners.

The webapp can inspect the new tracks before making a decision. E.g.

This would allow easy migration in the simple cases (just do nothing for video, check for new audio).

controller.onsourceswitch = event => {
  if (video.srcObject.getTracks().length == event.stream.getTracks().length) return;
  video.srcObject = event.stream;
  event.preventDefault();
};
eladalon1983 commented 1 year ago

My concern remains that this API becomes a way for apps to intentionally or accidentally block a useful existing feature

As discussed during TPAC, this concern is already addressed. A user agent can offer dynamic switching unconditionally, with setSourceSwitchCallback() only controlling whether that switching takes the form of source-injection or the new mechanism. My presentation during last TPAC proposed this as an option.

I'd prefer an API that centers the old injection behavior as the norm.

That's already the case. Any app that does not call setSourceSwitchCallback() would indeed get source-injection; no changes are required in that vein. You suggest defaulting to source-injection even when the callback is set. What are the benefits of this added complexity?

(Note that this is a quote:)

controller.onsourceswitch = event => {
  if (video.srcObject.getTracks().length == event.stream.getTracks().length) return;
  video.srcObject = event.stream;
  event.preventDefault();
};

The following is even simpler:

controller.setSurfaceSwitchingCallback((stream) => {
  video.srcObject = stream;
});
jan-ivar commented 1 year ago

My concern is not addressed. A user agent can trivially limit source-switching (and its implementation burden) to the non-injection kind by limiting the feature to apps that have called setSourceSwitchCallback ahead of time.

It's harder to do that with controller.onsourceswitch, because a user agent cannot know ahead of time if the app will call event.preventDefault().

Conversely, apps have more freedom with controller.onsourceswitch in that they can get notified about switching (including injection) ahead of time without pre-committing to a decision.

Case in point:

(Note that this is a quote:)

controller.onsourceswitch = event => {
  if (video.srcObject.getTracks().length == event.stream.getTracks().length) return;
  video.srcObject = event.stream;
  event.preventDefault();
};

The following is even simpler:

controller.setSurfaceSwitchingCallback((stream) => {
  video.srcObject = stream;
});

These two code snippets do different things. The former illustrates what the latter cannot do (a deferred decision).

eladalon1983 commented 1 year ago

A user agent can trivially limit source-switching (and its implementation burden) to the non-injection kind by limiting the feature to apps that have called setSourceSwitchCallback ahead of time.

Typically, when we think user agents must not do something, we specify that they MUST NOT do it. I cannot think of a precedent where we chose a more complex API shape in order to tie the UA to the mast, so to speak. Can you?

Conversely, apps have more freedom with controller.onsourceswitch in that they can get notified about switching (including injection) ahead of time without pre-committing to a decision.

This flexibility is not free.

What benefits offset these costs? Can you give a concrete example of an application that needs to make the decision to source-inject or not on a dynamic basis?

jan-ivar commented 1 year ago
  • It adds a side-effect to the onsourceswitch setter, which is an anti-pattern.

What side-effect? Cancelable event is an established pattern, and this works just as well:

controller.addEventListener("sourceswitch", event => {
  video.srcObject = event.stream;
  event.preventDefault();
});
  • It requires more code from the app - even if the app doesn't want to use this feature. (Foreshadowing.)

Please explain.

  • It might be error-prone for the app, which might fail to call preventDefault() on some branches.

This seems to discount legitimate reasons for not calling it, e.g. an app just wanting to be notified of switching.

I think it's much simpler to explain that:

  1. There's a cancelable sourceswitch event that fires on source switching
  2. Apps can prevent the default switch and grab the fired stream instead

To compare, please explain in one sentence what setSurfaceSwitchingCallback does. It glues two things together, forcing extra work on an app just to be notified. I therefore find it the more complicated API to understand — e.g. someone trying to reverse-engineer its meaning from reading JS — as well as being more limited in what functionality it offers. It also violates § 7.3. Don’t invent your own event listener-like infrastructure.

Can you give a concrete example of an application that needs to make the decision to source-inject or not on a dynamic basis?

Can you give a concrete example of why an application must make this decision to source-inject or not upfront?

@youennf mentioned migration in https://github.com/w3c/mediacapture-screen-share-extensions/issues/4 "The web application could decide to stick with the old tracks and stop the new tracks instead if it is thought more adequate. ... This would allow easy migration in the simple cases (just do nothing for video, check for new audio)" which is the example I gave.

He also said: "We could have a simple CaptureController onswitch event, given it would not have any side effect."

My API accomplishes both of those.

youennf commented 1 year ago

The event API shape seems to hit a sweet spot to me. Having an event is more extensible (say waitUntil in case migration should be done asynchronously).

It might be a little surprising that calling preventDefault() actually stops all previous tracks, but I think that is fine. It seems better than, for instance, the possibility we discussed of adding a boolean option to setSurfaceSwitchingCallback.

We should dig into the actual proposal more to further evaluate Elad's concerns. For instance, the spec should state that the old tracks migrate (and thus only start receiving new frames) after the event was fired. This raises the question of when configurationchange fires, for instance. The migration logic can probably remain in the JS process, provided the switch notification happens before the sending of the new video frames.

tovepet commented 1 year ago

There are limitations on which types of surface switching a user agent can perform without help from the application, e.g., adding audio requires the application to act on the change for it to take effect.

Since the user agent is expecting the application to perform an action in this case, the API should be designed to reflect that. By giving the application a new MediaStream that it needs to use, there is less risk of an application writer forgetting, e.g., to check whether an audio track has been added.
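
The re-check pattern this implies can be sketched as follows. FakeStream is a stand-in for MediaStream (only getAudioTracks() mirrors the real API); nothing here is a shipped surface-switching API.

```javascript
// Sketch of the point above: handing the app a whole new stream pushes it
// to re-enumerate tracks on every switch, so a newly added audio track is
// hard to miss. FakeStream is a stand-in for MediaStream.
class FakeStream {
  constructor(tracks) { this.tracks = tracks; }
  getAudioTracks() { return this.tracks.filter((t) => t.kind === "audio"); }
}

// Runs on every switch, instead of trusting state cached at gDM time.
function hasAudio(stream) {
  return stream.getAudioTracks().length > 0;
}

const before = new FakeStream([{ kind: "video" }]);
const after = new FakeStream([{ kind: "video" }, { kind: "audio" }]);
console.log(hasAudio(before), hasAudio(after)); // → false true
```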

It is also important for the user agent to know in advance whether the application intends to act on a surface change, so that it can offer the appropriate options, e.g., it should not offer the option to add audio if the application is going to ignore it. Setting a callback can provide this information to the user agent.

I think there might be a dividing line in this discussion between whether we primarily view this functionality as a way for applications to perform the required setup when the MediaStream changes in a material way, or as a way of getting notified that a change has occurred. Perhaps we want both: a callback for the former purpose and an event for the latter?

eladalon1983 commented 1 year ago

I completely agree with @tovepet.

[@eladalon1983] It requires more code from the app - even if the app doesn't want to use this feature. (Foreshadowing.)

[@jan-ivar] Please explain.

See my code vs. yours in this comment.

[@eladalon1983] Can you give a concrete example of an application that needs to make the decision to source-inject or not on a dynamic basis?

[@jan-ivar] Can you give a concrete example of why an application must make this decision to source-inject or not upfront?

I don't think that's a reasonable way to respond to my question, or to represent my position.

You represent my position as "some applications must make this decision upfront." I have never said that, nor could I have; clearly, any decision an application could make upfront, it could still deliver to the user-agent at a later point in time with no great hardship.

Rather, I claimed the application is able to make the decision upfront - which greatly simplifies the spec, the UA-implementation and the app's own code.

So I reiterate my question - what do we lose by not going with this simpler model? What application needs the extra functionality?

[@youennf] We should dig into the actual proposal more to further evaluate Elad's concerns.

Agreed. Let's dig into what issue is solved by allowing applications to dynamically decide whether source-injection happens, rather than deciding ahead of time. We should only go with more complicated solutions if they serve an extended purpose.

[@youennf] For instance, the spec should state that the old tracks migrates (and thus only start receiving new frames) after the event was fired. This begs the question of when configurationchange is fired for instance.

With the non-injection model, we (1) stop delivering frames, (2) invoke the callback and (3) formally stop the old tracks. I don't think it's in scope to specify what the source-injection model does. I think that's a follow-up.
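
That three-step ordering can be sketched in a few lines. handleSwitch, the stub track, and the log strings are all illustrative stand-ins, not part of any proposal text.

```javascript
// Node sketch of the ordering described above for the non-injection model:
// (1) frames stop, (2) the app callback runs, (3) old tracks are formally
// stopped. Everything here is an illustrative stand-in.
const log = [];

function handleSwitch(oldTracks, callback, newStream) {
  log.push("frames-stopped");         // (1) UA stops delivering frames
  callback(newStream);                // (2) app's switch callback fires
  oldTracks.forEach((t) => t.stop()); // (3) old tracks formally stopped
}

const oldTrack = { stop: () => log.push("old-track-stopped") };
handleSwitch([oldTrack], () => log.push("callback-ran"), {});
console.log(log.join(",")); // → frames-stopped,callback-ran,old-track-stopped
```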

jan-ivar commented 1 year ago

Since the user agent is expecting the application to perform an action in this case, the API should be designed to reflect that.

A user agent cannot expect an application to perform an action. Rather, applications tell APIs what to do.

It is also important for the user agent to know in advance if the application intends to act ... Setting a callback, can provide this information to the user agent.

The user agent cannot know if the callback handles audio. The only way to know is through opt-in.

Having controller.setSurfaceSwitchingCallback = () => {}; enable this behavior violates §7.3. Don’t invent your own event listener-like infrastructure which says: "Create separate API controls to start ... the underlying process."

E.g. what happens if the callback is set after getDisplayMedia? Does it race, or does the UX update?

§7.3. compels us to consider a more declarative API if advance UA knowledge is important. E.g.:

await navigator.mediaDevices.getDisplayMedia({audio: true, surfaceAudioSwitching: "include"});

eladalon1983 commented 1 year ago

E.g. what happens if the callback is set after getDisplayMedia? Does it race, or does the UX update?

setCallback() throws if CaptureController.[[IsBound]]. This is desirable behavior that can be achieved with a callback, but would be surprising with an event listener.
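
The throw-if-bound behavior can be sketched as follows. [[IsBound]] is a spec-internal slot, modeled here as a private field; the class and method names follow this thread's proposal and are not a shipped API.

```javascript
// Sketch of the throw-if-[[IsBound]] behavior, with the internal slot
// modeled as a private field. Names follow this thread's proposal only.
class CaptureController {
  #isBound = false;
  #callback = null;

  bind() { this.#isBound = true; } // stand-in for getDisplayMedia() binding

  setSurfaceSwitchingCallback(cb) {
    if (this.#isBound) {
      throw new DOMException("Controller is already bound", "InvalidStateError");
    }
    this.#callback = cb;
  }
}

const controller = new CaptureController();
controller.setSurfaceSwitchingCallback(() => {}); // fine before binding
controller.bind();

let thrown = "";
try {
  controller.setSurfaceSwitchingCallback(() => {});
} catch (e) {
  thrown = e.name;
}
console.log(thrown); // → InvalidStateError
```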

jan-ivar commented 1 year ago

It might surprise users that they can't update the callback later, which is why §7.3. Don’t invent your own event listener-like infrastructure says to instead "use the existing event infrastructure to allow listening for the notifications".

eladalon1983 commented 1 year ago

When you say "users" here, I believe you mean "Web developers."

Please refer to §7.8. Use Events and Observers appropriately, which reads (emphasis mine):

In general, use EventTarget and notification Events, rather than an Observer pattern, unless an EventTarget can’t work well for your feature.

It also says:

If using events causes problems [...] consider using an Observer pattern instead.

The benefits of using a callback rather than an event listener have previously been listed on this thread as well as during discussions in the WG and in TPAC. To name just one - multiple event handlers would be problematic if the first stops the track and the second handler receives a stopped track. (There were more benefits.)

jan-ivar commented 1 year ago

There's no proposal to use Observers so §7.8 seems irrelevant. Even if it were relevant, it says to prefer events, since CaptureController is already an EventTarget.

... multiple event handlers would be problematic if the first stops the track and the second handler receives a stopped track.

Nothing wrong with that.

jan-ivar commented 1 year ago

To clarify from my proposal in https://github.com/w3c/mediacapture-screen-share-extensions/issues/4:

This absolves the webapp from stopping tracks in the common cases, avoiding issues with multiple listeners.

eladalon1983 commented 1 year ago

There's no proposal to use Observers so §7.8 seems irrelevant.

The relevance is that although §7.3 warns against reinventing events, §7.8 acknowledges that events aren't always the best option, and opens the door to judicious use of other patterns. I believe callbacks are such an established pattern.

... multiple event handlers would be problematic if the first stops the track and the second handler receives a stopped track.

Nothing wrong with that.

It seems like a footgun to me. Perhaps an even clearer footgun is constraints, where one listener could affect what another listener receives - unless we specify that these are separate tracks? What precisely would the relation be between the different tracks fired for different events-listeners? Clones? References to the same tracks? Independent tracks with the same source?
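
The first footgun is easy to demonstrate with standard EventTarget semantics. The "sourceswitch" name and e.track payload are assumptions from this thread; the track is a stub.

```javascript
// Node model of the multi-listener footgun: the first listener stops the
// track, so the second listener observes a track that has already ended.
// Event name and e.track payload follow this thread's proposal only.
class FakeTrack {
  readyState = "live";
  stop() { this.readyState = "ended"; }
}

const controller = new EventTarget();
const seenStates = [];
controller.addEventListener("sourceswitch", (e) => e.track.stop());
controller.addEventListener("sourceswitch", (e) => seenStates.push(e.track.readyState));

const event = new Event("sourceswitch");
event.track = new FakeTrack();
controller.dispatchEvent(event);
console.log(seenStates); // → [ 'ended' ]: the second listener was too late
```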

On another topic - I have asked a question multiple times and only got a non-answer. I think it is an important question. Namely - what application needs to decide dynamically whether it wants to source-inject or not? I think this is the primary question, and the question of events vs. callbacks is secondary. The answer to the primary question could end up influencing our reasoning on the secondary.

youennf commented 1 year ago

It seems like a footgun to me.

There are examples of this pattern and it does not seem to cause issues to web developers. For instance, RTCPeerConnection.ontrack, ServiceWorkerGlobalScope.onfetch, MediaStream.onaddtrack.

Following on those existing APIs, these would probably reference the same track objects (although we could expose a method if we really wanted separate tracks).

what application needs to decide dynamically whether it wants to source-inject or not?

From a web app point of view, being able to decide at point of use, and not at point of registration, is usually a good thing. Say the web application is broadcasting via WebRTC and optionally (via a user toggle) recording the local screen. The web application could for instance decide to use source-injection for WebRTC and new tracks for local recording. Or it could decide to use source-injection if switching from one screen to another, but would not do this if switching from/to a window.

A web developer could potentially optimise things a bit by declaring, via the passive listener option, that it will never call preventDefault(). Also, the event-based approach is better from an extensibility point of view. These are examples of the advantages of reusing the existing Event-based ecosystem.

These advantages should be compared against the spec/implementation complexity. I think the major complexity here is supporting both models, which we seem to agree on. The point in time at which the model is selected can add some complexity too, though probably a limited amount.

eladalon1983 commented 1 year ago

Thanks, @youennf, for presenting use cases. I note that these examples don’t demonstrate genuine need by the application for source-injection; these are hypothetical examples about why some applications might use it. These applications would be able to accomplish the same functionality with the new model with no great difficulty, and I’d argue that it would even produce simpler and more straightforward code. (Conditionally suppressing defaults is not as common a pattern as “old thing replaced by new thing - plug it into srcObjects and peer-connections where it's still useful.”)

If others are not worried about the footguns of interaction between different event-handlers, I am willing to let that be and go with events, provided that:
  • The decision to source-inject is finalized when calling gDM.
  • All handlers get references to the same new stream and track(s), rather than to clones or independent tracks.

eladalon1983 commented 1 year ago

Thinking about it again - if we use source-injection and someone registered an EventHandler for surface-switching, what streams/tracks does the event even reference? The original ones returned by gDM some seconds/minutes/hours ago? Those which might since have been cloned and stopped?

Using the callback-approach provided a good answer here - we simply wouldn't get notifications when source-injection happened. I think this was much better than trying to maintain support for the old injection model together with the new events.

youennf commented 1 year ago

what streams/tracks does the event even reference?

We cannot really reuse the same MediaStream, otherwise local preview might start playing audio for instance. I think it would be new MediaStream(s) / MediaStreamTrack(s).

  • The decision to source-inject is finalized when calling gDM.

That is the main discussion point and there are various approaches outside of callbacks/events. Is the following a fair summary?

  • All handlers get references to the same new stream and track(s), rather than to clones or independent tracks.

I think this is a given.

Additional question related to how switching would actually work. Say we have the following case:
  • gdm, only one video track V1
  • switch -> one video track V2 and one audio track A1.
  • switch -> one video track V3
  • switch -> one video track V4 and one audio track A2.

In that case, we are expecting V1 to get the content of V2, V3 and V4. I am unclear about A1 and A2. Are we expecting A1 to get migrated or not? What if A1 was exposed as part of the gdm call? If migrating, what happens to A1 when V3 is live? It might be best to reduce the scope to only supporting migration of the first video track generated by gdm.

eladalon1983 commented 1 year ago

Is the following a fair summary?

  • @eladalon1983 prefers early decision (simpler to implement, no real web dev need)
  • @jan-ivar prefers late decision (more convenient for web devs, prevent UA from not showing switch UI)

Not completely fair, because it was never claimed that Web developers "need" early-decision; it was claimed that it's "sufficient" for them while making things easier for everyone else.

But I am willing to lean in and actually claim it. Yes, developers need the early-decision, because cross-surface-type(!) source-injection is a footgun. Consider the code in this fiddle: https://jsfiddle.net/eladalon/Ly8a3wcs/

The "catch" here is that existing apps record app-state immediately after gDM resolves, and they never re-examine this state, because for several years now, this was good enough, since the surface-type could not change past that point.

Additional question related to how switching would actually work. Say we have the following case:

  • gdm, only one video track V1
  • switch -> one video track V2 and one audio track A1.
  • switch -> one video track V3
  • switch -> one video track V4 and one audio track A2.

In that case, we are expecting V1 to get the content of V2, V3 and V4. I am unclear about A1 and A2. Are we expecting A1 to get migrated or not? What if A1 was exposed as part of the gdm call? If migrating, what happens to A1 when V3 is live? It might be best to reduce the scope to only supporting migration of the first video track generated by gdm.

All of this would be trivially and predictably handled with a setSwitchCallback() approach that does not offer source-injection. (That is, source-injection would only be offered if setSwitchCallback() isn't called.)

tovepet commented 11 months ago

I have uploaded a PR for the switch-track model (using a callback and early decision): https://github.com/w3c/mediacapture-screen-share/pull/289

I also have a slot booked for the Dec 12 working group meeting to discuss this change and any remaining open questions from this thread.