Auto-pause capture when user switches captured content

eladalon1983 commented 1 year ago

Both Chrome and Safari now allow the user to change what surface is captured.

That's obviously great stuff. Can we make it better still?

What if an application intends to do some processing that depends on the captured content?
What if an application wants to set different constraints, for instance when capturing a window vs. when capturing a screen?
What if the application intends to save different surfaces to different files, and wants to start appending to a new file whenever the user changes the source?

So I propose that we add two things:

A control which allows an application to instruct the browser - whenever the user changes what surface is shared, pause the track (by setting enabled to false).
Fire an event whenever this happens. (With the general expectation that the application will apply new processing and then set enabled back to true.)

Possibly the two can be combined, by specifying that setting an event handler signals that the pausing behavior is desired (@alvestrand's idea).
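A minimal sketch of the pause-then-notify flow proposed above, using plain objects rather than a real capture. The names AutoPauseTrack, AutoPauseSession, autoPause and the "surfacechange" event are all hypothetical and exist only for illustration.

```javascript
// Hypothetical stand-ins for a MediaStreamTrack and the browser-side switch logic.
// None of these names come from a shipped API.
class AutoPauseTrack {
  constructor() {
    this.enabled = true;
    this.surface = "window";
  }
}

class AutoPauseSession extends EventTarget {
  constructor(track) {
    super();
    this.track = track;
    this.autoPause = false; // the proposed opt-in control
  }

  // Simulates the user picking a different surface in the browser UI.
  userSwitchesTo(surface) {
    this.track.surface = surface;
    if (this.autoPause) {
      this.track.enabled = false; // pause before the app sees the new surface
    }
    this.dispatchEvent(new Event("surfacechange"));
  }
}

const track = new AutoPauseTrack();
const session = new AutoPauseSession(track);
session.autoPause = true;

const log = [];
session.addEventListener("surfacechange", () => {
  log.push(`paused=${!track.enabled} surface=${track.surface}`);
  // The app would apply its new processing here, then resume.
  track.enabled = true;
});

session.userSwitchesTo("monitor");
```

In this model the handler always observes a paused track, so no frames from the new surface reach a sink before the app has reconfigured itself.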

Another natural extension of this idea is to also apply it when a captured tab undergoes cross-origin navigation of the top-level document. When that happens, some applications might wish to stop capture momentarily and (in-content) prompt the user - "do you still want to keep sharing?"

Relevant previous discussion here.

jan-ivar commented 12 months ago

To recap my view from today's editors' meeting, I see 3 things to decide on (with my preferred answers):

declarative opt-in (bikeshed getDisplayMedia({audio: true, appAssistedSurfaceSwitching: "include"}))
notification regardless (sourceswitch event)
app decision-point (late aka point-of-use through event.preventDefault())

eladalon1983 commented 12 months ago

app decision-point (late aka point-of-use through event.preventDefault())

I'd still like to see an example of an application that benefits from this possibility.

jan-ivar commented 11 months ago

I'd still like to see an example of an application that benefits from this possibility.

In today's meeting the early decision example shown was:

getDisplayMedia({appAssistedSurfaceSwitching: "include", …})
controller.onsourceswitch = event => {
  video.srcObject = event.stream;
};

But this will glitch in all browsers, even for same-type switching, because it reruns the media element load algorithm.

A late decision seems inherently needed to fix this glitch for the subset of same-type switching. E.g.

controller.onsourceswitch = event => {
  if (!compatibleTypes(video.srcObject, event.stream)) {
    event.preventDefault(); // Use switch-track model
    video.srcObject = event.stream;
  }
};

Glitching may similarly happen with other sinks, like MediaRecorder or WebAudio.
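The compatibleTypes() helper used above is never defined in this thread. One possible reading, offered only as an assumption, is that two streams are compatible when they carry the same track kinds and their video tracks report the same displaySurface setting. The mockTrack and mockStream helpers are hypothetical and expose only the members the sketch needs.

```javascript
// Hypothetical helper: the thread does not define compatibleTypes().
// Assumption: "same type" means same track kinds and same displaySurface.
function summarize(stream) {
  const tracks = stream.getTracks();
  const kinds = tracks.map(t => t.kind).sort().join(",");
  const video = tracks.find(t => t.kind === "video");
  const surface = video ? video.getSettings().displaySurface : undefined;
  return { kinds, surface };
}

function compatibleTypes(oldStream, newStream) {
  const a = summarize(oldStream);
  const b = summarize(newStream);
  return a.kinds === b.kinds && a.surface === b.surface;
}

// Minimal mocks exposing only kind, getSettings() and getTracks().
const mockTrack = (kind, displaySurface) => ({
  kind,
  getSettings: () => ({ displaySurface }),
});
const mockStream = (...tracks) => ({ getTracks: () => tracks });

const tabOnly = mockStream(mockTrack("video", "browser"));
const otherTab = mockStream(mockTrack("video", "browser"));
const windowOnly = mockStream(mockTrack("video", "window"));
const tabWithAudio = mockStream(mockTrack("video", "browser"), mockTrack("audio"));

const tabToTab = compatibleTypes(tabOnly, otherTab);
const tabToWindow = compatibleTypes(tabOnly, windowOnly);
const audioAdded = compatibleTypes(tabOnly, tabWithAudio);
```

Under this reading, a tab-to-tab switch would keep the injected track, while a switch to a window or to a surface with audio would take the preventDefault() path.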

eladalon1983 commented 11 months ago

I don't fully understand what is being asserted here. A clarification would be welcome.

I also note this interesting bit: (Emphasis mine.)

A late decision seems inherently needed to fix this glitch for the subset of same-type switching.

Does that mean you support dropping the late-decision requirement for non-same-type switching?

dontcallmedom-bot commented 11 months ago

This issue was discussed in WebRTC December 12 2023 meeting – 12 December 2023 (Dynamic Switching in Captured Surfaces)

eladalon1983 commented 9 months ago

But I am willing to lean in and actually claim it. Yes, developers need the early-decision, because cross-surface-type(!) source-injection is a footgun. Consider the code in this fiddle: https://jsfiddle.net/eladalon/Ly8a3wcs/

I can now substantiate this claim in a more persuasive manner. Try out captured-surface-control.glitch.me using Chrome Beta/Canary. Observe:

If you choose a window, the application stops the capture and bids you try again.
If you choose a tab, the application "activates" and there is a built-in assumption that you keep on sharing a window.

Applications built before cross-surface-type source-switching was possible had no reason to expect that getSettings().displaySurface might be mutable, and they are not robust to these changes.

-- Note: Of course Captured Surface Control is not a standard API. Assume for the sake of argument that it never will be. The whole point here is to show that in the future, we could credibly add APIs that would work for one surface type but not for others, and that unexpected switching would break apps.

jan-ivar commented 9 months ago

I don't fully understand what is being asserted here. A clarification would be welcome.

It's asserting that injection and its alternative have different side-effects, and which ones an app prefers might differ based on what surface the end-user chose to switch to/from (e.g. whether both or neither have audio). E.g.

If the sink is MediaRecorder, injection allows continuing to record to the same (e.g. video only) file, but it remains an app decision whether to do so (perhaps denoting a chapter in some metadata), or to sometimes or always switch recording to a new (e.g. audio and video) file
If the sink is RTCRtpSender, injection allows immediate switching ~with replaceTrack~ without waiting for a renegotiation round-trip, but it remains an app decision whether to take advantage of this and how to handle edge-cases (add audio)

A late decision seems inherently needed to fix this glitch for the subset of same-type switching.

Does that mean you support dropping the late-decision requirement for non-same-type switching?

I've seen no proposal for how an app might specify its preferences for the different surfaces a user might pick up-front, but am happy to compare complexity of anything presented.

Applications built before cross-surface-type source-switching was possible had no reason to expect that getSettings().displaySurface might be mutable, and they are not robust to these changes.

How are they not robust to these changes? Do you have an example that is not experimental?

eladalon1983 commented 8 months ago

It's asserting that injection and its alternative have different side-effects, and which ones an app prefers might differ based on what surface the end-user chose to switch to/from (e.g. whether both or neither have audio).

Thanks, now I understand.

Theoretically speaking - I agree completely. But do we have a concrete example of such an app? Are there any apps that decide whether to use MediaRecorder vs. RTCRtpSender based on whether the user shared a window vs. a screen? I am not aware of such apps, and I'd actually be quite surprised if you could name such an app. All apps I know make the decision - when there even is a decision to be made - before invoking getDisplayMedia(). I think it's important that we solicit actual developer feedback and only introduce complexity that serves genuine needs.

I've seen no proposal for how an app might specify its preferences for the different surfaces a user might pick up-front, but am happy to compare complexity of anything presented.

I don't think that's relevant. I believe the previous paragraph of my present comment explains why.

How are they not robust to these changes? Do you have an example that is not experimental?

Meet displays a scrim over the local preview of shared windows/screens, but not over the local preview of shared tabs.
I believe the example of the experimental API was compelling and deserves our attention.

jan-ivar commented 8 months ago

Are there any apps that decide whether to use MediaRecorder vs. RTCRtpSender based on whether the user shared a window vs. a screen?

I think there's a misunderstanding. I gave two examples of apps that may need late decision on injection vs new tracks:

a MediaRecorder example (same file vs new file)
an RTCRtpSender example (immediate vs wait for renegotiation round-trip)

I was NOT suggesting a single app might choose between a MediaRecorder or an RTCRtpSender sink. I would indeed struggle to find a concrete example of that. 😉

jan-ivar commented 8 months ago

... only introduce complexity that serves genuine needs

By complexity do you mean functionality? The best API matches the complexity of the functionality exposed. We can observe the natural complexity here by separating concerns:

apps want to learn when the user switches source → they register for the sourceswitch event
UAs may wish to hold back UX options that might not work → they look for explicit app opt-in through getDisplayMedia({appAssistedSurfaceSwitching: "include", ...})
Downstream symptoms might dictate when injection vs. new tracks is preferable, which can differ based on what the user chose → event.preventDefault() = don't inject, I'll handle it

These are mostly orthogonal. I.e. we can imagine apps wanting 1 without 2 or 3, and the UAs' concern that apps own the user problem is nicely separated from the app's downstream needs, avoiding the fallacy that injection can't or won't work in many cases still.

This offers the most functionality to webpages, including already-shipped functionality (injection).
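The late decision point can be modeled with the standard cancelable-event pattern. This is a sketch under assumptions: SwitchController and simulateSwitch() are hypothetical stand-ins for the UA side, the "sourceswitch" event is only a proposal in this thread, and plain objects replace real streams.

```javascript
// Hypothetical model of the UA side; "sourceswitch" and its default action
// (injection) are proposals from this thread, not a shipped API.
class SwitchController extends EventTarget {
  // Returns which model the UA would apply for this switch.
  simulateSwitch(newStream) {
    const event = new Event("sourceswitch", { cancelable: true });
    event.stream = newStream;
    // dispatchEvent() returns false if a listener called preventDefault().
    const inject = this.dispatchEvent(event);
    return inject ? "inject" : "new-tracks";
  }
}

const controller = new SwitchController();
let current = { surface: "window" };

controller.addEventListener("sourceswitch", event => {
  if (event.stream.surface !== current.surface) {
    event.preventDefault(); // don't inject, the app takes the new stream
    current = event.stream;
  }
});

const sameType = controller.simulateSwitch({ surface: "window" });
const crossType = controller.simulateSwitch({ surface: "monitor" });
```

An app that only wants the notification registers a listener and never calls preventDefault(), which leaves injection untouched.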

Compare this to DisplaySurfaceChangeCallback which ties 1, 2, and 3 together. I.e.

apps that want to learn when the user switches source cannot do so without registering a global callback AND writing code to handle new tracks, potentially suffering downstream symptoms like glitches or separate recording files, even for sources that should have worked

Forcing apps to opt-out of all injection to opt-in to more UA switching no doubt simplifies UA code, by offering less functionality. But less functionality doesn't seem like a user win.

tovepet commented 7 months ago

The web developers I have talked to have all preferred the predictability of having a new track for each captured surface over the convenience of the injection model. I don’t think this should be relegated to a secondary use case with extra hoops to jump through.

So let’s see if we can find a way to make both the switch-track model and the injection model easy and straightforward to use, and also provide some more flexibility in how they are applied.

One option could be to provide both of these track-types in parallel:

Surface-tracks with life-time and functionality tied to a single captured surface
Session-tracks covering a whole capture-session, switching from one underlying surface-track to the next.

The API could look something like this:

controller.onnewsource = event => {
  video1.srcObject = event.stream; // surface tracks
};
const sessionStream = await getDisplayMedia({controller, /*opt-in*/, ...}); 
video2.srcObject = sessionStream;

where video1 would be using the switch-track model and video2 would be using the injection model. (The onnewsource event would be sent for all new surfaces including the initial one)

This API has the following benefits:

Surface-agnostic applications can use the injection-model without any extra work beyond opt-in.
Tracks tied to a single surface are provided for applications that are sensitive to surface-types without the need for calling preventDefault.
An application can choose to use the surface-tracks or session-tracks at event-time if needed.
An application can use injection for some tracks while switching out other tracks (this is not possible with the all-or-nothing preventDefault method).
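The two track types can be pictured with a small model. This is only a sketch: ModelTrack and ModelSession are hypothetical names, and the rule that a surface track ends on switch is an assumption taken from the "tied to a single captured surface" wording above.

```javascript
// Hypothetical model: a session track survives switches,
// while each surface track ends when the user switches away.
class ModelTrack {
  constructor(surface) {
    this.surface = surface;
    this.readyState = "live";
  }
}

class ModelSession {
  constructor(firstSurface, onnewsource) {
    this.sessionTrack = new ModelTrack(firstSurface);
    this.surfaceTrack = null;
    this.onnewsource = onnewsource;
    this.switchTo(firstSurface); // the event also fires for the initial surface
  }

  switchTo(surface) {
    if (this.surfaceTrack) this.surfaceTrack.readyState = "ended";
    this.surfaceTrack = new ModelTrack(surface);
    this.sessionTrack.surface = surface; // injection: same track, new source
    this.onnewsource(this.surfaceTrack);
  }
}

const seen = [];
const session = new ModelSession("window", track => seen.push(track));
session.switchTo("browser");
```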

What do you think? Could something like this better cover the different usages of the API that we have been considering?

jan-ivar commented 7 months ago

where video1 would be using the switch-track model and video2 would be using the injection model.

I like this idea of exposing both to the application and letting it use the one it prefers. It seems neutral and would let us measure over time whether apps find injection desirable, while remaining backwards compatible.

With preventDefault() I was hung up on the UA needing to stop one or the other right away, but if we don't need that then it simplifies.

My question would be what are the semantics now of calling video2.srcObject.getVideoTracks()[0].stop()? Would it also stop video1.srcObject.getVideoTracks()[0] or not (and vice versa)?

If no, the ~hardware light~ UA's privacy indicator UX might stay on for several seconds after a user clicks stop (until GC) in today's apps unaware of the newsource event.
If yes, we've created a new "either-or" track, where stopping one stops both, which could confuse apps.

Running with 2 for a bit, maybe we just fire ended on the other track and call it a special case?

youennf commented 7 months ago

Option 1 makes sense to me, UA will likely optimize the case of no event handler for newsource

tovepet commented 7 months ago

The way I conceptualize these options about whether stop should affect just one or both tracks is as follows:

Session tracks and surface tracks behave as clones with respect to each other, so stop would only affect the track on which it is called.
Session tracks are proxy tracks to the underlying surface tracks. Operations done on session tracks will also affect the corresponding underlying surface tracks and vice versa. Calling stop on one of the tracks would then affect both tracks.
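The difference between the two options shows up in what stop() does to the other track. A sketch with hypothetical names (PairTrack, makeClonePair, makeProxyPair), since neither track type exists in any spec:

```javascript
// Hypothetical model; neither "session" nor "surface" tracks are specified anywhere.
class PairTrack {
  constructor() { this.readyState = "live"; }
  stop() { this.readyState = "ended"; }
}

// Option 1: the two tracks behave like clones, so stop() is independent.
function makeClonePair() {
  return { session: new PairTrack(), surface: new PairTrack() };
}

// Option 2: the session track proxies the surface track,
// so stop() on either one ends both.
function makeProxyPair() {
  const session = new PairTrack();
  const surface = new PairTrack();
  const stopBoth = () => {
    session.readyState = "ended";
    surface.readyState = "ended";
  };
  session.stop = stopBoth;
  surface.stop = stopBoth;
  return { session, surface };
}

const clones = makeClonePair();
clones.session.stop();

const proxies = makeProxyPair();
proxies.session.stop();
```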

If we choose to treat them as clones (option 1), I think that rather than introducing a special case, it’s better to allow the application to choose which tracks to receive through the opt-in, e.g.:

surfaceSwitchingMethods: ["inject"] to only receive session-tracks.
surfaceSwitchingMethods: ["replace"] to only receive surface-tracks.
surfaceSwitchingMethods: ["inject", "replace"] to receive both types of tracks.

That would avoid creating the extra cloned track in the first place for applications that are only interested in either session tracks or surface tracks. It also does not add any extra burden on application writers since they would need to specify an opt-in anyway.

Option 1 makes sense to me, UA will likely optimize the case of no event handler for newsource

I don’t think this optimization would work in the other direction, i.e., for applications that are only interested in surface tracks.

jan-ivar commented 7 months ago

Note I inadvertently wrote "hardware light" among my concerns above, but of course this is screen-capture not camera/mic, so the only user-observable side-effect of an unstopped track would be the prolonged appearance of whatever privacy indicators the browser shows for a couple extra seconds until GC happens (e.g. after a user clicks stop).

Option 1 makes sense to me, UA will likely optimize the case of no event handler for newsource

I don’t think this optimization would work in the other direction, i.e., for applications that are only interested in surface tracks.

That seems fine, as this optimization would be there to solve today's apps unaware of the newsource event.

In contrast, apps uninterested in session tracks can simply stop them once they've received new surface tracks:

const sessionStream = await getDisplayMedia({controller, /*opt-in*/, ...});
video.srcObject = sessionStream;
controller.onnewsource = ({stream}) => {
  video.srcObject.getTracks().forEach(track => track.stop());
  video.srcObject = stream; // surface tracks
};

So there doesn't seem to be much need for new stop semantics, which seems nice.

tovepet commented 7 months ago

Having to manually stop tracks is just the type of gotcha that I think we should strive hard to avoid when possible. It’s way too easy for a developer to miss, leading to lingering privacy indicators disconcerting users.

In this case the cost to fix the issue is also next to zero for applications that do not need to use both the injection and switch-track model. (I expect this to be the vast majority of applications.)

Compare:

controller.onnewsource = ({stream}) => {
  video.srcObject = stream;
};
await getDisplayMedia({controller, surfaceSwitchingMethods: ["replace"], ...});

to

controller.onnewsource = ({stream}) => {
  video.srcObject = stream;
};
const sessionStream = await getDisplayMedia({controller, someOtherOptIn: "include", ...});
sessionStream.getTracks().forEach(track => track.stop());

The former is both less code and less error-prone than the latter.

jan-ivar commented 7 months ago

With the optimization @youennf proposed, forgetting stop() seems like an existing problem.

Having apps explicitly stop() tracks they're done with is the web model today, which makes its side-effects well-established, predictable, and pilot errors easy to diagnose and fix.

I'm not convinced introducing custom stopping-policies into the mix simplifies that responsibility.

controller.onnewsource = ({stream}) => {
  video.srcObject = stream;
};
const sessionStream = await getDisplayMedia({controller, someOtherOptIn: "include", ...});
sessionStream.getTracks().forEach(track => track.stop());

The former is both less code and less error-prone than the latter.

Ah, I missed earlier you said the event would fire for all new surfaces "including the initial one"! Having apps immediately stop tracks from getDisplayMedia() does look weird indeed.

I like the session vs surface behaviors, but why do web developers need to pick between two types of tracks? This seems to artificially put injection off the table on subsequent switches once non-injection is chosen just once, for no apparent or inherent reason.

I'd like to propose a more fluid model where web developers don't need to care about this on the initial getDisplayMedia call, and every track remains a candidate for injection:

To inject everything (the UA optimizes stopping tracks surfaced in sourceswitch):

video.srcObject = await getDisplayMedia({controller, /*opt-in*/, ...});

To never inject:

video.srcObject = await getDisplayMedia({controller, /*opt-in*/, ...});
controller.onsourceswitch = ({stream}) => {
  video.srcObject.getTracks().forEach(track => track.stop()); // stop old
  video.srcObject = stream;
};

To selectively inject:

video.srcObject = await getDisplayMedia({controller, /*opt-in*/, ...});
controller.onsourceswitch = ({stream}) => {
  if (tracksAreCompatible(video.srcObject, stream)) {
    stream.getTracks().forEach(track => track.stop()); // stop new
  } else {
    video.srcObject.getTracks().forEach(track => track.stop()); // stop old
    video.srcObject = stream;
  }
};

dontcallmedom-bot commented 7 months ago

This issue had an associated resolution in WebRTC April 23 2024 meeting – 23 April 2024 (Captured Surface Switching):

RESOLUTION: more discussion is needed on the lifecycle of surface tracks

youennf commented 2 months ago

The onsourceswitch or onnewsource approach seems sufficient to me to support both switch and injection models.

The small feedback I would give is that having these as events might not be great. A callback might be better instead so that there is only one receiver that is responsible to deal with it, for instance closing the new stream/tracks.

Something like captureController.processSourceSwitch(stream => { ... }); or captureController.processSourceSwitch(null);

tovepet commented 2 months ago

@jan-ivar:

With the optimization @youennf proposed, forgetting stop() seems like an existing problem.

Having apps explicitly stop() tracks they're done with is the web model today, which makes its side-effects well-established, predictable, and pilot errors easy to diagnose and fix.

It makes sense for the application to be responsible to stop a track that it has requested, but in this case the UA throws an extra track on the application that the application doesn’t want. It seems wrong to me to force application writers to stop this extra track that they never asked for.

I'm not convinced introducing custom stopping-policies into the mix simplifies that responsibility.

There is no new custom stopping policy. The surface track is bound to a specific surface, and it ends when the user switches away from that surface, since no more media will be delivered from that surface.

It’s the same behavior as when the user stops the capture of a surface.

Ah, I missed earlier you said the event would fire for all new surfaces "including the initial one"! Having apps immediately stop tracks from getDisplayMedia() does look weird indeed.

I like the session vs surface behaviors, but why do web developers need to pick between two types of tracks?

If we consider the following two classes of applications:

Capture-interacting applications that need to react to surface-switching or interact with surface-specific APIs.
Capture-agnostic applications that just want to capture what the user wants and don’t care what it is.

What I tried to achieve with this proposal was to have tracks bound to individual surfaces for capture-interaction applications while retaining the ease of use of the injection model for capture-agnostic applications. I believe the pure switch-track model is the easiest and least error-prone model for capture-interacting applications.

Overall, I think I’ve seen three different solutions to the stopping problem so far:

Let the session track be a proxy track for the surface tracks so it doesn’t need to be stopped independently.
Let session tracks and surface tracks be independent and let the application ask for the types of tracks it wants to receive
Let session tracks and surface tracks be independent, provide both to the application and let the application stop the one it doesn’t want (unless @youennf’s optimization applies).

I think option 1 and 2 are interesting to explore, while option 3 looks less attractive.

@youennf:

The onsourceswitch or onnewsource approach seems sufficient to me to support both switch and injection models.

I think they can be, but I don’t think we have yet found an API-shape that we all agree on, so that’s why I explore other options.

The small feedback I would give is that having these as events might not be great. A callback might be better instead so that there is only one receiver that is responsible to deal with it, for instance closing the new stream/tracks.

Something like captureController.processSourceSwitch(stream => { ... }); or captureController.processSourceSwitch(null);

I’m fine with this.

youennf commented 2 months ago

The switch and injection models are roughly equivalent to me for applications that are ok reacting synchronously to a switch change.

When the reaction is asynchronous (say applying region capture), I am not sure one of the presented models is more suited (VideoTrackGenerator to the rescue maybe).

Wrt option 2 and 3, they are not mutually exclusive with the callback approach:

No need for the previous option 3 optimization: no callback => new tracks are stopped/never existed.
We start simple (captureController.processSourceSwitch(callback)) and extend the API when we are ready.

Isn't option2 somehow equivalent to one of these option 2 extensions ?

captureController.processSourceSwitch(callback, { mode: 'stop-previous-tracks' })
captureController.processSourceSwitch(stream => { ...; return 'stop-previous-tracks'; }) // synchronous decision and video frames flowing
captureController.processSourceSwitch(async stream => { await...; return 'stop-previous-tracks'; }) // asynchronous decision and video frames flowing

tovepet commented 1 month ago

@youennf:

...

We start simple (captureController.processSourceSwitch(callback)) and extend the API when we are ready.

Isn't option2 somehow equivalent to one of these option 2 extensions ?

captureController.processSourceSwitch(callback, { mode: 'stop-previous-tracks' })

In the case of requesting surface tracks, it's equivalent to this one.

So, if I understand you correctly, you think we could start with something like the following?

An application can register a callback and specify the stop-previous-tracks-mode:

captureController.processSourceSwitch(callback, { mode: 'stop-previous-tracks' });
getDisplayMedia({captureController, …});

And then, when a user selects another surface, the UA will:

stop the tracks for the previous surface
call the callback with a new stream for the surface the user has selected.
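The two UA steps above can be sketched against a mock. MockCaptureController, MockSurfaceTrack and simulateUserSwitch() are hypothetical; processSourceSwitch() and the 'stop-previous-tracks' mode are proposals from this thread, not a shipped API. The no-callback branch assumes plain injection, as discussed earlier.

```javascript
// Hypothetical model of the proposal; processSourceSwitch() is not a shipped API.
class MockSurfaceTrack {
  constructor(label) {
    this.label = label;
    this.readyState = "live";
  }
  stop() { this.readyState = "ended"; }
}

class MockCaptureController {
  constructor(initialTrack) {
    this.currentTracks = [initialTrack];
    this.callback = null;
    this.mode = null;
  }

  processSourceSwitch(callback, options = {}) {
    this.callback = callback;
    this.mode = options.mode ?? null;
  }

  // Simulates the UA reacting to the user selecting another surface.
  simulateUserSwitch(label) {
    if (!this.callback) return; // assumption: no callback means injection into existing tracks
    if (this.mode === "stop-previous-tracks") {
      this.currentTracks.forEach(track => track.stop()); // step 1
    }
    const newTrack = new MockSurfaceTrack(label);
    this.currentTracks = [newTrack];
    this.callback({ getTracks: () => [newTrack] }); // step 2
  }
}

const first = new MockSurfaceTrack("window");
const captureController = new MockCaptureController(first);
let received = null;
captureController.processSourceSwitch(
  stream => { received = stream; },
  { mode: "stop-previous-tracks" }
);
captureController.simulateUserSwitch("tab");
```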

This sounds good to me.

youennf commented 1 month ago

I am hoping we can quickly reach consensus on captureController.processSourceSwitch(callback) without any option for now. That would allow us to define the model and this method very quickly in the spec/UAs to help web developers.

I am not sure we have reached consensus yet on which options to expose and how to expose them. That is why I am proposing this two-step approach, where we know we can easily go from step 1 (no options) to step 2 (with options). The idea would be to continue step 2 discussions while doing step 1 spec/implementation work.

tovepet commented 1 month ago

Sounds like a plan!

I uploaded a PR last year along these lines: https://github.com/w3c/mediacapture-screen-share/pull/289

(I called the method setDisplaySurfaceChangeCallback, and I do think it is more of a setter than a process-method, but I'm open to discuss other names)

Please take a look!

jan-ivar commented 1 month ago

A callback might be better instead so that there is only one receiver that is responsible to ... closing the new ...tracks.

It makes sense for the application to be responsible to stop a track that it has requested, but in this case the UA throws an extra track on the application that the application doesn’t want. It seems wrong to me to force application writers to stop this extra track that they never asked for.

Forgetting stop seems a problem in all the proposals.

What if the UA stopped tracks synchronously after the callback/event-handler, requiring JS that wants to use a track to clone it?
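A rough model of that idea, assuming the UA stops whatever it offered as soon as the handler returns. OfferedTrack and deliverSwitch() are hypothetical names; the only real-API behavior relied on is that a cloned track has a lifetime independent of the original.

```javascript
// Hypothetical model: the UA stops the offered track right after the handler runs.
class OfferedTrack {
  constructor() { this.readyState = "live"; }
  stop() { this.readyState = "ended"; }
  clone() { return new OfferedTrack(); } // a clone has its own lifetime
}

function deliverSwitch(handler) {
  const offered = new OfferedTrack();
  handler(offered);
  offered.stop(); // UA-side stop, synchronously after the handler returns
  return offered;
}

let kept = null;
const offeredToCloner = deliverSwitch(track => { kept = track.clone(); });
const offeredToIgnorer = deliverSwitch(() => { /* app ignores the new track */ });
```

In this model an app that forgets about the new track leaves nothing live behind, and an app that wants it has to say so by cloning.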

@youennf is the stop-problem the only issue driving you to prefer callbacks over events?

jan-ivar commented 1 month ago

My thinking is a sourceswitch event that fires whenever the user switches source (with no requirement to stop tracks) might be useful even to capture-agnostic applications. E.g. to disambiguate configurationchange events fired on its tracks.

Something like captureController.processSourceSwitch(stream => { ... }); or captureController.processSourceSwitch(null);

I’m fine with this.

Note the session vs surface tracks distinction won't hold here. E.g.

video.srcObject = (await new Promise(r => controller.processSourceSwitch(r))).stream;
controller.processSourceSwitch(null);
// the tracks in video.srcObject are now surface tracks yet subject to injection

How is the surface/session distinction meaningful to web developers?

jan-ivar commented 1 month ago

While writing the above code example, I found no way to await injection, which felt frustrating. Contrast with:

video.srcObject = (await new Promise(r => controller.onsourceswitch = r)).stream.clone();
// the tracks in video.srcObject are now surface tracks
await new Promise(r => controller.onsourceswitch = r);
// the tracks in video.srcObject have been injected

youennf commented 1 month ago

My thinking is a sourceswitch event that fires whenever the user switches source (with no requirement to stop tracks)

The callback approach allows this as well. I was not clear about it previously in this thread (sorry about that), setting the callback would not be a signal for the UA to go to the switch mode and stop the previous tracks.

Instead, we stick with the injection model for old tracks. The web page can stop the old tracks anyway. I am ok adding an option so that the web page tells the UA to stop the tracks (hence the various proposals I made on top of the callback). We need though language that instructs that media is not flowing in the old tracks until the callback is executed.

Having a callback to deliver the stream is better since there is one place where you decide what to do with the new tracks (clone it, stop it...). And the spec can be made clear that MediaStreamTracks are not created if the callback is not set. This is more difficult with events. And I do not really see a case for multiple event listeners for this switch case (web devs already have configuration change anyway).

jan-ivar commented 1 month ago

The web page can stop the old tracks anyway

What enforces that the website can't keep both the old live injected track and the live new track? We need to specify this implicit action at a distance.

Having a callback to deliver the stream is better since there is one place where you decide what to do with the new tracks (clone it, stop it...)

If this means there's one place where you decide what happens with the old tracks (enforced by the aforementioned action at a distance), then I agree that might be a good reason for a callback.

Can we make it a settable attribute at least?

youennf commented 1 month ago

What enforces that the website can't keep both the old live injected track and the live new track? We need to specify this implicit action at a distance.

I do not see any implicit action at a distance, the website can keep both. Is there an issue with that?

If this means there's one place where you decide what happens with the old tracks

In my mind, the default behavior (whether setting the callback or not) is that no track is being stopped by UA, the web page can deal with it by itself.

We can enrich the callback to make the UA stop the previous tracks, for instance:

captureController.setDisplaySurfaceChangeCallback(stream => { ... }, { mode: 'stop-previous-tracks' })
captureController.setDisplaySurfaceChangeCallback(stream => { ...; return 'stop-previous-tracks'; })
captureController.setDisplaySurfaceChangeCallback(stream => { ...; captureController.stopPreviousTracks();... })
captureController.setDisplaySurfaceChangeCallback(async stream => { await...; captureController.stopPreviousTracks();... });

It is a bit less straightforward to extend things with an event. And again, it is not really compelling to have several event listeners sharing the responsibility to stop the old tracks (or the new tracks).

Also, I could see a UA tailoring its UI based on the callback being registered (not showing the sharing audio check box if no audio was shared before the surface switching for instance). This is sort of similar to MediaSession going with callbacks as action handlers.

Can we make it a settable attribute at least?

Ah, good point, I guess this would disallow option 1 above.

dontcallmedom commented 1 month ago

(for the record - this was discussed in the joint SCCWG/WebRTC meeting last week)

jan-ivar commented 1 month ago

And I do not really see a case for multiple event listeners for this switch case

A singular callback assumes a single downstream consumer. An app may have multiple consumers of an active screen-capture, e.g. a transmitter, a preview, and a recorder, each with distinct downstream needs.

Tracks can be cloned, but a CaptureController cannot. So this becomes a choke point. We don't want different parts of an app competing to set the same callback and overwrite each other.
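The choke point can be shown with two small mocks. CallbackController and EventController are hypothetical, as is the "sourceswitch" event name; the contrast only relies on how a single callback slot and EventTarget listeners behave.

```javascript
// Single-slot callback: a later registration silently replaces an earlier one.
class CallbackController {
  setCallback(cb) { this.cb = cb; }
  simulateSwitch() { if (this.cb) this.cb(); }
}

// Event-based: every consumer registers independently.
class EventController extends EventTarget {
  simulateSwitch() { this.dispatchEvent(new Event("sourceswitch")); }
}

const consumers = ["transmitter", "preview", "recorder"];

const notifiedByCallback = [];
const cbController = new CallbackController();
consumers.forEach(name =>
  cbController.setCallback(() => notifiedByCallback.push(name)));
cbController.simulateSwitch();

const notifiedByEvent = [];
const evController = new EventController();
consumers.forEach(name =>
  evController.addEventListener("sourceswitch", () => notifiedByEvent.push(name)));
evController.simulateSwitch();
```

With the callback slot only the last consumer to register hears about the switch, whereas all three listeners run in registration order.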

The web platform tries hard to avoid single-consumer APIs. See § 7.3. Don’t invent your own event listener-like infrastructure, and requestVideoFrameCallback.

I think we need a good reason to deviate from these principles.

(web devs already have configuration change anyway).

Those are per-track and cannot tell you whether the source changed or e.g. was just resized.

And the spec can be made clear that MediaStreamTracks are not created if the callback is not set. This is more difficult with events.

This seems like a marginal optimization compared to such a significant user action.

This is sort of similar to MediaSession going with callbacks as action handlers.

That's a fairly recent API with its own flaws. But it has a good reason: Many of its actions rely on the website to maintain a singular state. What's our reason?

jan-ivar commented 1 month ago

Also, I could see a UA tailoring its UI based on the callback being registered (not showing the sharing audio check box if no audio was shared before the surface switching for instance).

We've gone around a few times on this point. Yes the absence of a callback might preclude the app handling audio, but the presence of a callback does not guarantee it.

But § 7.3. specifically mentions this point: for "an API which allows authors to start ... a process which generates notifications, use the existing event infrastructure"

youennf commented 1 month ago

Those are per-track and cannot tell you whether the source changed or e.g. was just resized.

Just check for deviceId in track.getSettings(), no need for using source switch. Source switch is about deciding whether to use the old tracks or the new tracks.
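
A minimal sketch of that check, assuming the user agent exposes `deviceId` in `getSettings()` for display-capture tracks (an assumption about UA behavior, not something the spec guarantees). `makeSourceChangeDetector` and `FakeTrack` are hypothetical helpers so the example runs outside a browser.

```javascript
// Returns a function that reports whether the track's deviceId changed since
// the previous call. `track` is any object with a getSettings() method.
function makeSourceChangeDetector(track) {
  let lastDeviceId = track.getSettings().deviceId;
  return function sourceChanged() {
    const current = track.getSettings().deviceId;
    // If deviceId is not exposed (undefined), no switch can be detected.
    const changed = current !== undefined && current !== lastDeviceId;
    lastDeviceId = current;
    return changed;
  };
}

// Hypothetical fake track standing in for a MediaStreamTrack.
class FakeTrack extends EventTarget {
  constructor(settings) {
    super();
    this.settings = settings;
  }
  getSettings() {
    return { ...this.settings };
  }
  // Applies a settings change and fires configurationchange, as a UA might.
  simulateChange(patch) {
    Object.assign(this.settings, patch);
    this.dispatchEvent(new Event("configurationchange"));
  }
}

const track = new FakeTrack({ deviceId: "window:1", width: 1280 });
const sourceChanged = makeSourceChangeDetector(track);
const observed = [];

track.addEventListener("configurationchange", () => {
  observed.push(sourceChanged() ? "source switched" : "same source");
});

track.simulateChange({ width: 1920 });          // resize only
track.simulateChange({ deviceId: "screen:0" }); // user picked another surface
console.log(observed);
```

In a browser the same detector would be attached to a track returned by `getDisplayMedia()`; if `deviceId` is absent from the settings, it never reports a switch.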

That's a fairly recent API with its own flaws.

I don't see how this particular flaw applies here.

MediaStreamTrack events are where you distribute the info. CaptureController is a single place for mission critical information (running the callback may actually trigger a freeze of video frame generation).

youennf commented 1 month ago

Discussed at the editors' meeting; we will try to converge via a design document.

youennf commented 1 month ago

Another proposal to consider, maybe it could help converge:

Rely on configurationchange event for apps to decide whether to continue processing or not. This means that when there is a source switch, video frame sending to sinks will be suspended until configurationchange event listeners are called (on a per track basis). This ensures the injection model works.
Rely on a new notification mechanism (event/callback) to expose new tracks coming in (whether audio or video).
We leave it to the UA/OS for now to decide how tracks get stopped. Depending on the OS UX, the user expectations may actually be different. We progressively converge on defining this behaviour as we experiment.
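
The first point can be modelled with a toy simulation. This is only a sketch of the ordering guarantee, not UA code: `FakeCaptureTrack`, `addSink()`, `deliverFrame()`, and `switchSurface()` are hypothetical names, and real frame delivery is not driven by the app.

```javascript
// Toy model of a capture track whose sinks receive frames.
class FakeCaptureTrack extends EventTarget {
  constructor(surface) {
    super();
    this.surface = surface;
    this.enabled = true;
    this.sinks = [];
  }
  addSink(sink) {
    this.sinks.push(sink);
  }
  // Delivers one frame of the current surface to every sink, if enabled.
  deliverFrame() {
    if (!this.enabled) return;
    for (const sink of this.sinks) sink({ surface: this.surface });
  }
  // The user switches surfaces: listeners run (synchronously here) before
  // any frame of the new surface can reach a sink.
  switchSurface(surface) {
    this.surface = surface;
    this.dispatchEvent(new Event("configurationchange"));
  }
}

const track = new FakeCaptureTrack("tab A");
const delivered = [];
track.addSink((frame) => delivered.push(frame.surface));

// The app pauses the track until it has reconfigured its processing.
track.addEventListener("configurationchange", () => {
  track.enabled = false;
});

track.deliverFrame();            // frame from "tab A" reaches the sink
track.switchSurface("window B"); // listener pauses the track
track.deliverFrame();            // dropped: the app has not resumed yet
track.enabled = true;            // app finished reconfiguring
track.deliverFrame();            // frame from "window B" reaches the sink
console.log(delivered);
```

Because the listener runs before the next frame, no frame of the new surface reaches a sink that was configured for the old one.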

jan-ivar commented 1 month ago

Those are per-track and cannot tell you whether the source changed or e.g. was just resized.

Just check for deviceId in track.getSettings()

The spec says: "deviceIds are not exposed.". It's not listed in § 5.4 Constrainable Properties for Captured Display Surfaces.

CaptureController is a single place for mission critical information (running the callback may actually trigger a freeze of video frame generation).

Why is that critical? This is the kind of action at a (maybe not so much) distance we should document. This might justify a callback.

youennf commented 1 month ago

The spec says: "deviceIds are not exposed."

Chrome and Safari are exposing deviceIds.

Wrt callback vs. event, let's rediscuss this when we know what signals we want to expose. @tovepet is planning to create a design document we can all participate in to try reaching consensus on the underlying model.

jan-ivar commented 1 month ago

Chrome and Safari are exposing deviceIds.

I've filed crbug 372252497 and webkit 281077.

Agree callback vs. event seems secondary.

The main question seems to be over allowing late decision vs. limiting injection to tracks returned from getDisplayMedia().

What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?

eladalon1983 commented 1 month ago

What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?

I am a bit confused about the purpose of "new session tracks". Who would need them? The entire idea of a "session track" is that it follows the session wherever it goes, whatever the captured surface is, across user-initiated changes. If a developer needs multiple such session tracks, can't they just clone the original ones?

youennf commented 1 month ago

I've filed crbug 372252497 and webkit 281077.

It seems useful information to provide; why not update the spec instead?

youennf commented 1 month ago

Thinking a bit more, I am not sure the separation between session tracks and surface tracks is helping shape this API.

Let's look at the following two scenarios:

User agent U1 is exposing a switch surface UX and user clicks on it. User is expecting that the new surface content will be rendered where the past track content was rendered. It seems reasonable that the same track exposes the media content and that no new track is exposed: session track model seems well suited.
User agent U2 is exposing an add/remove surface UX. User first adds a surface B and then removes the original surface A. I would think user is expecting the app to react to the new track somehow by using a new video element for rendering. This seems in line with the user agent exposing a new track for B and ending the original track A: surface track model seems well suited.

Given this, and given UX in that area is relatively new, I am not sure we can design an API that specifies a particular flow. Having an API that exposes new tracks and having a requirement that video frames of the switching surface do not get provided to sinks until some event/callback actually runs might be good enough for now. Plus some guidelines...

That said, AFAIK, the only thing UAs are doing right now is scenario 1 above. And this is what the initial message of this issue is describing. Based on this, I would suggest trying to fix this and only this for now by specifying the requirement that video frames of the switching surface do not get provided to sinks until some event/callback actually runs. I would tend to use the configuration change/deviceId combo for that as it does not require new API surface.

tovepet commented 1 month ago

I have created a design document to provide for a more structured discussion of the different proposals (view-only): https://docs.google.com/document/d/16CUOJeuXimNPi4kZHOS9rF-WhMuVvOqOg9P--Dvqi_w/edit?usp=sharing

Edit permissions will be granted to members of the WebRTC working group upon request.

jan-ivar commented 1 month ago

What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?

I am a bit confused about the purpose of "new session tracks". Who would need them?

To avoid confusion, I've defined a hybrid track to clarify what I mean. But it's really the surface track I'm questioning. [Edit 2: undid my edit to capture subtle differences]

I've written up the model I have in mind as the late decision model. PTAL (edit: links fixed)

eladalon1983 commented 1 month ago

Jan-Ivar, I am unsure what your current position is, given the edits. Do you withdraw your question about "new session tracks"? My position is that there is no benefit to including new objects in a fired event, if these objects are identical to objects we had before. (That is - new session tracks are identical to the originals, and so "new" ones are useless.)

jan-ivar commented 1 month ago

Sorry for editing multiple times. Calling mine a hybrid track now, to distinguish it. It wasn't clear from the session track definition that its feature set would be limited, which would be backwards incompatible.

Do I understand correctly that a driving goal of the session/surface split is to maintain the subclassing of MST?

eladalon1983 commented 1 month ago

It wasn't clear from the session track definition that its feature set would be limited, which would be backwards incompatible.

That's one vision of it, out of two - either (1) a normal, fully-fledged MediaStreamTrack, or (2) a reduced-feature-set MediaStreamTrack. But which is used is a secondary matter.

The core offering of a session track, as I understand @tovepet here, is that it addresses your (Jan-Ivar's) expressed desire, to be able to seamlessly transition between two models (injection, switch-track).

Your models allow "switching between models at any time."
Tove's model allows "using both or either at any time, concurrently or consecutively."

I believe it is objectively true that Tove's model is more flexible.

youennf commented 1 month ago

I am a bit lost in what we are trying to solve here. Can we add a scope/use case section to the design document?

eladalon1983 commented 1 month ago

I am a bit lost in what we are trying to solve here. Can we add a scope/use case section to the design document?

We should also rely a bit more on an exploration of the use cases, which I see only includes a single use case atm. I have taken the liberty to add 4 more.

tovepet commented 1 month ago

I have added a Scope section with the following bullets that I believe we want to solve in the first step:

Cross-type surface-switching
Add audio-sharing after the fact
Frame delivery is clearly separated for the captured surfaces before and after the switch.

Is this in line with what the rest of you think?
