Multi-capture (concurrent capture of multiple surfaces)

eladalon1983 commented 2 years ago

It has come to my attention that some applications wish to capture multiple display surfaces at the same time. Some examples include:

Streamers presenting multiple surfaces. [*]
Managed devices recording for compliance/training/billing reasons.

Capturing multiple display surfaces is presently achievable using existing APIs - it is possible to call getDisplayMedia() multiple times. However, this is not very ergonomic, and creates serious friction for the user:

The user has to interact with the browser's media-picker multiple times.
The user has to interact with the application multiple times, signaling that they want to capture yet another surface, and providing a new transient activation each time.
The user is liable to make mistakes when trying to remember which surfaces they've already started capturing, and which surfaces remain for them to capture.
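
For illustration, here is a minimal sketch of the status quo described above. `createSurfaceCollector` and the `#add-surface` button are hypothetical names, not part of any spec. Each call to `addSurface()` needs its own user gesture and opens the picker once, so N surfaces cost N clicks and N pickers.

```javascript
// Sketch only: `createSurfaceCollector` is a hypothetical helper.
// `mediaDevices` is injected so the logic can be exercised outside a browser.
function createSurfaceCollector(mediaDevices) {
  const captured = [];
  return {
    captured,
    // Call this from a click handler: every call requires a fresh transient
    // activation and shows the browser's media-picker one more time.
    async addSurface() {
      const stream = await mediaDevices.getDisplayMedia({ video: true });
      captured.push(stream);
      return stream;
    },
  };
}

// Browser usage (assumes a <button id="add-surface"> exists on the page):
// const collector = createSurfaceCollector(navigator.mediaDevices);
// document.querySelector('#add-surface').addEventListener('click', () =>
//   collector.addSurface().catch((err) => console.warn('Capture not started:', err.name)));
```

Nothing in this flow tells the user which surfaces are already in `captured`; the application would have to build that UI itself.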

Ideally, a single transient activation could be used for a single API invocation, providing the user with a media-picker with functionality akin to checkboxes (mentioned here by way of example; we don't need to mandate specific UX elements). The user would be allowed to choose all of the display surfaces that they want to capture, then click OK once. It is clear from context that these are all of the surfaces the user was aiming to capture, and that no additional API calls to gDM or the like are necessary.

As a straw-man proposal, imagine getDisplayMedia({video: true, ..., maxSurfaces: N}). The default value of maxSurfaces is 1, and would trigger the current behavior, returning a single MediaStream. A higher value would trigger the new behavior, and return an array, [MediaStream].

[Image: mock]

Finer points off the bat:

The UA may impose a limit on how many streams may be captured concurrently and prevent the user from choosing more.
If a maxSurfaces greater than 1 is specified, an array will be returned even if the user chooses one surface, to simplify things for the application.
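
A rough sketch of how an application might call the straw-man, under the assumption that it shipped exactly as described. `maxSurfaces` is not implemented anywhere, and `captureSurfaces` is a hypothetical wrapper. Browsers ignore unknown dictionary members, so the wrapper normalizes a single `MediaStream` to an array too.

```javascript
// Sketch only: `maxSurfaces` is the straw-man option from this comment and is
// not implemented by any browser; `captureSurfaces` is a hypothetical wrapper.
async function captureSurfaces(mediaDevices, maxSurfaces = 1) {
  const result = await mediaDevices.getDisplayMedia({ video: true, maxSurfaces });
  // Under the straw-man, maxSurfaces > 1 always yields an array, even when the
  // user picks one surface. A browser that ignores the option yields a single
  // MediaStream, which is wrapped here so callers see one shape.
  return Array.isArray(result) ? result : [result];
}

// Browser usage, from a click handler:
// const streams = await captureSurfaces(navigator.mediaDevices, 4);
// streams.forEach((stream) => attachToPreview(stream)); // attachToPreview is hypothetical
```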

Interesting points to discuss:

MUST/SHOULD/MAY limit the user to choose only one type of display-surface? (Without influencing which.) That is to say, maybe the user can choose any N tabs, any N windows, or any N monitors, but not a combination of K tabs and N-K screens.
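
Whatever the spec ends up saying, an application could inspect what it actually received. The sketch below assumes each stream's video track reports `displaySurface` through `getSettings()` (values such as "monitor", "window" or "browser"); not every browser reports it, and the helper names are made up for illustration.

```javascript
// Sketch only: helper names are hypothetical.
// Collects the displaySurface value reported by each stream's first video track.
function surfaceTypes(streams) {
  const types = new Set();
  for (const stream of streams) {
    const [track] = stream.getVideoTracks();
    // `undefined` is recorded when there is no video track or the browser
    // does not report displaySurface.
    types.add(track ? track.getSettings().displaySurface : undefined);
  }
  return types;
}

// True when every stream reports the same surface type (or the list is empty).
function isSingleSurfaceType(streams) {
  return surfaceTypes(streams).size <= 1;
}
```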

CC @shangl, whose use-case prompted this.

-- [*] Imagine an instructor streaming multiple tabs, and individual viewers independently choosing which one to focus on. I mention this so as to discourage solutions involving stitching together of multiple surfaces on a logical surface.

eladalon1983 commented 2 years ago

(A less complex option using a new method that always returns a sequence is also possible. Straw-man suggestion only. The main point is that there be an API for prompting the user for a set of surfaces.)

ldenoue commented 2 years ago

Love this idea. I developed a native Mac/Windows app called Screegle https://www.appblit.com/screegle that allows a user to select one or more windows during a screen sharing session, and overlay them over a background picture of their choosing. This is very useful during Google Meet or Zoom meetings, where instead of sharing their entire screen (privacy) or having to pick just one window (lack of flexibility), people can share several windows at once, and add/remove them at any time.

It would be fantastic if web applications could provide a similar functionality through getDisplayMedia.

As you point out, getDisplayMedia can be called several times, but it is not user-friendly. Also, once the web application gets several streams, it still doesn't know where to paint them. For Screegle, the web application wants to render selected streams onto a virtual desktop image at their current position and size: it thus needs access to the X,Y positions of the originating windows (I assume getDisplayMedia's stream's width/height are the actual window width/height?)

My current workaround is ugly and very inefficient: first the user is asked to share their entire screen by clicking "Share Screen", which calls getDisplayMedia and checks that the capabilities' deviceId contains screen:.

Users then can pick individual windows by clicking "Share Window", which calls again getDisplayMedia and checks that the deviceId contains window:.
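
A minimal sketch of that check, assuming (as reported above) that the track's capabilities expose a `deviceId` containing `screen:` or `window:`. That format is implementation behavior rather than something a spec guarantees, and `classifyByDeviceId` is a hypothetical helper name.

```javascript
// Sketch only: relies on the deviceId format described in this comment,
// which is implementation-specific and may change.
function classifyByDeviceId(track) {
  const { deviceId = '' } = track.getCapabilities();
  if (deviceId.includes('screen:')) return 'screen';
  if (deviceId.includes('window:')) return 'window';
  return 'unknown';
}

// Browser usage:
// const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
// const kind = classifyByDeviceId(stream.getVideoTracks()[0]);
```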

The really slow part is determining where each window appears in the overall screen image. The prototype currently uses a brute-force pixel-matching algorithm that slides the image of each window across the screen image, sped up by examining every 16th pixel instead of every pixel.
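
To make the cost concrete, here is a simplified sketch of that kind of search. It is not the Screegle code: it assumes both images are already grayscale `{ width, height, data }` objects (one byte per pixel), scores each offset by the sum of absolute differences over sampled pixels, and `locateWindow` is a hypothetical name.

```javascript
// Sketch only: brute-force template matching on grayscale images.
// `screen` and `win` are { width, height, data } with one byte per pixel.
// `stride` controls how sparsely the window's pixels are sampled.
function locateWindow(screen, win, stride = 16) {
  let best = { x: -1, y: -1, cost: Infinity };
  // Try every offset at which the window fits inside the screen image.
  for (let oy = 0; oy + win.height <= screen.height; oy++) {
    for (let ox = 0; ox + win.width <= screen.width; ox++) {
      let cost = 0;
      // Compare only every `stride`-th pixel; stop scanning rows once this
      // offset can no longer beat the best one found so far.
      for (let y = 0; y < win.height && cost < best.cost; y += stride) {
        for (let x = 0; x < win.width; x += stride) {
          const s = screen.data[(oy + y) * screen.width + (ox + x)];
          const w = win.data[y * win.width + x];
          cost += Math.abs(s - w);
        }
      }
      if (cost < best.cost) best = { x: ox, y: oy, cost };
    }
  }
  return best;
}
```

Even with sampling, the work grows with the number of candidate offsets times the number of sampled pixels, and it has to be repeated whenever a window moves. That is why coordinates exposed by the browser would be preferable.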

https://www.appblit.com/static/screegle/screegle-sdk-demo.html

It would be great if the browser had 2 new things:

a way to let users pick several windows at once (as you suggested)
add a new API that, given an opaque streamID, gives back the X,Y,W,H of the corresponding window on the desktop

I'm attaching a video demonstration showing what this current prototype does:

https://user-images.githubusercontent.com/149561/153718873-3294ca0a-17a6-42a5-aaea-38b38d0e9b77.mp4

Thanks!

eladalon1983 commented 2 years ago

That's a great demo. I'd be very interested in (orthogonally) adding an API for exposing these coordinates. I think you'll also want the z-order, btw...? Edit: Thanks, @shangl, for this link: https://web.dev/multi-screen-window-placement/.

ldenoue commented 2 years ago

@eladalon1983 yes, zOrder is useful (currently, the demo first paints all selected windows at their last seen position, and then paints the recently found windows over them, so z-order works, but of course it relies on image processing, which we want to avoid completely).

The native versions of Screegle for Mac and Windows poll the OS for window information.

On macOS, in Swift or Objective-C, the function is CGWindowListCopyWindowInfo, which returns the list of all windows ordered from back to front and contains the kCGWindowBounds rectangle of each window (among other things).

On Windows, Screegle uses ElectronJS and relies on https://github.com/sentialx/node-window-manager/blob/master/src/classes/window.ts#L22 to obtain window information, matching the windowID against DesktopCapturer.getSources https://www.electronjs.org/docs/latest/api/desktop-capturer, which conveniently uses window:native_window_ID to represent available windows. This is also what getCapabilities returns in deviceId for a given stream's track.

Perhaps the easiest way would be to extend the existing getCapabilities https://www.w3.org/TR/mediacapture-streams/#dictionary-mediatrackcapabilities-members API by adding {top,left,zOrder}? (width and height are already returned)

Or allow a web application holding several streamIDs to call navigator.mediaDevices.streamInfo(<array of streamIDs>), which would return an array of information about these streams ordered from back to front, for example [{streamID1,x1,y1,w1,h1},...,{streamID4,x4,y4,w4,h4}] (here the window corresponding to streamID4 is above all other windows listed in the call).

The web application would be able to call this API at any time.
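
As a sketch of how an application might consume such a result: `streamInfo` is only the API suggested in this comment and does not exist, the `{ streamID, x, y, w, h }` shape is assumed from the example above, and `paintWindows` and `videosById` are hypothetical names. Because the array is ordered back to front, painting in array order reproduces the z-order without any image processing.

```javascript
// Sketch only: consumes the back-to-front array that the proposed (and
// currently non-existent) streamInfo() call would return.
// `ctx` is a CanvasRenderingContext2D; `videosById` maps streamID to the
// <video> element that is playing that stream.
function paintWindows(ctx, infos, videosById) {
  for (const { streamID, x, y, w, h } of infos) {
    const video = videosById.get(streamID);
    // Later entries are closer to the front, so they are drawn last.
    if (video) ctx.drawImage(video, x, y, w, h);
  }
}
```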

jan-ivar commented 2 years ago

I think this idea presumes too much about application logic, which is seeping into browser UX here:

it assumes order doesn't matter
the UX presented ignores the question of whether to include audio or not for each user choice
users have no way to correlate choices made with their role in the application
it doesn't handle duplication well (users wanting the same choice for multiple roles in the application)
it assumes to some degree that users will make all their choices at once (can't revisit picker with the same choices)

Picking, or more broadly managing multiple things is a problem best dealt with in the context of an application IMHO, and the above problems are ones that picking one thing at a time in the context of the application doesn't have.

I think hyper-focusing on the initial picking rather than management skews the value-add of a monolithic picker like this. It's not going to save the application from needing to design a place where the user can manage the multiple choices made, but might lead some applications to think they can skip that by instead leaning on calling this picker again, expecting the user to check the boxes over and over, rather than let them edit an existing choice (which my fiddle above allows btw).

I think picking multiple things outside of the context of an application isn't very webby, a bit of an anti-pattern on the web.

The user is liable to make mistakes when trying to remember which surfaces they've already started capturing

My fiddle has thumbnails. Also, as mentioned at the meeting, browsers could highlight already-captured choices in the UX today without a spec change, if they think this information is useful. This wouldn't rely on the user making all choices at once.

jan-ivar commented 2 years ago

Vendors that wish to, should be able to experiment with prompt-bundling by detecting multiple invocations of getDisplayMedia on the same JS task today, e.g.:

// Must run in an async context (or a module with top-level await).
const [choice1, choice2, choice3] = await Promise.all([
  navigator.mediaDevices.getDisplayMedia(),
  navigator.mediaDevices.getDisplayMedia(),
  navigator.mediaDevices.getDisplayMedia(),
]);

They could satisfy such simultaneous requests using a unified picker with checkboxes. This would be backwards compatible with other browsers where users would see 3 prompts one after the other (or fewer if the user cancels).
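
One caveat for applications: `Promise.all` rejects as soon as any one request rejects, so a single cancelled prompt would discard the other results. A variant using `Promise.allSettled` keeps whatever the user did share. `requestSurfaces` is a hypothetical helper, and this sketch makes no assumption about whether a given browser honors several calls from one user gesture.

```javascript
// Sketch only: issue `count` getDisplayMedia() requests in the same task and
// keep the streams from the requests that succeeded.
async function requestSurfaces(mediaDevices, count) {
  const requests = Array.from({ length: count }, () =>
    mediaDevices.getDisplayMedia({ video: true }));
  // allSettled waits for every request and never rejects, so one cancelled
  // prompt does not throw away the other captures.
  const results = await Promise.allSettled(requests);
  return results
    .filter((result) => result.status === 'fulfilled')
    .map((result) => result.value);
}

// Browser usage, from a click handler:
// const streams = await requestSurfaces(navigator.mediaDevices, 3);
```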

happylinks commented 2 years ago

Great idea! At Tella we've had multiple users ask for the ability to record multiple windows at the same time, without sharing their full screen. We haven't implemented this yet, partly because like you said in the original post, the UX currently is not ideal for a user. Picking a screen is already a hard task for a lot of users (giving OS permissions, knowing the implications of picking a screen (mirror effect), etc) so we didn't want to make it more complex with multiple prompts. However, if they would be able to just multiselect windows this would make the experience a lot better.

We would indeed also want a way to make sure they don't select too many streams (like with maxSurfaces in your example), since recording a lot of streams has a performance impact.

Partly related, it would be great if we could say they can only capture windows, but I know there's already a discussion about that here.

So summarizing: I like the idea of allowing selection of multiple windows/tabs/screens and I think it will improve the UX for the screen picker and will make it nicer to implement recording/streaming apps.

Edit: Also, one advantage I can see over prompting multiple times: we don't know at the start how many streams they want to share. "Add another stream/window" is something that could be added in our own UI, but it could also be more confusing to the user than handling it in context, in the screen picker.

eladalon1983 commented 2 years ago

@jan-ivar:

it assumes order doesn't matter

First, if we think it's important to support order, it's trivial to specify that. Sequences are ordered. (And let UX worry about communicating it to the user.)

Second - see my response to bullet number 3.

the UX presented ignores the question of whether to include audio or not for each user choice

The "UX presented" was a mock illustrating what is generally possible. Don't worry, when it's time to ship, we'll have something much more refined. What matters for the W3C is that the API will specify that the user must be allowed to control whether audio is shared, and the question of whether it should be controlled per-surface or globally. Let's focus on that.

users have no way to correlate choices made with their role in the application

For many applications, order doesn't matter and there are no roles.

If an application records all screens for legal compliance, it need not label them. In the case of a lawsuit, a human being will watch all recorded screens.
If an application streams some windows, a human being on the other end can choose which one they want to focus on. No need for labelling.

it doesn't handle duplication well (users wanting the same choice for multiple roles in the application)

I have technical answers to that (cloning, app-based UX, Capture Handle, etc.). But I think it would be a mistake to start that discussion at this time, as I believe we have run into a severe methodological issue. Please see my next comment.

eladalon1983 commented 2 years ago

@jan-ivar:

My previous comment dealt with the technical details raised in your previous comment. I'm posting a separate comment here to address what I see as a severe methodological issue, which has played itself out with minor variations over multiple proposals during the past year.

I have presented a set of use-cases for which we have genuine Web-developer interest and need (some examples already in the thread, and possibly more to come). I have presented a general approach to address these use-cases, which yields an improvement over existing mechanisms (getDisplayMedia). That is, I am offering incremental advancement of the Web platform - the explicit purpose of the W3C.

Let's examine your response, both in this thread as well as during the interim meeting:

You have pointed out that existing mechanisms (getDisplayMedia) can be used as a partial solution. Okay, so...? We are here in order to make incremental improvements to the Web platform.
You have pointed out yet more use-cases that would not be served by my API. Generic use-cases without clear Web-developer support. Okay, so...? No API can hope to address every single conceivable use-case. If you want to address additional use cases, we should iterate, not abort.

This is not conducive to progress. I hope that we can address this, so that we may be more productive over the coming year.

eladalon1983 commented 2 years ago

We're proceeding in the WICG for the time being. (https://github.com/WICG/multicapture) I'm hoping to see this go back to the W3C.

[Edit, 2022-11-10: When I said "W3C", I meant "WebRTC WG".]

jan-ivar commented 2 years ago

Also like @ldenoue's idea about being able to create a custom picker with something like enumerateDisplaySurfaces.

Ironically, it's a deleted comment that's making me change my mind here (they likely realized the privacy issue that enumerating all the user's tabs would be, so kudos and my apologies for bringing it up again).

But with my chair-hat on: it reminded me that expanding on the capabilities of in-browser pickers is actually in keeping with our desire and efforts to move away from enumeration in related mediacapture specs, and should therefore be encouraged.

I'm hoping to see this go back to the W3C.

I'd like that as well, as a proliferation of competing APIs seems counterproductive. My concerns with the API as well as lack of implementer interest (at this point) remain, but I'd be happy to keep discussing those here (with my chair-hat off).

jan-ivar commented 2 years ago

I like proposal 2.

eladalon1983 commented 2 years ago

My ... lack of implementer interest (at this point) remain

You raise an interesting topic - the appropriateness of the W3C as a spec-hosting venue for specs which only a single browser engine intends to implement. We should discuss this question. Namely - would it not make more sense for the discussion (about a particular spec) to proceed in the WICG, until such a time as more vendors are convinced and wish to implement it or a variation of it, at which point the spec can migrate to the W3C?

[Edit, 2022-11-10: When I said "W3C", I meant "WebRTC WG".]

trookie2000 commented 5 months ago

How can multi-capture be achieved? Some users may choose not to share their desktop for privacy reasons, but want to share several windows at the same time. Is it possible to modify the getDisplayMedia source code (https://developer.mozilla.org/en-US/docs/Web/API/MediaDevices/getDisplayMedia) to implement my idea, and if so, how? If not, is there any other way?

w3c / mediacapture-screen-share-extensions

Multi-capture (concurrent capture of multiple surfaces) #8