example: replace logo video with a screen-sharing track of the primary screen

daniel-abramov commented 1 year ago

A concise follow-up on https://github.com/livekit/rust-sdks/issues/92 - I've decided to build a quick-and-dirty prototype of what it would look like if we captured a screen.

For the demo, I decided to use the screenshots-rs library. It's not the fastest one, but it is very easy to use (I wanted to use scrap, but it has not been updated for years, and rustdesk uses a fork of scrap with their own changes that are unfortunately AGPL-licensed). Since this is just for the sake of the example and is going to be replaced with a media devices API eventually, I think it's fine 🙂

I decided not to do further refactorings and improvements for now (I noticed that the example code could be simplified/streamlined a bit) to keep the PR small.

The resize() step is probably very suboptimal at this stage, and I'm not sure what the most elegant and fast solution is. With real screen-sharing (especially when capturing a single window), the size of the captured window can change, but the dimensions of the source are configured upfront. So I wonder whether changing the size of the captured window should result in re-publishing the track (to change the source resolution) or whether the resizing step is necessary (to make sure that the image fits what was configured at the source level). That being said, perhaps the current resizing step could be done in a better way.
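To make the cost of the resizing step concrete, here is a minimal, dependency-free sketch of a nearest-neighbour resize over a tightly packed RGBA buffer. It is not the code from this PR (which calls resize() on the captured image); the function name `resize_rgba_nearest` and its signature are made up for illustration, and nearest-neighbour is chosen only because it is the cheapest filter.

```rust
/// Resize a tightly packed RGBA buffer (4 bytes per pixel) with
/// nearest-neighbour sampling. Hypothetical helper, for illustration only.
pub fn resize_rgba_nearest(
    src: &[u8],
    src_w: usize,
    src_h: usize,
    dst_w: usize,
    dst_h: usize,
) -> Vec<u8> {
    assert_eq!(src.len(), src_w * src_h * 4, "unexpected RGBA buffer size");
    assert!(src_w > 0 && src_h > 0, "source image must not be empty");

    let mut dst = vec![0u8; dst_w * dst_h * 4];
    for y in 0..dst_h {
        // Map the destination row back to a source row (always < src_h).
        let sy = y * src_h / dst_h;
        for x in 0..dst_w {
            // Map the destination column back to a source column (always < src_w).
            let sx = x * src_w / dst_w;
            let s = (sy * src_w + sx) * 4;
            let d = (y * dst_w + x) * 4;
            dst[d..d + 4].copy_from_slice(&src[s..s + 4]);
        }
    }
    dst
}
```

Even this cheapest variant touches every destination pixel on the CPU for every frame, which is why the step is a concern at higher frame rates and resolutions.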

P.S.: screenshots-rs screen capturing is not super fast, and given that I additionally resize() the image, I doubt that a very high FPS is achievable this way (without overloading the CPU). Looking forward to a native API to efficiently capture the screen and send it to the source 🙂

theomonnom commented 1 year ago

Hey thanks, this is awesome, I think you can directly capture a new I420Buffer, you don't need to resize the image. Maybe it will be faster? Also I was thinking about making a new track instead of replacing the logo :)

daniel-abramov commented 1 year ago

> I think you can directly capture a new I420Buffer

Generally, it should be possible with certain APIs. However, the crate that I used for screen capturing (screenshots-rs) has a simple API that returns an RgbaImage. This is certainly not the most efficient way, which is why I only used it for the example (I believe that once the screen-capturing API in the SDK is complete, we could remove the screenshots-rs dependency from the example).
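For context on what sits between an RGBA capture and an I420 buffer, below is a rough, dependency-free sketch of the conversion. It is not the SDK's yuv_helper and makes no claim about how that helper works: the `I420` struct and `rgba_to_i420` function are hypothetical names, and the sketch assumes BT.601 limited-range integer coefficients and tightly packed RGBA input.

```rust
/// Planar I420 frame: a full-resolution Y plane plus U and V planes
/// subsampled by 2 in both directions. Illustrative type, not the SDK's.
pub struct I420 {
    pub width: usize,
    pub height: usize,
    pub y: Vec<u8>,
    pub u: Vec<u8>,
    pub v: Vec<u8>,
}

/// Convert tightly packed RGBA to I420 (assumed BT.601, limited range).
pub fn rgba_to_i420(rgba: &[u8], width: usize, height: usize) -> I420 {
    assert_eq!(rgba.len(), width * height * 4, "unexpected RGBA buffer size");
    // Chroma planes round up so odd dimensions are still covered.
    let (cw, ch) = ((width + 1) / 2, (height + 1) / 2);
    let mut out = I420 {
        width,
        height,
        y: vec![0; width * height],
        u: vec![0; cw * ch],
        v: vec![0; cw * ch],
    };

    // Luma: one sample per pixel.
    for (i, px) in rgba.chunks_exact(4).enumerate() {
        let (r, g, b) = (px[0] as i32, px[1] as i32, px[2] as i32);
        out.y[i] = (((66 * r + 129 * g + 25 * b + 128) >> 8) + 16) as u8;
    }

    // Chroma: average each 2x2 block (clipped at the image edges).
    for cy in 0..ch {
        for cx in 0..cw {
            let (mut r, mut g, mut b, mut n) = (0i32, 0i32, 0i32, 0i32);
            for dy in 0..2 {
                for dx in 0..2 {
                    let (px, py) = (cx * 2 + dx, cy * 2 + dy);
                    if px < width && py < height {
                        let i = (py * width + px) * 4;
                        r += rgba[i] as i32;
                        g += rgba[i + 1] as i32;
                        b += rgba[i + 2] as i32;
                        n += 1;
                    }
                }
            }
            let (r, g, b) = (r / n, g / n, b / n);
            out.u[cy * cw + cx] = (((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128) as u8;
            out.v[cy * cw + cx] = (((112 * r - 94 * g - 18 * b + 128) >> 8) + 128) as u8;
        }
    }
    out
}
```

A capture API that delivers I420 directly would skip this per-pixel pass entirely, which is the efficiency gain being discussed here.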

> you don't need to resize the image

But if I don't resize the captured RgbaImage to match the dimensions of the VideoFrame (and the dimensions of the VideoFrame match the dimensions that the local track was instantiated with), then how is it going to work? Or do you mean that the yuv_helper will perform some downscaling if necessary?

> Maybe it will be faster? Also I was thinking about making a new track instead of replacing the logo :)

Indeed so! It was just a very quick experiment to see how some "poor man's screen-sharing" could be built natively with the current version of LiveKit's SDK for a test project of mine, so I decided to share it in case it may be helpful 🙂 I do think that it would be much better if we could capture the screen via the SDK (avoiding dependencies like screenshots-rs); that would be more efficient and most likely faster.

daniel-abramov commented 10 months ago

@theomonnom I know that this PR seems to conflict with the upstream changes (perhaps I need to extract it into a separate example, which would be a bit cleaner and easier to understand as I would not need to change much of the wgpu_room), but I wanted to clarify the resizing question (I tested it today) before creating a follow-up PR, so that I know the best way to handle the problem.

Problem: When we publish a track, we create a NativeVideoSource with specific dimensions, we then use the same specific dimensions when creating a VideoFrame (or, specifically, when allocating a buffer inside a VideoFrame). Then, when the new frames containing the captured screen (or window) are received, we fill the underlying VideoFrame's buffer and call NativeVideoSource::capture_frame with that VideoFrame. This is problematic because if we e.g. capture a window, we cannot guarantee that its size remains constant, hence the frames that we capture may have different dimensions that are not aligned with the dimensions of the published track.
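To illustrate why a frame of a different size cannot simply be copied into the pre-allocated buffer, here is a small sketch of the I420 size arithmetic. The helpers `i420_len` and `fits_published_buffer` are hypothetical names for this example and are not part of the SDK.

```rust
/// Number of bytes in a tightly packed I420 frame: one full-resolution
/// luma plane plus two chroma planes subsampled by 2 in each direction.
pub fn i420_len(width: usize, height: usize) -> usize {
    let chroma_w = (width + 1) / 2;
    let chroma_h = (height + 1) / 2;
    width * height + 2 * chroma_w * chroma_h
}

/// True only when a captured frame has exactly the dimensions that the
/// buffer was allocated for. Matching byte counts alone are not enough,
/// because the row length (stride) would still differ.
pub fn fits_published_buffer(published: (usize, usize), captured: (usize, usize)) -> bool {
    published == captured
}
```

For example, a buffer allocated for 1280x720 holds 1,382,400 bytes, while a window that has been resized to 1000x700 needs 1,050,000 bytes laid out with a different row length, so either the buffer or the image has to change.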

Solutions: To me it seems like there are generally 2 possible solutions to that:

1. Once we spot that the underlying entity that is being captured (a window or a screen) changes its dimensions, we re-publish a track with the new NativeVideoSource with new dimensions that match the dimensions of the thing that we're capturing.
2. We do not re-publish the track. Instead, we resize the captured image to the dimensions of our NativeVideoSource (i.e. the dimensions of the VideoFrame's buffer), then again everything should work fine.

Both solutions seem to have pros and cons. (1) is good if the dimensions of the underlying captured entity do not change often (re-publishing is probably not that expensive, right?). Otherwise (2) would be more efficient (?), but not when the dimensions of the entity being captured are much smaller than the dimensions of the published track (in this case we'd need to perform upscaling). Also, if the underlying library produces RgbaImages when capturing a screen, resizing those may not be very efficient (not hardware-accelerated by default, AFAIK).
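One way to combine the two options is sketched below: resize while the captured size is still changing, and re-publish only once a new size has stayed stable for a number of frames. This hybrid is an assumption for illustration and not something the SDK provides; `Action`, `ResizePolicy`, and the `republish_after` threshold are hypothetical names.

```rust
/// What the capture loop should do with the current frame.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Frame already matches the published dimensions.
    PassThrough,
    /// Resize the frame to the published dimensions (option 2).
    Resize,
    /// Re-publish the track with the new dimensions (option 1).
    Republish,
}

/// Debounced policy: tolerate short-lived size changes by resizing,
/// and re-publish only after the new size has been stable for a while.
pub struct ResizePolicy {
    published: (u32, u32),
    pending: Option<(u32, u32)>,
    stable_frames: u32,
    republish_after: u32,
}

impl ResizePolicy {
    pub fn new(published: (u32, u32), republish_after: u32) -> Self {
        Self { published, pending: None, stable_frames: 0, republish_after }
    }

    pub fn on_frame(&mut self, captured: (u32, u32)) -> Action {
        if captured == self.published {
            self.pending = None;
            self.stable_frames = 0;
            return Action::PassThrough;
        }
        // Count how many consecutive frames have had this new size.
        if self.pending == Some(captured) {
            self.stable_frames += 1;
        } else {
            self.pending = Some(captured);
            self.stable_frames = 1;
        }
        if self.stable_frames >= self.republish_after {
            // The caller is expected to re-publish with `captured`.
            self.published = captured;
            self.pending = None;
            self.stable_frames = 0;
            Action::Republish
        } else {
            Action::Resize
        }
    }
}
```

With a threshold of a few seconds' worth of frames, dragging a window edge would only trigger resizing, while a window that settles at a new size would eventually get a track with matching dimensions.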

So I wonder what's the best strategy to handle it.

PS: Another topic, maybe a separate one, is how we handle this on the publishing side when we know that viewers (who presumably use the LiveKit JS SDK) render our screen-sharing in a small tile, much smaller than the dimensions of the published track. Should we respond to events from the other clients by adjusting or re-publishing the track, so that we do not encode the whole screen when we know that our screen-sharing is viewed on a 300x300 tile or something like that? I'm probably referring to what is called "dynacast" (but I'm not 100% sure about that) 🙂
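Whichever LiveKit mechanism ends up applying here, the arithmetic for picking a smaller encode resolution is simple. The sketch below scales the capture down to fit a viewer tile while preserving the aspect ratio and keeping dimensions even (which suits 2x2 chroma subsampling). `fit_within` is a hypothetical helper, and never upscaling is an assumption of this example.

```rust
/// Largest size that fits inside `max` while keeping the aspect ratio of
/// `src`. Never upscales; scaled results are rounded down to even values.
pub fn fit_within(src: (u32, u32), max: (u32, u32)) -> (u32, u32) {
    let (sw, sh) = (src.0 as u64, src.1 as u64);
    let (mw, mh) = (max.0 as u64, max.1 as u64);
    assert!(sw > 0 && sh > 0, "source dimensions must be non-zero");

    // Already small enough: keep the source size untouched.
    if sw <= mw && sh <= mh {
        return src;
    }

    // Compare aspect ratios with cross-multiplication to avoid floats.
    let (w, h) = if mw * sh <= mh * sw {
        // Width is the limiting dimension.
        (mw, sh * mw / sw)
    } else {
        // Height is the limiting dimension.
        (sw * mh / sh, mh)
    };

    // Round down to even values, but never below 2 pixels.
    let even = |v: u64| ((v & !1).max(2)) as u32;
    (even(w), even(h))
}
```

For a 1920x1080 screen shown in a 300x300 tile this yields 300x168, roughly 2% of the original pixel count.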

livekit / rust-sdks

example: replace logo video with a screen-sharing track of the primary screen #216