Add constraint to avoid inefficient (compressed) pixel formats

henbos commented 3 years ago

Capturing in compressed pixel formats such as MJPEG adds CPU overhead because the browser has to convert (decompress) every frame before delivery to the MediaStreamTrack and beyond.

An application that cares about both quality and performance might ask with non-required constraints for Full HD. If the user has a USB 3.0 camera, Full HD might be delivered without any compression overhead. Great! But if the user has a USB 2.0 camera, due to bus limitations, Full HD would (on cameras available for testing) be captured in MJPEG, adding this overhead. The application pays a performance debt, even though it might have been just as happy if it got HD frames at a lower cost.

TL;DR: Should we add a {video:{avoidCapturingExpensivePixelFormats:true}} constraint? I'm not married to the name :)

Motivation in Numbers

Frames are captured in one format, typically NV12 (420v), YUY2 (yuvs) or MJPEG (dmb1) and then converted. Chromium traditionally converts to I420 (y420) as this format is widely supported by encoders, though it is possible to have other destination pixel formats (e.g. NV12 is supported by some encoders which could allow for a zero-conversion pipeline in WebRTC).

While YUY2 to I420 is fairly cheap, MJPEG to I420 isn't as cheap.

I set up a thin "capture and convert" demo (code) and measured the CPU usage (utilization percentage normalized by CPU frequency using Intel Power Gadget and a script to obtain a sense of the "absolute" amount of work performed).

Here is the result of capturing in various formats*, converting to I420 at 30 fps and measuring the CPU and power consumption.

* Caveat: NV12 and YUY2 are captured with the built-in MacBook Pro camera and MJPEG is captured using an external Logitech Webcam C930e. The external webcam could contribute to some of the added CPU usage and power consumption, so it would be good to compare MJPEG on webcam with YUY2 on the same webcam, but the majority of the work is in the pixel conversions.

Capture Format	Resolution	Normalized CPU Usage [M cycles/s]	Power Consumption [Watt]
NV12 (420v)	640x480 (VGA)	26.51	3.10
...	1280x720 (HD)	28.94	3.23
YUY2 (yuvs)	640x480 (VGA)	20.57	2.98
...	1280x720 (HD)	30.97	3.31
MJPEG (dmb1)	640x480 (VGA)	52.85	4.85
...	1280x720 (HD)	67.99	5.28
...	1920x1080 (Full HD)	102.41	6.27

Note: I am not measuring the entire browser, I am only measuring a demo that does capturing and conversion.

In this example...

Capturing in YUY2 at HD instead of MJPEG at Full HD reduces the CPU usage from 102.41 to 30.97 M cycles/s (-70%) and the power consumption from 6.27 to 3.31 W (-47%). We save ~3 W.
Capturing in YUY2 at HD instead of MJPEG at HD reduces the CPU usage from 67.99 to 30.97 M cycles/s (-54%) and the power consumption from 5.28 to 3.31 W (-37%). We save ~2 W.
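For anyone who wants to re-check the percentages above, here is a minimal sketch of the arithmetic. The helper name `percentChange` and the object names are only illustrative; the numbers are copied from the table.

```typescript
// Relative change from `baseline` to `value`, in percent.
// A negative result means a reduction.
function percentChange(baseline: number, value: number): number {
  return ((value - baseline) / baseline) * 100;
}

// Figures copied from the measurement table (M cycles/s and Watt).
const mjpegFullHd = { cpu: 102.41, watts: 6.27 };
const mjpegHd = { cpu: 67.99, watts: 5.28 };
const yuy2Hd = { cpu: 30.97, watts: 3.31 };

// YUY2 at HD compared with MJPEG at Full HD.
console.log(percentChange(mjpegFullHd.cpu, yuy2Hd.cpu).toFixed(1));
console.log(percentChange(mjpegFullHd.watts, yuy2Hd.watts).toFixed(1));

// YUY2 at HD compared with MJPEG at HD.
console.log(percentChange(mjpegHd.cpu, yuy2Hd.cpu).toFixed(1));
console.log(percentChange(mjpegHd.watts, yuy2Hd.watts).toFixed(1));
```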

Proposal

Add a new video constraint, e.g. BooleanConstraint avoidCapturingExpensivePixelFormats, that if true allows the browser to skip pixel formats of a device that are deemed inefficient (e.g. MJPEG) if that same device supports capturing in other pixel formats.

On Logitech Webcam C930e, where Full HD is only available as MJPEG but 1024x576 and below is available as YUY2, getUserMedia would pick a lower resolution but avoid MJPEG.
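To make the intended behavior concrete, here is a rough sketch of the selection logic, not how any browser actually implements it. The types, the `pickFormat` helper, the distance measure and the format list are all assumptions made for illustration; in particular the list is not the webcam's full set of modes.

```typescript
type PixelFormat = "NV12" | "YUY2" | "MJPEG";

interface CaptureFormat {
  width: number;
  height: number;
  pixelFormat: PixelFormat;
}

// Assumption: only MJPEG is treated as "expensive" here.
function isExpensive(format: CaptureFormat): boolean {
  return format.pixelFormat === "MJPEG";
}

// Simple distance between a format and the ideal resolution (0 = exact match).
function resolutionDistance(f: CaptureFormat, idealWidth: number, idealHeight: number): number {
  const dw = Math.abs(f.width - idealWidth) / Math.max(f.width, idealWidth);
  const dh = Math.abs(f.height - idealHeight) / Math.max(f.height, idealHeight);
  return dw + dh;
}

function pickFormat(
  formats: CaptureFormat[],
  idealWidth: number,
  idealHeight: number,
  avoidExpensive: boolean,
): CaptureFormat | undefined {
  const cheap = formats.filter((f) => !isExpensive(f));
  // Skip expensive formats only if the same device offers an alternative.
  const candidates = avoidExpensive && cheap.length > 0 ? cheap : formats;

  let best: CaptureFormat | undefined;
  let bestDistance = Infinity;
  for (const f of candidates) {
    const d = resolutionDistance(f, idealWidth, idealHeight);
    if (d < bestDistance) {
      best = f;
      bestDistance = d;
    }
  }
  return best;
}

// Illustrative format list in the spirit of the webcam described above.
const webcamFormats: CaptureFormat[] = [
  { width: 1920, height: 1080, pixelFormat: "MJPEG" },
  { width: 1280, height: 720, pixelFormat: "MJPEG" },
  { width: 1024, height: 576, pixelFormat: "YUY2" },
  { width: 640, height: 480, pixelFormat: "YUY2" },
];

console.log(pickFormat(webcamFormats, 1920, 1080, true));
console.log(pickFormat(webcamFormats, 1920, 1080, false));
```

With the flag set, the sketch lands on 1024x576 in YUY2; without it, Full HD in MJPEG is picked. A device that only offers MJPEG still gets a format, since the flag is meant to skip formats, not to fail the request.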

P.S. This could result in a tradeoff between frame rate and resolution, more discussion needed.

fippo commented 3 years ago

Thoughts on exposing the capture format via getStats()? That would be the only way to measure this at large scale.

youennf commented 3 years ago

I am not sure adding another constraint is the best approach; constraints are difficult to use when they are not fully orthogonal to each other.

Let's say we do not care about device selection and we already have a video capture track. UAs could expose the native presets of the track camera. Developers could then ask for a particular preset. Pixel/frame rate downsampling adaptation could still be a thing through regular width/height/frameRate constraints on top of preset selection.

In that case, the native capture format information would naturally be exposed within presets so that web developers can decide which preset to use.

alvestrand commented 3 years ago

Interesting. In the context of mediacapture-insertable-streams (Breakout Box), I've wondered about allowing apps to specify their desired pixel format. In the camera above, I see that it can capture in NV12 or YUY2; if the pixel format can be preserved down the chain, we might save even more power. (I assume that the measurements above were done with a conversion to RGBA).

TL;DR: perhaps it should be a "pixelFormat=" constraint?

henbos commented 3 years ago

Replying to @youennf

I am not sure adding another constraint is the best approach; constraints are difficult to use when they are not fully orthogonal to each other.

Let's say we do not care about device selection and we already have a video capture track. UAs could expose the native presets of the track camera. Developers could then ask for a particular preset.

I don't like non-orthogonal constraints either, but I don't think you get around the "non-orthogonality" by making it a second, separate step, because the relevant variables (which device, resolution/fps and pixel format) are all intertwined.

On one hand, if you care about performance more than which device to pick, by the time you have your track already capturing from a device it would be "too late" to pick a camera with better pixel formats at a higher resolution. By removing the pixel format from the device-picking equation you might end up with the wrong device.

On the other hand, if you know which device to pick already and you just want to configure the best pixel format for an already known device, then 1) I'm not sure you gained much by splitting this up into a separate step, and 2) in order to change pixel format you have to close and re-open the camera, which makes the "start your camera" phase of a website appear to be glitchy.

That said, if the app knows a device's pixel formats, it could do the tradeoff between resolution and frame rate. For example, maybe MJPEG is "bad", but the device's support for other formats like YUY2 is so bad that you would end up capturing in VGA, in which case you could say "okay, I'll pick MJPEG even though it is bad". Though maybe you could get around this with constraints and penalties instead of making it impossible to pick MJPEG.
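As a sketch of what "constraints and penalties" could look like: instead of filtering MJPEG out, add a fixed penalty to its score and pick the lowest score. Everything here is an assumption for illustration, including the `pickWithPenalty` helper, the distance measure and the penalty value of 0.8, which was chosen arbitrarily.

```typescript
type Format = "NV12" | "YUY2" | "MJPEG";

interface Mode {
  width: number;
  height: number;
  pixelFormat: Format;
}

// Assumption: a fixed, arbitrary penalty applied to compressed formats.
const COMPRESSED_PENALTY = 0.8;

// Lower is better: resolution distance plus a penalty for MJPEG.
function score(mode: Mode, idealWidth: number, idealHeight: number): number {
  const dw = Math.abs(mode.width - idealWidth) / Math.max(mode.width, idealWidth);
  const dh = Math.abs(mode.height - idealHeight) / Math.max(mode.height, idealHeight);
  const penalty = mode.pixelFormat === "MJPEG" ? COMPRESSED_PENALTY : 0;
  return dw + dh + penalty;
}

function pickWithPenalty(modes: Mode[], idealWidth: number, idealHeight: number): Mode | undefined {
  let best: Mode | undefined;
  let bestScore = Infinity;
  for (const mode of modes) {
    const s = score(mode, idealWidth, idealHeight);
    if (s < bestScore) {
      best = mode;
      bestScore = s;
    }
  }
  return best;
}

// Hypothetical device A: uncompressed capture is available at HD.
const deviceA: Mode[] = [
  { width: 1920, height: 1080, pixelFormat: "MJPEG" },
  { width: 1280, height: 720, pixelFormat: "YUY2" },
];

// Hypothetical device B: uncompressed capture is only available at VGA.
const deviceB: Mode[] = [
  { width: 1920, height: 1080, pixelFormat: "MJPEG" },
  { width: 640, height: 480, pixelFormat: "YUY2" },
];

console.log(pickWithPenalty(deviceA, 1920, 1080)); // HD in YUY2 beats penalized MJPEG
console.log(pickWithPenalty(deviceB, 1920, 1080)); // MJPEG wins because VGA is too far off
```

The point of the penalty is that MJPEG stays selectable: it only loses when an uncompressed mode is close enough to the requested resolution.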

Replying to @alvestrand

In the camera above, ...

Yes, but note the "caveat" above the table: NV12 and YUY2 are from the built-in MacBook Pro camera and MJPEG is from an external webcam. This external webcam does also support YUY2 (not NV12), but the set of resolutions and frame rates in this format differs from the table's YUY2.

To get a full picture of the impact I need to compare the external webcam's MJPEG with the external webcam's YUY2. For example, there may also be overhead due to the fact that the camera is external, but I have not done those measurements yet!

(I assume that the measurements above were done with a conversion to RGBA).

They're all conversions to I420, which is supported by all Chromium's encoders and decoders. Some encoders support NV12, but not all of them, so it's not guaranteed that NV12 is better than YUY2.

I see that it can capture in NV12 or YUY2; if the pixel format can be preserved down the chain, we might save even more power. [...] TL;DR: perhaps it should be a "pixelFormat=" constraint?

We might, but we might also not, depending on encoders, rendering capabilities, etc. Which format is or isn't a good choice would be very much dependent on implementation details. For the time being, we can say "MJPEG is bad because it is compressed", but having the application decide whether to use YUY2 or NV12 doesn't make much sense without exposing other browser and system capabilities. And the tradeoff between YUY2 and NV12 is much, much smaller than the difference between any of those and MJPEG.

youennf commented 3 years ago

I don't like non-orthogonal constraints either, but I don't think you get around the "non-orthogonality" by making it a second, separate step.

Agreed that constraints are the only way for device picking. There is a tension, though, between constraints and the in-chrome device picker we envision in the future. It would be good to understand what we think of the future of constraints in that in-chrome device picker world.

Back to my original point, I think that the current API makes it very hard for web developers to handle non-orthogonal constraints correctly. Your proposal here further strengthens the idea that we should expose native preset information to web developers. This could be through getCapabilities or in some other form.

Once web developers have that information, applyConstraints is probably fine as long as we ensure that there will be an easy and unambiguous way for web developers to end up selecting the particular preset. I guess this could be dealt with as a separate issue.

alvestrand commented 3 years ago

Non-orthogonal constraints are why both getUserMedia() and applyConstraints() insist on entering all the constraints at once, rather than allowing you to set one parameter at a time; they featured heavily in the discussions about API shape (the most obvious non-orthogonal constraints are width, height and aspect ratio).

The idea of device-generated presets - "this set of constraints will work well together" - is appealing, but of course it's a huge fingerprinting surface, which means that it can only be available after opening the device.

WRT reopening: yes, drivers have to reopen for reconfiguration, and for some drivers/cameras, this is an expensive, slow and error-prone operation (which has been one of the major roadblocks to the bug of not turning the indicator light off when a camera only sources muted tracks). Still, I think we'll have to eat that cost at some point.

henbos commented 3 years ago

Perhaps we should just say "powerEfficient: true" and revisit full-blown app picking the exact capturing format in a separate issue. I do still think there is usefulness in the browser having a say in which pixel format to use. E.g. NV12 might be more efficient for rendering on macOS, but I don't know if that is true on Windows, and so on.

youennf commented 3 years ago

The idea of device-generated presets - "this set of constraints will work well together" - is appealing, but of course it's a huge fingerprinting surface, which means that it can only be available after opening the device.

Agreed.

WRT reopening: yes, drivers have to reopen for reconfiguration, and for some drivers/cameras, this is an expensive, slow and error-prone operation

getUserMedia is often implemented as follows:

1. Get camera permission.
2. Start opening the camera.
3. Resolve the promise with a stream and execute the corresponding JS handler.
4. Finish opening the camera (asynchronous task).

A User-Agent could decide to execute step 3 before step 2. This would allow the web application to call applyConstraints before the camera is started so that a single camera configuration is done.

This would delay opening the camera a bit. I haven't measured by how much in Safari but this may be acceptable. The nice thing about this flow is that this allows getUserMedia to concentrate solely on selection, with a hopefully very limited perf penalty.
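From the web application's side, this flow could look roughly like the sketch below: getUserMedia() is used for selection only, and applyConstraints() is called right away in the promise handler. The `*Like` interfaces and `openCameraConfigured` are hypothetical stand-ins so the sketch is self-contained; in a browser one would presumably pass `navigator.mediaDevices`. Whether the camera is really configured only once depends on the User-Agent reordering the steps as described, which the sketch cannot show.

```typescript
// Minimal stand-ins for the browser interfaces used by this flow.
interface TrackSettingsLike {
  width?: number;
  height?: number;
  frameRate?: number;
}

interface TrackLike {
  applyConstraints(constraints: TrackSettingsLike): Promise<void>;
}

interface StreamLike {
  getVideoTracks(): TrackLike[];
}

interface MediaDevicesLike {
  getUserMedia(constraints: { video: boolean }): Promise<StreamLike>;
}

async function openCameraConfigured(
  mediaDevices: MediaDevicesLike,
  settings: TrackSettingsLike,
): Promise<TrackLike> {
  // Selection only: no resolution or frame rate constraints at this stage.
  const stream = await mediaDevices.getUserMedia({ video: true });
  const track = stream.getVideoTracks()[0];
  if (!track) {
    throw new Error("no video track in the returned stream");
  }
  // Configure immediately, before doing anything else with the track, so a
  // User-Agent that has not yet started the camera can apply a single
  // configuration.
  await track.applyConstraints(settings);
  return track;
}
```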

henbos commented 3 years ago

I just heard that some Intel hardware may support HW-accelerated MJPEG to NV12 conversion, which, if true and used by browsers, would make which pixel formats are efficient browser- and HW-dependent. By phrasing the constraint in terms of efficiency rather than explicitly allowing or disallowing certain pixel formats, we could enable or disable MJPEG depending on hardware capabilities.

jan-ivar commented 2 years ago

@henbos is this issue still of interest?

henbos commented 2 years ago

It could be, but I'm not working on this at the moment. How about iceboxing it?

henbos commented 2 years ago

I'd like to revisit this now. I'm thinking of the performance difference of capturing in 1080p if you have MJPEG or if you have any non-compressed pixel format. With NV12 capture and HW encoding doing 1080p might not be that big of a deal, but MJPEG and SW encoding... that's another story. It'd be good to avoid the bad path before it happens, so a constraint like this could be more than helpful.

w3c / mediacapture-extensions

Add constraint to avoid inefficient (compressed) pixel formats #13
