Open eehakkin opened 2 months ago
Thanks @eehakkin. In the explainer, we list the differences between Blur and Mask, provide example code for creating a green screen using this feature, and include a demo of what BG Segmentation MASK looks like in our Chrome PoC and what you can do with it (replacement, GIF, image, green screen, etc.).
In many cases it might be important to have access to the original camera feed, so BG MASK keeps the original frames intact, performs segmentation, and provides mask frames in addition to the original video frames. Web applications thus receive both the original frames and the mask frames in the same video frame stream.
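A rough sketch of what consuming that combined stream could look like. This is an assumption, not the final API shape: the constraint and metadata member names (`backgroundSegmentationMask`) are placeholders, and the capability check is illustrative only.

```javascript
// Sketch only: "backgroundSegmentationMask" as a constraint name and as
// a VideoFrame metadata member are ASSUMED names, not settled spec text.
async function startMaskedCapture(onFrame) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = stream.getVideoTracks();

  // Check the capability before trying to enable it (illustrative).
  const caps = track.getCapabilities();
  if (!caps.backgroundSegmentationMask) {
    throw new Error('Background segmentation mask is not supported');
  }
  await track.applyConstraints({
    advanced: [{ backgroundSegmentationMask: true }],
  });

  // Original frames and mask data arrive in the same stream; the mask
  // rides along with each VideoFrame rather than replacing it.
  const processor = new MediaStreamTrackProcessor({ track });
  const reader = processor.readable.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    const mask = frame.metadata().backgroundSegmentationMask; // assumed member
    onFrame(frame, mask);
    frame.close();
  }
}
```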
This PR follows up on our presentation of BG Segmentation MASK in the monthly WebRTC WG call [Minutes].
PTAL @jan-ivar @aboba @alvestrand @youennf
By the way, having spoken to some people who work on camera effects in video-conferencing applications, I have some more feedback. (Not sure if this has been discussed in the past.)
Video conferencing applications often have to be very careful about what models they use, two interesting reasons being:
- Inclusion. For example, ensuring people of different skin color are treated equitably. Not only is this important for ensuring customer satisfaction - sometimes it's even a regulatory requirement.
- Consistency.
I am getting the feeling that, if we want serious Web apps to use this valuable work, it might be necessary to also expose something about the underlying model. I am not sure what the MVP is in that regard; possibly even just some stable identifier that apps can use against an allowlist of models/implementation that they had vetted and found sufficient?
By the way, having spoken to some people who work on camera effects in video-conferencing applications, I have some more feedback. (Not sure if this has been discussed in the past.)
Video conferencing applications often have to be very careful about what models they use, two interesting reasons being:
- Inclusion. For example, ensuring people of different skin color are treated equitably. Not only is this important for ensuring customer satisfaction - sometimes it's even a regulatory requirement.
- Consistency.
I am getting the feeling that, if we want serious Web apps to use this valuable work, it might be necessary to also expose something about the underlying model. I am not sure what the MVP is in that regard; possibly even just some stable identifier that apps can use against an allowlist of models/implementation that they had vetted and found sufficient?
Good feedback. The way we plan to implement this API today in Chrome/Edge is by using the platform models that presently ship by default with the underlying OS. On Windows, these would be the Windows Studio Effects models; on macOS, Apple's Vision models; on ChromeOS, it is likely to be a MediaPipe selfie segmenter when it happens.
If you are making a native app today without bringing your own models, you will likely use what the platform provides. I would say OS teams do take care of inclusion when training the models; I can see the model card and training info on a few MediaPipe/TFLite models.
I think when apps bring their own models, this is a serious issue to consider. Also, when major platforms ship efficient on-device models by default in the OS, does it make sense for every app to bring its own segmentation model? It is a differentiation vs. efficiency trade-off.
Consistency:
I hear that many would like to have the same UX across platforms, so that their use cases - green screen, BG replacement - look the same. That's why this would give the mask data, and developers can implement their use case on top of that. I understand the mask data itself won't be pixel perfect across platforms, but could they use MediaStreamTrackProcessor or Canvas operations with the mask data to minimize any difference in the underlying models?
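One way the Canvas-based smoothing mentioned above could work is to threshold the soft, per-platform segmentation mask into a hard alpha before compositing, so models that differ only in their confidence gradients produce similar cutouts. A hedged sketch, assuming the mask is delivered as an ImageBitmap whose alpha channel encodes foreground probability:

```javascript
// Pure helper: harden a soft mask by snapping each pixel's alpha to
// fully opaque or fully transparent. Reduces visible differences
// between platform models that disagree only near the cutoff.
function thresholdAlpha(pixels, cutoff = 128) {
  const out = new Uint8ClampedArray(pixels); // copy, don't mutate input
  for (let i = 3; i < out.length; i += 4) {
    out[i] = out[i] >= cutoff ? 255 : 0;
  }
  return out;
}

// Browser-only sketch: composite a green screen from a frame and a
// (pre-thresholded) mask ImageBitmap on a 2D canvas context.
function drawGreenScreen(ctx, frame, mask) {
  const { width, height } = ctx.canvas;
  ctx.clearRect(0, 0, width, height);
  ctx.drawImage(frame, 0, 0, width, height);
  // Keep only pixels where the mask is opaque (the foreground).
  ctx.globalCompositeOperation = 'destination-in';
  ctx.drawImage(mask, 0, 0, width, height);
  // Paint green behind the remaining foreground.
  ctx.globalCompositeOperation = 'destination-over';
  ctx.fillStyle = '#00ff00';
  ctx.fillRect(0, 0, width, height);
  ctx.globalCompositeOperation = 'source-over';
}
```

In practice `thresholdAlpha` would run over the mask's `ImageData` (via `getImageData`/`putImageData`) before `drawGreenScreen` composites it.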
I think we should spin off the discussion about identifying the model (or some of its properties) out of this PR and into an issue.
Just some quick clarifications, though.
I think we should spin off the discussion about identifying the model (or some of its properties) out of this PR and into an issue.
Just some quick clarifications, though.
- I imagine everyone makes a serious effort to be inclusive nowadays. But video-conferencing applications might nevertheless face a regulatory requirement to demonstrate that they had done some due diligence before relying on a model provided by a third party. So the concern I am raising here is not "is the model inclusive" but rather "can an app using the model know that it's inclusive and make that claim to regulators." (I'm not an expert here and I do not intend to cosplay one. Just a topic for you to consider if you want to ensure widespread adoption of this API.)
@aboba: Is it possible to share more information about how Microsoft does due diligence before putting models in the OS?
- The standards by which inclusion is judged may change over time. It might be necessary to update allowlists and blocklists of models over time. A Web-based video-conferencing app in 2027 might not be able to rely on a model built into an un-updated user agent from 2025.
Very true. I am expecting platform vendors to update models (maybe via drivers or OS updates) as hardware becomes more capable.
- The specific worries about consistency which I am channeling here, are about the consistency of the segmentation model.
Even if a video conferencing app runs on {UA, UA-version, OS, OS-version}, it might still not know definitively which model is used, as that might be subject to experiments, out-of-band updates, etc. Apps might require more information exposed to them about the segmentation model before they can use it.
This API was discussed in https://www.w3.org/2024/04/23-webrtc-minutes.html#t08
I replaced `partial interface VideoFrame` with `partial dictionary VideoFrameMetadata`. That is a more standard way to extend a `VideoFrame`, I suppose. I also changed the type of the new member from `VideoFrame` to `ImageBitmap`. That avoids recursion.
I should also add an example later, I think.
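Such an example might look roughly like the following. The metadata member name `backgroundSegmentationMask` is an assumption for illustration; the PR only settles on extending `VideoFrameMetadata` with an `ImageBitmap` member.

```javascript
// Sketch of consuming the proposed metadata in a pipeline, assuming an
// (unconfirmed) "backgroundSegmentationMask" ImageBitmap member on
// VideoFrameMetadata. The original frame passes through untouched.
function maskExtractingTransform(onMask) {
  return new TransformStream({
    transform(frame, controller) {
      const { backgroundSegmentationMask } = frame.metadata();
      if (backgroundSegmentationMask) {
        onMask(backgroundSegmentationMask); // ImageBitmap with the mask
      }
      controller.enqueue(frame); // forward the original frame
    },
  });
}
```

The transform could be spliced between a `MediaStreamTrackProcessor` and a `MediaStreamTrackGenerator` via `pipeThrough`.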
Note that this still requires registering the metadata as was done in https://github.com/w3c/webcodecs/issues/607.
Hi!
This adds capabilities, constraints and settings for background segmentation mask. Those are fairly obvious.
For the feature to be useful, the actual background segmentation mask must be provided to web apps. There are various ways to do that:
/cc @riju
Preview | Diff