w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Face Detection. #44

Open riju opened 2 years ago

riju commented 2 years ago

Why?

- Face detection for video conferencing.
- Support WebRTC-NV use cases like Funny Hats, etc.
- On the client side, developers have to use computer vision libraries (OpenCV.js / TensorFlow.js), either with a WASM (SIMD + threads) or a GPU backend, to get acceptable performance.
- Many developers resort to cloud-based solutions such as the Face API from Azure Cognitive Services or Face Detection from Google Cloud's Vision API.
- On modern client platforms, we can avoid a lot of data movement, and even on-device computation, by leveraging the work the camera stack / Image Processing Unit (IPU) already does to improve image quality, essentially for free.

What?

Prior Work

WICG has proposed the Shape Detection API, which enables Web applications to use a system-provided face detector, but the API requires that the image data be provided by the Web application itself. To use the API, the application would first need to capture frames from a camera and then hand the data to the Shape Detection API. This may not only cause extraneous computation and copies of the frame data, but may outright prevent using camera-dedicated hardware or system libraries for face detection. The camera stack often performs face detection anyway to improve image quality (e.g. for 3A algorithms), and those results could be made available to applications without extra computation.
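
For comparison, this is roughly what the capture-then-detect flow looks like today with the Shape Detection API (a sketch; FaceDetector availability varies and is behind a flag in Chromium):

```js
// The app must capture frames itself and hand the pixels to the detector.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const video = document.querySelector("video");
video.srcObject = stream;
await video.play();

const detector = new FaceDetector({ fastMode: true, maxDetectedFaces: 5 });
// detect() re-processes pixel data that the camera stack may already
// have analyzed internally for its own 3A algorithms.
const faces = await detector.detect(video);
for (const face of faces) {
  console.log(face.boundingBox);
}
```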

Many platforms offer a camera API which can perform face detection directly on image frames from the system camera. The face detection may be hardware-assisted, in which case it may not be possible to apply the functionality to user-provided image data, or the API may simply not accept such data.

Platform Support

| OS | API | Face detection |
| --- | --- | --- |
| Windows | Media Foundation | KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION |
| ChromeOS / Android | Camera HAL3 | STATISTICS_FACE_DETECT_MODE_FULL, STATISTICS_FACE_DETECT_MODE_SIMPLE |
| Linux | GStreamer | facedetect |
| macOS | Core Image, Vision | CIDetectorTypeFace, VNDetectFaceRectanglesRequest |
ChromeOS + Android

Chrome OS and Android provide the Camera HAL3 API for any camera user. The API specifies a method to transfer various image-related metadata to applications; one metadata type contains information on detected faces. The face detection mode is selected with STATISTICS_FACE_DETECT_MODE:

- STATISTICS_FACE_DETECT_MODE_FULL returns face rectangles, scores, and landmarks, including eye and mouth positions.
- STATISTICS_FACE_DETECT_MODE_SIMPLE returns only face rectangles and confidence values.

On Android, the resulting face statistics are parsed and stored in the Face class.
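
Purely as an illustration of how these two modes might surface on the Web (the constraint name and values below are hypothetical, not part of the strawman further down):

```js
// Hypothetical: map the two HAL3 modes onto a getUserMedia constraint.
// 'faceDetectionMode' is an invented name used only for illustration.
const stream = await navigator.mediaDevices.getUserMedia({
  video: {
    faceDetectionMode: "full" // "full": rectangles + landmarks; "simple": rectangles only
  }
});
```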

Windows

Face detection is performed in DeviceMFT on the preview frame buffers. The DeviceMFT integrates the face detection library and turns the feature on when requested by the application. Face detection is enabled with the property ID KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION. When enabled, the face detection results are returned in the metadata attribute MF_CAPTURE_METADATA_FACEROIS, which contains the coordinates of each detected face:

typedef struct tagFaceRectInfo {
  RECT Region;          // Bounding box of the detected face
  LONG confidenceLevel; // Confidence level of the detection
} FaceRectInfo;

The API also supports blink and smile detection, which can be enabled with the property IDs KSCAMERA_EXTENDEDPROP_FACEDETECTION_BLINK and KSCAMERA_EXTENDEDPROP_FACEDETECTION_SMILE.

macOS

Apple offers face detection via Core Image (CIDetectorTypeFace) or Vision (VNDetectFaceRectanglesRequest).

How?

Strawman proposal

<script>
// Check whether face detection is supported by the browser.
const supports = navigator.mediaDevices.getSupportedConstraints();
if (supports.faceDetection) {
    // Browser supports camera face detection.
} else {
    throw new Error("Face detection is not supported");
}

// Open the camera with face detection enabled and show it to the user.
const stream = await navigator.mediaDevices.getUserMedia({
    video: { faceDetection: true }
});
const video = document.querySelector("video");
video.srcObject = stream;

// Get the face detection results for the latest frame.
const [videoTrack] = stream.getVideoTracks();
const settings = videoTrack.getSettings();
if (settings.faceDetection) {
    const detectedFaces = settings.detectedFaces;
    for (const face of detectedFaces) {
        console.log(
         ` Face @ (${face.boundingBox.x}, ${face.boundingBox.y}),` +
         ` size ${face.boundingBox.width}x${face.boundingBox.height}`);
    }
}
</script>
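
Since detection results are per-frame, an app would presumably want to re-read them as frames arrive. A minimal sketch of that, assuming the proposed faceDetection constraint and detectedFaces settings field above (drawOverlay is a hypothetical app-defined helper):

```js
// Hypothetical: poll the proposed detectedFaces setting once per
// rendered video frame via HTMLVideoElement.requestVideoFrameCallback.
const [track] = stream.getVideoTracks();

function onFrame() {
  const { detectedFaces = [] } = track.getSettings();
  for (const face of detectedFaces) {
    drawOverlay(face.boundingBox); // hypothetical app-defined rendering
  }
  video.requestVideoFrameCallback(onFrame);
}
video.requestVideoFrameCallback(onFrame);
```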
youennf commented 2 years ago

Seems worth moving to https://github.com/w3c/mediacapture-extensions

dontcallmedom commented 2 years ago

This was presented and discussed during a TPAC 2021 breakout, and further discussed during the Nov 2021 WebRTC meeting.

From the latter, feedback included:

youennf commented 2 years ago

A few thoughts from the past meeting:

youennf commented 2 years ago

As for contour information vs. simpler rectangle information, I'd like to understand what drivers currently generate (my guess is a set of rectangles) and what they might produce in the future (contours, maybe?). Starting simple with a set of rectangles does not seem too bad to me, provided it is what drivers currently generate (and will probably generate for some time) and it suits reasonably well the processing that would make use of such data.