w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Face Detection. #44

Open riju opened 2 years ago

riju commented 2 years ago

Why?

- Face detection for video conferencing.
- Support WebRTC-NV use cases like Funny Hats, etc.
- On the client side, developers have to use computer vision libraries (OpenCV.js / TensorFlow.js), either with a WASM (SIMD + threads) or a GPU backend, to get acceptable performance.
- Many developers resort to cloud-based solutions such as the Face API from Azure Cognitive Services or Face Detection from Google Cloud's Vision API.
- On modern client platforms, we can avoid a lot of data movement, and even on-device computation, by leveraging the work the camera stack / Image Processing Unit (IPU) already does to improve image quality, essentially for free.

What?

Prior Work

WICG has proposed the Shape Detection API, which enables Web applications to use a system-provided face detector, but the API requires that the image data be provided by the Web application itself. To use the API, the application would first need to capture frames from a camera and then hand the data to the Shape Detection API. This may not only cause extraneous computation and copies of the frame data, but may outright prevent using camera-dedicated hardware or system libraries for face detection. The camera stack often performs face detection anyway to improve image quality (e.g. for 3A algorithms), and those results could be made available to applications without extra computation.
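
For comparison, this is roughly what the capture-then-detect flow looks like today with the Shape Detection API (a sketch; FaceDetector availability varies and is behind a flag in Chromium):

```js
// The app must capture frames itself and hand the pixels to the detector.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const video = document.querySelector("video");
video.srcObject = stream;
await video.play();

const detector = new FaceDetector({ fastMode: true, maxDetectedFaces: 5 });
// detect() re-processes pixel data that the camera stack may already
// have analyzed internally for its own 3A algorithms.
const faces = await detector.detect(video);
for (const face of faces) {
  console.log(face.boundingBox);
}
```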

Many platforms offer a camera API which can perform face detection directly on image frames from the system camera. The face detection may be hardware-assisted, in which case it may not be possible to apply the functionality to user-provided image data, or the API may simply not accept such data.

Platform Support

| OS | API | Face detection |
| --- | --- | --- |
| Windows | Media Foundation | KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION |
| ChromeOS / Android | Camera HAL3 | STATISTICS_FACE_DETECT_MODE_FULL, STATISTICS_FACE_DETECT_MODE_SIMPLE |
| Linux | GStreamer | facedetect |
| macOS | Core Image, Vision | CIDetectorTypeFace, VNDetectFaceRectanglesRequest |
ChromeOS + Android

Chrome OS and Android provide the Camera HAL3 API for any camera user. The API specifies a method to transfer various image-related metadata to applications; one metadata type contains information on detected faces. The face detection mode is selected with STATISTICS_FACE_DETECT_MODE:

- STATISTICS_FACE_DETECT_MODE_FULL returns face rectangles, scores, and landmarks, including eye and mouth positions.
- STATISTICS_FACE_DETECT_MODE_SIMPLE returns only face rectangles and confidence values.

On Android, the resulting face statistics are parsed and stored in the Face class.
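
Purely as an illustration of how these two modes might surface on the Web (the constraint name and values below are hypothetical, not part of the strawman further down):

```js
// Hypothetical: map the two HAL3 modes onto a getUserMedia constraint.
// 'faceDetectionMode' is an invented name used only for illustration.
const stream = await navigator.mediaDevices.getUserMedia({
  video: {
    faceDetectionMode: "full" // "full": rectangles + landmarks; "simple": rectangles only
  }
});
```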

Windows

Face detection is performed in DeviceMFT on the preview frame buffers. The DeviceMFT integrates the face detection library and turns the feature on when requested by the application. Face detection is enabled with the property ID KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION. When enabled, the face detection results are returned in the metadata attribute MF_CAPTURE_METADATA_FACEROIS, which contains the coordinates of each detected face:

typedef struct tagFaceRectInfo {
  RECT Region;          // Bounding box of the detected face
  LONG confidenceLevel; // Confidence level of the detection
} FaceRectInfo;

The API also supports blink and smile detection, which can be enabled with the property IDs KSCAMERA_EXTENDEDPROP_FACEDETECTION_BLINK and KSCAMERA_EXTENDEDPROP_FACEDETECTION_SMILE.

macOS

Apple offers face detection via Core Image (CIDetectorTypeFace) or Vision (VNDetectFaceRectanglesRequest).

How?

Strawman proposal

<script>
// Check whether face detection is supported by the browser.
const supports = navigator.mediaDevices.getSupportedConstraints();
if (supports.faceDetection) {
    // Browser supports camera face detection.
} else {
    throw new Error("Face detection is not supported");
}

// Open the camera with face detection enabled and show it to the user.
const stream = await navigator.mediaDevices.getUserMedia({
    video: { faceDetection: true }
});
const video = document.querySelector("video");
video.srcObject = stream;

// Get the face detection results for the latest frame.
const [videoTrack] = stream.getVideoTracks();
const settings = videoTrack.getSettings();
if (settings.faceDetection) {
    const detectedFaces = settings.detectedFaces;
    for (const face of detectedFaces) {
        console.log(
         ` Face @ (${face.boundingBox.x}, ${face.boundingBox.y}),` +
         ` size ${face.boundingBox.width}x${face.boundingBox.height}`);
    }
}
</script>
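
Since detection results are per-frame, an app would presumably want to re-read them as frames arrive. A minimal sketch of that, assuming the proposed faceDetection constraint and detectedFaces settings field above (drawOverlay is a hypothetical app-defined helper):

```js
// Hypothetical: poll the proposed detectedFaces setting once per
// rendered video frame via HTMLVideoElement.requestVideoFrameCallback.
const [track] = stream.getVideoTracks();

function onFrame() {
  const { detectedFaces = [] } = track.getSettings();
  for (const face of detectedFaces) {
    drawOverlay(face.boundingBox); // hypothetical app-defined rendering
  }
  video.requestVideoFrameCallback(onFrame);
}
video.requestVideoFrameCallback(onFrame);
```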
youennf commented 2 years ago

Seems worth moving to https://github.com/w3c/mediacapture-extensions

dontcallmedom commented 2 years ago

This was presented and discussed during a TPAC 2021 breakout, and further discussed during the Nov 2021 WebRTC meeting.

From the latter, feedback included:

youennf commented 2 years ago

A few thoughts from the past meeting:

youennf commented 2 years ago

As for contour information vs. simpler rectangle information, I'd like to understand what drivers currently generate (my guess is a set of rectangles) and what they might produce in the future (contours, maybe?). Starting simple with a set of rectangles does not seem too bad to me, provided it is what drivers currently generate (and will probably generate for some time) and it suits reasonably well the processing that would make use of such data.