w3c / mediacapture-extensions

Extensions to Media Capture and Streams by the WebRTC Working Group
https://w3c.github.io/mediacapture-extensions/

Add face detection constraints and VideoFrame attributes #48

Closed eehakkin closed 1 year ago

eehakkin commented 2 years ago

This spec update is a follow-up to w3c/mediacapture-image#292 and allows face detection as described in #44.

The changes include new face detection constrainable properties which are used to control face detection.

The face detection results are exposed by VideoFrames through a new readonly detectedFaces sequence attribute.

This allows the following kind of code to be used for face detection:

// main.js:
// Check if face detection is supported by the browser
const supports = navigator.mediaDevices.getSupportedConstraints();
if (supports.faceDetectionMode) {
  // Browser supports face contour detection.
} else {
  throw new Error('Face contour detection is not supported');
}

// Open camera with face detection enabled
const stream = await navigator.mediaDevices.getUserMedia({
  video: {faceDetectionMode: ['bounding-box', 'contour']}
});
const [videoTrack] = stream.getVideoTracks();

// Use a video worker and show to user.
const videoElement = document.querySelector("video");
const videoGenerator = new MediaStreamTrackGenerator({kind: 'video'});
const videoProcessor = new MediaStreamTrackProcessor({track: videoTrack});
const videoSettings = videoTrack.getSettings();
const videoWorker = new Worker('video-worker.js');
videoWorker.postMessage({
  videoReadable: videoProcessor.readable,
  videoWritable: videoGenerator.writable
}, [videoProcessor.readable, videoGenerator.writable]);
videoElement.srcObject = new MediaStream([videoGenerator]);

// video-worker.js:
self.onmessage = async function(e) {
  const videoTransformer = new TransformStream({
    async transform(videoFrame, controller) {
      for (const face of videoFrame.detectedFaces) {
        console.log(
          `Face @ (${face.contour[0].x}, ${face.contour[0].y}), ` +
                 `(${face.contour[1].x}, ${face.contour[1].y}), ` +
                 `(${face.contour[2].x}, ${face.contour[2].y}), ` +
                 `(${face.contour[3].x}, ${face.contour[3].y})`);
      }
      controller.enqueue(videoFrame);
    }
  });
  await e.data.videoReadable
    .pipeThrough(videoTransformer)
    .pipeTo(e.data.videoWritable);
};


eehakkin commented 2 years ago

Sorry for closing and reopening. This one should be open and w3c/mediacapture-image#292 should be closed.

riju commented 2 years ago

@alvestrand, @youennf: We tried to incorporate the review comments as per our last discussions. Could you please take a look?

riju commented 2 years ago

Friendly ping @alvestrand, @youennf, @jan-ivar

dontcallmedom commented 2 years ago

@riju I think it would help if you could document exactly which comments from the last discussion you incorporated, and how - for instance, I still see a FaceExpression enum - with fewer values, but still some.

eehakkin commented 2 years ago

@dontcallmedom I removed face expressions completely.

eehakkin commented 2 years ago

The following example from #57 shows how to use face detection, background concealment (see #45) and eye gaze correction (see #56) with MediaStreamTrack Insertable Media Processing using Streams:

// main.js:
// Open camera.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [videoTrack] = stream.getVideoTracks();

// Use a video worker and show to user.
const videoElement = document.querySelector('video');
const videoWorker = new Worker('video-worker.js');
videoWorker.postMessage({track: videoTrack}, [videoTrack]);
const {data} = await new Promise(r => videoWorker.onmessage = r);
videoElement.srcObject = new MediaStream([data.videoTrack]);

// video-worker.js:
self.onmessage = async ({data: {track}}) => {
  // Apply constraints.
  let customBackgroundBlur = true;
  let customEyeGazeCorrection = true;
  let customFaceDetection = false;
  let faceDetectionMode;
  const capabilities = track.getCapabilities();
  if (capabilities.backgroundBlur && capabilities.backgroundBlur.max > 0) {
    // The platform supports background blurring.
    // Let's use platform background blurring and skip the custom one.
    await track.applyConstraints({
      advanced: [{backgroundBlur: capabilities.backgroundBlur.max}]
    });
    customBackgroundBlur = false;
  } else if ((capabilities.faceDetectionMode || []).includes('contour')) {
    // The platform supports face contour detection but not background
    // blurring. Let's use platform face contour detection to aid custom
    // background blurring.
    faceDetectionMode ||= 'contour';
    await track.applyConstraints({
      advanced: [{faceDetectionMode}]
    });
  } else {
    // The platform does not support background blurring nor face contour
    // detection. Let's use custom face contour detection to aid custom
    // background blurring.
    customFaceDetection = true;
  }
  if ((capabilities.eyeGazeCorrection || []).includes(true)) {
    // The platform supports eye gaze correction.
    // Let's use platform eye gaze correction and skip the custom one.
    await track.applyConstraints({
      advanced: [{eyeGazeCorrection: true}]
    });
    customEyeGazeCorrection = false;
  } else if ((capabilities.faceDetectionLandmarks || []).includes(true)) {
    // The platform supports face landmark detection but not eye gaze
    // correction. Let's use platform face landmark detection to aid custom eye
    // gaze correction.
    faceDetectionMode ||= 'presence';
    await track.applyConstraints({
      advanced: [{
        faceDetectionLandmarks: true,
        faceDetectionMode
      }]
    });
  } else {
    // The platform does not support eye gaze correction nor face landmark
    // detection. Let's use custom face landmark detection to aid custom eye
    // gaze correction.
    customFaceDetection = true;
  }

  // Load custom libraries which may utilize TensorFlow and/or WASM.
  const requiredScripts = [].concat(
    customBackgroundBlur    ? 'background.js' : [],
    customEyeGazeCorrection ? 'eye-gaze.js'   : [],
    customFaceDetection     ? 'face.js'       : []
  );
  importScripts(...requiredScripts);

  const generator = new VideoTrackGenerator();
  self.postMessage({videoTrack: generator.track}, [generator.track]);
  const {readable} = new MediaStreamTrackProcessor({track});
  const transformer = new TransformStream({
    async transform(frame, controller) {
      // Detect faces or retrieve detected faces.
      const detectedFaces =
        customFaceDetection
          ? await detectFaces(frame)
          : frame.detectedFaces;
      // Blur the background if needed.
      if (customBackgroundBlur) {
        const newFrame = await blurBackground(frame, detectedFaces);
        frame.close();
        frame = newFrame;
      }
      // Correct the eye gaze if needed.
      if (customEyeGazeCorrection && (detectedFaces || []).length > 0) {
        const newFrame = await correctEyeGaze(frame, detectedFaces);
        frame.close();
        frame = newFrame;
      }
      controller.enqueue(frame);
    }
  });
  await readable.pipeThrough(transformer).pipeTo(generator.writable);
};

alvestrand commented 2 years ago

Waiting for an explainer, or possible move to WebCodecs (since it does frame mods).

youennf commented 2 years ago

We should probably work on the abstract attach-metadata-to-video-frame mechanism first; this proposal could then reuse it.
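
For comparison, a rough sketch of what reusing such a mechanism could look like in the worker, assuming detected faces were delivered through the WebCodecs VideoFrame.metadata() accessor under a hypothetical detectedFaces registry entry instead of a dedicated attribute:

// video-worker.js (sketch):
self.onmessage = async ({data: {videoReadable, videoWritable}}) => {
  const transformer = new TransformStream({
    transform(frame, controller) {
      // frame.metadata() is the WebCodecs per-frame metadata accessor;
      // 'detectedFaces' is a hypothetical registry entry, not a shipped key.
      const {detectedFaces} = frame.metadata();
      for (const face of detectedFaces || []) {
        console.log(`Face @ (${face.contour[0].x}, ${face.contour[0].y})`);
      }
      controller.enqueue(frame);
    }
  });
  await videoReadable.pipeThrough(transformer).pipeTo(videoWritable);
};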

riju commented 2 years ago

@alvestrand @youennf : Here's an explainer we have been working on.

youennf commented 2 years ago

The explainer is pretty clear to me. I am not sure what we do with explainers, but I guess it should be reviewed by the WG, and we can discuss at that point whether to merge it. Some comments on the explainer:

  1. I would be tempted to make the API surface as minimal as possible (what is the MVP?) and leave the rest to a dedicated 'future steps' section. For instance, maybe the MVP only needs a faceDetectionMode constraint (not the landmarks/numfaces/contourpoints constraints) with a reduced set of values ("none" and "presence"); a sketch of such a reduced surface follows this list. I am not sure about the difference between presence and contour, for instance, which is somewhat distracting. Is FaceLandmark part of the MVP as well?
  2. The proposal is based on the VideoFrameMetadata construct, which is fine. We should try to finalise this discussion in WebCodecs.
  3. DetectedFace has a required id and required probability. I can see 'id' being useful, maybe probability should be optional.
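
A minimal sketch of the reduced surface suggested in comment 1, assuming only a faceDetectionMode constraint with 'none' and 'presence' values and, per comment 3, an optional probability on DetectedFace:

// main.js (sketch):
const supports = navigator.mediaDevices.getSupportedConstraints();
if (!supports.faceDetectionMode) {
  throw new Error('Face detection is not supported');
}
const stream = await navigator.mediaDevices.getUserMedia({
  video: {faceDetectionMode: 'presence'}
});
const [videoTrack] = stream.getVideoTracks();

// In a worker, per frame:
const transformer = new TransformStream({
  transform(frame, controller) {
    for (const face of frame.detectedFaces || []) {
      // Treat probability as optional, per comment 3.
      const probability = face.probability ?? 1.0;
      console.log(`Face ${face.id} detected (p = ${probability})`);
    }
    controller.enqueue(frame);
  }
});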

alvestrand commented 1 year ago

Assumed to be superseded by #78