Add face detection constraints and VideoFrameMetadata members

ttoivone commented 1 year ago

This PR supersedes the previous PRs related to face detection (#57, #48). It adds the constraints (and related settings and capabilities) and extends the recently introduced VideoFrameMetadata with descriptions of the faces in each frame.

The feedback has been taken into consideration, and the API has been simplified by removing most of the previously proposed constraints. The mesh-based facial description has also been removed. Only those judged to be essential for good performance are left. An exception is face landmarks, which are already supported by some platforms and could therefore be immediately useful. Furthermore, the term HumanFace is used instead of the more generic DetectedFace to anticipate future extensions of VideoFrameMetadata.

The PR consists of two commits. The first updates the explainer and the second updates the spec.

ttoivone commented 1 year ago

Raised an issue on Mozilla's standards positions.

ttoivone commented 1 year ago

+@eehakkin +@riju

chrisn commented 1 year ago

Thanks! I'm responding after seeing your post at https://github.com/w3c/webcodecs/issues/607, and speaking as a BBC contributor, with my Media WG chair hat removed.

We need to be sure that there are no ethical issues in exposing this to the web. I raised these concerns at the TPAC 2021 breakout meeting: https://www.w3.org/2021/10/20-webrtc-ic-minutes.html

It's good that detecting facial expressions is stated as a non-goal, but I'd recommend going further to say it "must not" rather than "does not need to". Misdetection is a concern, as mentioned in the explainer, but also, there are privacy implications of exposing inferred emotions, at least without strong user consent.

As such I'd want to see this proposal go through wide review, including Privacy and TAG.

ttoivone commented 1 year ago

> I'd recommend going further to say it "must not" rather than "does not need to". Misdetection is a concern, as mentioned in the explainer, but also, there are privacy implications of exposing inferred emotions, at least without strong user consent.

I changed the wording in the explainer as you suggested, and it will be updated in the next PR. However, while I have no problem updating the wording, I don't personally see this as an issue with the proposed API. Misdetection is an issue, but by not offering detection in the Web API we just make people run their own custom detection algorithms, which hardly improves the situation. I don't see any privacy issues here -- the metadata is inferred from the same frame to which it is attached, so it gives whoever gets the frame no new information that the original frame alone wouldn't already contain. Privacy issues would exist only if the metadata were delivered to the user without the related video frame, but that is not done by the proposed API or by other Web APIs.

ttoivone commented 1 year ago

Changes in 5f8b11b:

Removed contour; replaced it with a bounding box for faces and a center point for landmarks
Replaced sequence of landmarks with members in HumanFace dict
Removed faceDetectionMaxContourPoints -- contour is now gone
Added a separate constraint to control landmark detection
Removed "nose" landmark to simplify -- not supported by Android/ChromeOS HAL3
Emphasized that facial expressions must be non-goals for the spec
Removed nullability from all members
Specified id more accurately
Once again proofread everything
More complete acknowledgements in the explainer (send me a note if you're missing from the list)
Updated examples
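
To make the shape described in this change list easier to picture, here is a minimal TypeScript sketch. It assumes a HumanFace dictionary with an id, a bounding box, and one optional member per landmark, each carrying a center point. The member names (`boundingBox`, `leftEye`, `rightEye`, `mouth`, `centerPoint`) and the helper `largestFace` are hypothetical and are not taken from the spec text.

```typescript
// Hypothetical shapes inferred from the change list; not the spec's IDL.
interface Point2D { x: number; y: number; }
interface Rect { x: number; y: number; width: number; height: number; }
interface Landmark { centerPoint: Point2D; }

interface HumanFace {
  id: number;          // identifier of the detected face
  boundingBox: Rect;   // stands in for the removed contour
  leftEye?: Landmark;  // landmarks are members, not a sequence
  rightEye?: Landmark;
  mouth?: Landmark;
}

// Return the face with the largest bounding-box area, or undefined if none.
function largestFace(faces: HumanFace[]): HumanFace | undefined {
  let best: HumanFace | undefined;
  let bestArea = -1;
  for (const face of faces) {
    const area = face.boundingBox.width * face.boundingBox.height;
    if (area > bestArea) {
      best = face;
      bestArea = area;
    }
  }
  return best;
}
```

A consumer could use a helper like this to pick a primary face for framing or exposure decisions; members are optional rather than nullable, matching the "removed nullability" item.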

ttoivone commented 1 year ago

@youennf @jan-ivar Requesting review. I couldn't add reviewers myself for some reason.

aboba commented 1 year ago

Is the PR ready for CfC?

chrisn commented 1 year ago

Thanks @ttoivone for updating the explainer, looks OK from my point of view.

ttoivone commented 1 year ago

> Is the PR ready for CfC?

We are still waiting for review comments from the WebCodecs team (Dan Sanders/Dale Curtis).

ttoivone commented 1 year ago

Changes in e2ec3d6:

Split detection mode enum into two: HumanFaceDetectionMode and HumanFaceLandmarkDetectionMode according to feedback
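
The split can be pictured as two independent string enums feeding two separate constraints. In the sketch below, the enum values (`"none"`, `"bounding-box"`, `"center-point"`) and the constraint keys are assumptions chosen to match the wording used in this thread, and `buildFaceConstraints` is a hypothetical helper. The rule that landmark detection is switched off when face detection is off is also an assumption made for illustration.

```typescript
// Assumed enum values; the thread names the enums but not their members.
type HumanFaceDetectionMode = "none" | "bounding-box";
type HumanFaceLandmarkDetectionMode = "none" | "center-point";

interface FaceConstraints {
  humanFaceDetectionMode: HumanFaceDetectionMode;
  humanFaceLandmarkDetectionMode: HumanFaceLandmarkDetectionMode;
}

// Build a constraint set; landmarks are forced off when faces are off
// (an illustrative policy, not something the thread specifies).
function buildFaceConstraints(
  faces: HumanFaceDetectionMode,
  landmarks: HumanFaceLandmarkDetectionMode = "none",
): FaceConstraints {
  return {
    humanFaceDetectionMode: faces,
    humanFaceLandmarkDetectionMode: faces === "none" ? "none" : landmarks,
  };
}
```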

Feedback from WebCodecs (Dale Curtis) was positive: "Structure looks good to me for VideoFrameMetadata. I defer to @youennf around correctness issues for what metadata should be there." @jan-ivar was inadvertently removed from the reviewer list and I still can't add reviewers myself, sorry.

@youennf @jan-ivar: Please let us know if further updates to the PR are needed before CfC, thanks.

ttoivone commented 1 year ago

@dontcallmedom

Updated the PR. All previous comments should now have been addressed, either by changing the PR or otherwise. Asking reviewers to check whether this version can be merged or whether more changes are needed.

@jan-ivar @alvestrand @chrisn @martinthomson @youennf

In particular, after the CfC, three objections were made:

Segmentation metadata #79
Scope of Applicability #84
Variance of Results #85

As per the Feb 21 meeting, the proposal was to mark issues 84 and 85 as non-blocking. This updated PR should now address issue 79, which was a blocker.

Asking @aboba to mark issues 84 and 85 as non-blockers and to check whether this PR now unblocks issue 79.

youennf commented 1 year ago

Given @ttoivone's comment, I think we should review the PR at the next editors' meeting.

ttoivone commented 1 year ago

Changes in the latest update of the PR:

Removed a constraint, only one remaining
partOf is no longer required
Explicitly specify that cloned MediaStreamTracks should have the same ids
Many typo fixes and text refactoring
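
As a rough illustration of what an optional partOf means for consumers, the TypeScript sketch below groups detected entries under a parent id. It assumes that partOf holds the id of a containing entry (for example, an eye that is part of a face), which this thread does not spell out. The `DetectedPart` shape, its `kind` labels, and the `groupByParent` helper are all hypothetical.

```typescript
// Hypothetical entry shape; `partOf` is assumed to hold the parent's id.
interface DetectedPart {
  id: number;
  kind: string;    // illustrative label such as "face" or "eye"
  partOf?: number; // optional, since it is no longer required
}

interface Grouped {
  roots: DetectedPart[];
  children: Map<number, DetectedPart[]>;
}

// Group entries under their parent id; entries without partOf become roots.
function groupByParent(parts: DetectedPart[]): Grouped {
  const roots: DetectedPart[] = [];
  const children = new Map<number, DetectedPart[]>();
  for (const part of parts) {
    if (part.partOf === undefined) {
      roots.push(part);
      continue;
    }
    const siblings = children.get(part.partOf) ?? [];
    siblings.push(part);
    children.set(part.partOf, siblings);
  }
  return { roots, children };
}
```

Because partOf may be absent, code written this way treats a missing value as "top-level" instead of failing.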

jan-ivar commented 1 year ago

Editors agreed to merge with the change above.

w3c / mediacapture-extensions

Add face detection constraints and VideoFrameMetadata members #78