google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0
27.72k stars · 5.18k forks

How to convert normalized faceLandmarks to metric space [Web] #4756

Open Bersaelor opened 1 year ago

Bersaelor commented 1 year ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

Yes

OS Platform and Distribution

Web/Browser

MediaPipe Tasks SDK version

"@mediapipe/tasks-vision": "^0.10.4"

Task name (e.g. Image classification, Gesture recognition etc.)

Face landmark detection

Programming Language and version (e.g. C++, Python, Java)

Javascript/Typescript 5.2.2, Three.js 0.155.1

Describe the actual behavior

The faceLandmarks: NormalizedLandmark[][] have x, y, z in normalized image space. Multiplying the coordinates with facialTransformationMatrixes does not yield an array of vertices in metric 3D space.

Describe the expected behaviour

In order to show the 3D data similar to the face geometry effect renderer I need to get the face-mesh coordinates in metric 3D space.

Standalone code/steps you may have used to try to get what you need

When calling

    this.faceLandmarker = await FaceLandmarker.createFromOptions(filesetResolver, {
      baseOptions: {
        modelAssetPath: `https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task`,
        delegate: "GPU"
      },
      outputFaceBlendshapes: false,
      runningMode: 'VIDEO',
      numFaces: 1,
      outputFacialTransformationMatrixes: true,
    });

and

      const results = this.faceLandmarker.detectForVideo(this.video, startTimeMs);

to detect the facial landmarks, we can get the results.facialTransformationMatrixes[0] to transform the face landmarks from a canonical face model to the detected face.

Now, in addition to getting the face pose via facialTransformationMatrixes, in order to apply effects similar to what the effect renderer does here, I need to convert the faceLandmarks: NormalizedLandmark[] to metric 3D space. All my 3D setup is done in Three.js, so I can create the face mesh in Three.js too, using bufferGeometry: THREE.BufferGeometry.
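For reference, applying one of those matrices to a point is a plain 4×4 multiply. This is only a sketch; it assumes the 16 matrix values are laid out column-major (the order Three.js `Matrix4.fromArray` consumes), and the `transformPoint` helper is hypothetical, not part of the MediaPipe API:

```typescript
// Hypothetical helper: apply a flat 4x4 transform (16 numbers, assumed
// column-major as in Three.js Matrix4.fromArray) to a 3D point.
function transformPoint(
  m: number[],
  p: [number, number, number]
): [number, number, number] {
  const [x, y, z] = p;
  return [
    m[0] * x + m[4] * y + m[8] * z + m[12],
    m[1] * x + m[5] * y + m[9] * z + m[13],
    m[2] * x + m[6] * y + m[10] * z + m[14],
  ];
}

// Example: a pure translation by (1, 2, 3) moves the origin to (1, 2, 3).
const translate = [
  1, 0, 0, 0,
  0, 1, 0, 0,
  0, 0, 1, 0,
  1, 2, 3, 1,
];
const moved = transformPoint(translate, [0, 0, 0]); // → [1, 2, 3]
```

With Three.js you would instead do `new THREE.Matrix4().fromArray(matrix.data)` and `point.applyMatrix4(...)`, but the arithmetic is the same.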

Now, when I try the plain:

const data = landmarks.flatMap((landmark) => [landmark.x, landmark.y, landmark.z]);

const positionArray = new Float32Array(data);
const positionAttribute = new THREE.BufferAttribute(positionArray, 3);

I can even see a face mesh, but its dimensions are still in image space, not in the same metric space that facialTransformationMatrixes is in.

Now I tried multiplying the coordinates with facialTransformationMatrixes or its inverse, but I don't think that can be expected to lead to the right solution: the matrix is meant to convert a canonical face model to metric space, not normalized screen image coordinates to metric space. I somehow need the predicted 3D points in that predicted metric space.

Other info / Complete Logs

I also found some references to similar attempts here and a solution by @sureshdagooglecom

Mediapipe's landmark values are normalized by the width and height of the image. After getting the landmark values, simply multiply the x of the landmark by the width of your image and the y of the landmark by the height of your image.

but I think that thread is about the old legacy solution, not the Tasks solution from 2023.
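Spelled out as code, the quoted suggestion might look like the following. Note the quote only covers x and y; scaling z by the image width too is my own assumption (z is roughly on the same scale as x in the landmark output):

```typescript
// Sketch of the quoted suggestion: scale normalized landmarks by the image size.
// Scaling z by the width is an assumption, not part of the quoted advice.
function denormalize(
  landmarks: { x: number; y: number; z: number }[],
  width: number,
  height: number
): { x: number; y: number; z: number }[] {
  return landmarks.map((l) => ({
    x: l.x * width,   // normalized [0, 1] → pixels
    y: l.y * height,
    z: l.z * width,   // assumed: z uses roughly the same scale as x
  }));
}

const px = denormalize([{ x: 0.5, y: 0.5, z: -0.1 }], 640, 480);
// center of a 640x480 frame → x: 320, y: 240
```

This gives pixel coordinates, not metric ones, which is exactly the gap this issue is about.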

Bersaelor commented 1 year ago

I think in this line in effect_renderer_calculator.cc something similar is done; I'm just trying to wrap my head around the C++ code and how I could port the same coordinate transformation to JavaScript.

schmidt-sebastian commented 1 year ago

The engineer behind our blendshape demo suggests multiplying the normalized keypoints by the height and width of the media input, as the 3D landmarks and the transformation matrix do not go together well. As this is not my area of expertise I unfortunately can't add any more details :(

Bersaelor commented 1 year ago

The engineer behind our blendshape demo suggests multiplying the normalized keypoints by the height and width of the media input, as the 3D landmarks and the transformation matrix do not go together well. As this is not my area of expertise I unfortunately can't add any more details :(

Yes, definitely. The transformation matrix converts from canonical face space to the 3D projected metric space. The normalized image coordinates are in image space, which has nothing to do with the canonical space, so the transformation matrix doesn't really apply to the landmarks in any way.

I have a separate implementation which multiplies the normalized landmarks by the image width and height (plus a bunch of perspective corrections), and it doesn't look bad. That approach doesn't use the transformation matrix at all. The catch is that if you go down that route, you also need the face position relative to the face mesh, and you have to calculate that manually (with mixed results).
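For anyone wanting to experiment with something similar, a rough version of such a correction could look like this. This is not Bersaelor's actual code; the vertical FOV, the flipped y axis, and the fixed face depth are all assumptions:

```typescript
// Rough sketch: unproject a normalized landmark into camera space at an
// assumed depth, undoing the pinhole projection. Assumptions: a vertical
// FOV of 63 degrees, y flipped so +y is up, camera looking down -z
// (Three.js convention), and a fixed distance of the face from the camera.
function unproject(
  l: { x: number; y: number; z: number },
  aspect: number,  // width / height of the video frame
  depthCm: number  // assumed distance of the face from the camera, in cm
): [number, number, number] {
  const fovRad = (63 * Math.PI) / 180;
  const halfH = Math.tan(fovRad / 2) * depthCm; // half the visible height at that depth
  const halfW = halfH * aspect;
  return [
    (l.x * 2 - 1) * halfW, // normalized [0, 1] → NDC [-1, 1] → cm
    (1 - l.y * 2) * halfH, // flip y so +y is up
    -depthCm,              // everything placed at the assumed depth
  ];
}
```

A landmark at the image center, e.g. `unproject({ x: 0.5, y: 0.5, z: 0 }, 4 / 3, 50)`, lands on the optical axis at the assumed depth. Estimating that depth per frame (e.g. from inter-pupil distance) is exactly the manual, mixed-results part described above.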

Since the documentation and code suggest using the faceTransform to convert to metric space, it would be great if we could actually use it for that, though.

The solution that doesn't use the transformation-matrix looks alright when my head is in the middle:

From the front:

Screenshot 2023-09-06 at 20 20 27

With the camera rotated so you can see the face mesh from the top (to confirm it's actually a 3D mesh and not just a 2D drawing):

Screenshot 2023-09-06 at 20 22 01

Unfortunately my quick scaling+perspective correction approach doesn't work as well when the face is not in the center:

From the front:

Screenshot 2023-09-06 at 20 23 48

Same from the top (the perspective works, but the mask isn't really nice and symmetric in itself):

Screenshot 2023-09-06 at 20 24 19

kuaashish commented 1 year ago

Hello @yeemachine,

Could you please help out here? Thank you.

Bersaelor commented 1 year ago

@yeemachine @kuaashish

Any updates on this?

It would be really amazing if we could use the face mesh in 3D contexts; that would enable so many AR features, like drawing makeup on the mesh or letting 3D shadows fall on it.

You can ignore my attempts at calculating a 3D mesh out of the normalized landmarks; they were just based on some random ideas.

@schmidt-sebastian does the engineer you mentioned happen to have any links to demo projects or code where the normalized landmarks are used in a 3D setting?

tobyclh commented 1 year ago

@Bersaelor I just happened to see this and thought I would share my 2 cents. Metric space refers to the initial coordinate system where the base face mesh 3D model was defined, not the 3D coordinate system being projected onto your camera ("real world"). This matrix is useful for different purposes besides projecting your own model onto the picture. But I do agree that the term "metric space" would benefit from more description.

yeemachine commented 1 year ago

@Bersaelor Unfortunately, the 3D mesh only gives normalized coordinates. So when constructing a mask using those landmarks, there really isn't any z space being applied; it is just the scale changing, which makes it harder to use in circumstances where you want environmental lighting and need the object positioned in "metric space".

The transformation matrix simulates depth and world positioning at a single point, which is really meant for pinning objects to the face in 3D. It could potentially be used in combination with the normalized landmarks as an offset to simulate depth, but since the matrix is an average of the landmarks, it might not be the most accurate approach to adding depth to your scene.
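To illustrate that single-point "pinning" use: the face anchor position can be read straight from the matrix's translation components. This assumes a column-major layout (what Three.js `Matrix4.fromArray` consumes), and the numbers below are made up for illustration:

```typescript
// Read the translation (face anchor position) out of a flat 4x4 matrix,
// assuming column-major layout as Three.js Matrix4 uses.
function translationOf(m: number[]): [number, number, number] {
  return [m[12], m[13], m[14]];
}

// Made-up example matrix: identity rotation, face ~30 cm in front of the camera.
const example = [
  1, 0, 0, 0,
  0, 1, 0, 0,
  0, 0, 1, 0,
  2, 4, -30, 1,
];
const anchor = translationOf(example); // → [2, 4, -30]
```

In Three.js you would typically apply the whole matrix to an `Object3D` via `matrix.fromArray(...)` plus `matrix.decompose(...)` rather than extracting components by hand.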

stellargo commented 1 year ago

@tobyclh "Metric space refers to the initial coordinate system where the base face mesh 3D model was defined" -- Is there a way to convert the face landmarks from mediapipe to this metric space? I don't want the real 3D coordinate system of the "real world", I just want to get rid of the effects from weak perspective projection. I would be grateful for any pointers!

stellargo commented 1 year ago

Would it work if I multiplied/divided all coordinates of the mesh by their respective z coordinates? I feel it could undo the effects.
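A quick way to experiment with that idea, purely as a sketch: treat each landmark's z as a relative depth offset and rescale x/y around the image center accordingly. The centering and the `1 + z` depth factor are guesses, not a derived camera model:

```typescript
// Sketch of the idea above: rescale x/y by (1 + z) to approximately undo
// a weak perspective projection. All factors here are assumptions.
function undoWeakPerspective(
  landmarks: { x: number; y: number; z: number }[]
): { x: number; y: number; z: number }[] {
  return landmarks.map((l) => {
    const depth = 1 + l.z; // z is small and near 0 around the face center
    return {
      x: (l.x - 0.5) * depth + 0.5, // scale around the image center
      y: (l.y - 0.5) * depth + 0.5,
      z: l.z,
    };
  });
}
```

A landmark at the image center is unaffected; points further off-center move in proportion to their depth. Whether this actually cancels the projection depends on the camera intrinsics, which the normalized output doesn't carry.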

remmel commented 10 months ago

FYI, I have a similar issue: https://github.com/Rassibassi/mediapipeDemos/issues/20#issue-1535490010 (deformed face when my face is positioned to the side). If I remember correctly, to simplify, that Python code scales and positions the face in 3D world coordinates using average face size/landmarks (procrustes_landmark_basis).

sachinksachu commented 3 months ago

I am also trying to figure out the conversion from MediaPipe face landmarks to world space. Does anyone have a solution for this?