google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Understanding the geometry pipeline #1895

Closed Pythonsegmenter closed 3 years ago

Pythonsegmenter commented 3 years ago

I'm having some difficulties understanding the maths behind the geometry pipeline. I therefore tried to write down what problem it tries to solve and how it does this in a stepwise manner. You can find the explanation below. I'm quite certain that there are some errors in my explanation, so all corrections are much appreciated.

There's one thing I can't wrap my head around: why is it that you cannot determine the actual scale of the model? If you know the position of the camera (which is assumed to be known), then the landmarks should be sufficient to give you the actual scale, shouldn't they? Imagine I take a picture of a cube positioned 10 cm from a camera; then I could deduce the size of the cube from the image. Why can't this be done with the face?

Explanation

Problem: A mesh of the face is needed, expressed in metric units. The output of the face landmark estimation is 468 points. The coordinates of these points are expressed as follows:

x-coordinate: pixel column divided by the number of columns (frame width)

y-coordinate: pixel row divided by the number of rows (frame height)

z-coordinate: the network's depth estimate divided by the number of columns (frame width)
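To make the convention concrete, here is a small sketch of mine (assuming landmarks are exposed as objects with .x / .y / .z fields, as in the MediaPipe Python API) that converts them back to pixel units:

```python
import numpy as np

def normalized_to_pixels(landmarks, frame_width, frame_height):
    """Convert normalized landmarks back to pixel units; z stays in the
    same scale as x (i.e. it is multiplied by the frame width)."""
    return np.array([(lm.x * frame_width,    # pixel column
                      lm.y * frame_height,   # pixel row
                      lm.z * frame_width)    # depth, scaled like x
                     for lm in landmarks])
```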

Visible landmarks are detected by a first neural network based on image features (how exactly is neural magic, but for example by seeing that the eye is white around the iris, etc.).

The invisible landmarks and the z-coordinate are estimated by a second neural network. This second network estimates their location by using a standard model of a face (the canonical face model) and aligning it so that, after projection to 2D, the visible landmarks coincide with the landmarks of the standard model. The important words here are ‘after projection’, because a weak perspective projection is used for this projection.

This weak perspective projection is the same as an orthographic projection but with a scaling factor that differs per face, so that faces further away appear smaller. Within a face the factor is constant. This causes a slight error, because in reality the depth within the face matters for the size of anatomical features like the ears, which are further away from the camera than the nose (when looking towards the camera).
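To make the difference concrete, here is a small toy sketch (my own numbers, not MediaPipe code; f is an arbitrary focal length) comparing a full perspective projection, where every point is divided by its own depth, with the weak perspective approximation, where the whole face shares one scale factor derived from its average depth:

```python
import numpy as np

def perspective_project(points, f):
    """Full perspective: each point is divided by its own depth."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * x / z, f * y / z], axis=1)

def weak_perspective_project(points, f):
    """Weak perspective: every point of the object is divided by the
    object's average depth, i.e. one scale factor per face."""
    s = f / points[:, 2].mean()
    return points[:, :2] * s

# Toy example: nose tip vs. an ear on a face roughly 50 cm from the camera.
face = np.array([[0.0, 0.0, 50.0],   # nose tip
                 [8.0, 0.0, 58.0]])  # ear, 8 cm further from the camera
print(perspective_project(face, f=1.0))       # the ear shrinks slightly
print(weak_perspective_project(face, f=1.0))  # the ear keeps the nose's scale
```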

Problem summary: The issue of the landmark model is that the position of the camera is unknown. Because of this it is not possible to perform the ‘correct’ perspective projection. It was therefore chosen to perform the registration (in the second neural network) with a weak perspective model. Changing this weak perspective model to a real perspective model with the help of the camera position is what the geometry pipeline aims to do, as well as adding scale and finding the pose of the face.

Solution:

  1. The screen landmarks are put in the camera reference system: a. A scaling is performed so that the coordinates are resized according to the size they should have in the near clipping plane of the camera. b. A translation is performed so that the (0,0) point is the center of the image.

  2. The screen landmarks are now matched by a rigid transformation (a weighted orthogonal Procrustes problem) to the canonical model; a sketch of this solve is given after this list. As the resulting transformation matrix uses a uniform scale, the norm of the first column of the transformation matrix gives us that uniform scale.

  3. The Z-coordinate is now translated to the camera reference system (by subtracting the average Z-coordinate in the old coordinate system and adding the Z-coordinate of the near clipping plane); afterwards it is rescaled using the uniform scale found in 2, giving an estimate of the Z-scale.

  4. Now that we have found the Z-scale and we know the Z-coordinate of the reference system (we chose it 1 cm from the camera), we can use this to unproject the XY coordinates following the perspective projection formulas found here. By unprojecting the XY coordinates we ‘remove’ the error that was made by using the weak perspective model mentioned in the problem summary. However, the scaling factor found in 2 was just an approximation, as we were using projected XY values at the time. Therefore we estimate the scaling factor again in 5.

  5. Using the unprojected XY values we solve the weighted orthogonal Procrustes problem again, this time giving us a more accurate scaling factor.

  6. The final scale is now found by multiplying the scale found in 2 with the scale found in 5. Note that this is the scale to go from the screen landmarks in the camera reference system (found in 1) to the actual 3D metric scale.

  7. Using the final scale of 6 we unproject the landmarks (x & y as in 1, z as in 3) as in 4 to obtain the landmarks at 3D metric scale.

  8. Finally, to find the pose transformation of the runtime metric landmarks to the canonical model, we solve the weighted orthogonal Procrustes problem (from 2) once more.
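As referenced in step 2, here is a small sketch of the weighted orthogonal Procrustes solve with a uniform scale. This is my own illustration of the standard (Umeyama-style) closed-form solution, assuming both point sets are N x 3 arrays and the weights are one scalar per landmark; MediaPipe's actual C++ solver may differ in details.

```python
import numpy as np

def weighted_procrustes(source, target, weights):
    """Find a uniform scale s, rotation R and translation t minimizing
    sum_i w_i * || s * R @ source_i + t - target_i ||^2."""
    w = weights / weights.sum()
    mu_s = (w[:, None] * source).sum(axis=0)   # weighted centroids
    mu_t = (w[:, None] * target).sum(axis=0)
    src = source - mu_s
    tgt = target - mu_t
    # Weighted cross-covariance; its SVD yields the optimal rotation.
    cov = (w[:, None] * tgt).T @ src
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])                 # guard against reflections
    R = U @ D @ Vt
    # Optimal uniform scale, then translation.
    var_src = (w * (src ** 2).sum(axis=1)).sum()
    s = (S * np.diag(D)).sum() / var_src
    t = mu_t - s * (R @ mu_s)
    return s, R, t
```

In this form the uniform scale comes out as a separate scalar s; when the same solution is packed into a single transformation matrix, that scale shows up as the norm of the first column, as noted in step 2.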

Final remarks: The geometry pipeline corrects the error induced by the weak perspective projection and gives ‘a realistic scale’ to the 3D model. However, ‘a realistic scale’ should be interpreted as ‘up to a constant’: every face will, using this method, end up with dimensions similar to the canonical model.

kostyaby commented 3 years ago

Hey @Pythonsegmenter,

I looked through your comment & put my thoughts below. Given that there were not many direct questions for me to answer in your comment, I'll just write down what I think can be useful to you. Hopefully, that'll be helpful!


Why is it that you cannot determine the actual scale of the model? If you know the position of the camera (which is assumed to be known), then the landmarks should be sufficient to give you the actual scale, shouldn't they? Imagine I take a picture of a cube positioned 10 cm from a camera; then I could deduce the size of the cube from the image. Why can't this be done with the face?

The problem with perspective projection is that the object size and the object distance away from the camera are interchangeable, so if you don't know anything about either the size or the distance, then there's an infinite number of options that can give you fixed XY projected coordinates. Imagine two cubes: one of size 2cm that's 10cm away from the camera, another of size 4cm that's 20cm away from the camera. Those two cubes will be projected exactly the same onto the camera plane. Of course, faces are not cubes, in the sense that it's impossible for a human face to be 100m long: there's always some distribution of human face sizes that has some reasonable lower / upper bounds. In the Face Geometry module, we assume that we know the size (being the size of the canonical face model, which is hopefully somewhere close to the average of that human face size distribution) and from that we derive the distance.
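A toy sketch of that ambiguity (my own numbers, a simple pinhole model): the front corners of a 2cm cube 10cm away and of a 4cm cube 20cm away land on exactly the same image coordinates.

```python
import numpy as np

def project(points, f=500.0):
    """Pinhole projection: (X, Y, Z) -> (f * X / Z, f * Y / Z)."""
    return np.stack([f * points[:, 0] / points[:, 2],
                     f * points[:, 1] / points[:, 2]], axis=1)

# Front-face corners of a 2cm cube at 10cm and of a 4cm cube at 20cm.
small_cube = np.array([[x, y, 10.0] for x in (-1, 1) for y in (-1, 1)])
large_cube = np.array([[x, y, 20.0] for x in (-2, 2) for y in (-2, 2)])

print(np.allclose(project(small_cube), project(large_cube)))  # True
```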

Problem summary: The issue of the landmark model is that the position of the camera is unknown. Because of this it is not possible to perform the ‘correct’ perspective projection. It was therefore chosen to perform the registration (in the second neural network) with a weak perspective model. Changing this weak perspective model to a real perspective model with the help of the camera position is what the geometry pipeline aims to do, as well as adding scale and finding the pose of the face.

Conceptually, the Face Landmark NN is using a weak perspective model because it never sees the whole picture taken by a camera. In the face tracking pipeline, we first run the Face Detector NN that finds face bounding boxes (large faces yield large bboxes, small faces yield small bboxes), then we "extract" that region from the frame into a fixed-size square image (256x256 or 128x128, depending on the model), and that's what we feed into the Face Landmark NN. So the Face Landmark NN never actually has to see faces of different sizes; all it sees are faces of pretty much the same size (courtesy of the Face Detector NN that locates them first). The goal of the Face Landmark NN is to robustly detect local XYZ coordinates. Of course, there's no such thing as a local perspective Z coordinate, so we have to go with the weak perspective Z coordinate (which, given that the Face Landmark NN always runs on faces of the same size, is really more of a local orthographic projection Z coordinate). I'm pretty sure that training & running one big NN on the entire image to detect the correct perspective projection XYZ face landmark coordinates is just plain slow and would never be real-time on anything other than an NVIDIA graphics card, so by decomposing the E2E task into the Face Detector NN + Face Landmark NN stages we could make it real-time on fairly low-end hardware. You still get nicely working XY coordinate prediction, but I guess the price you are paying is that you have a "weak perspective" projection instead of real metric XYZ coordinates.

Now, let's say you still want to attach virtual objects / draw facepaint to detected faces to support the fun AR effects use-case - that's where the Face Geometry module kicks in. It takes face landmarks in the format they arrive from Face Landmark pipeline and tries to establish a metric 3D space + place the landmarks into that space in a plausible way. It doesn't really have a goal to provide exact metric 3D coordinates - although it definitely does provide some approximation of the metric 3D coordinates to enable somewhat believable AR experiences

sgowroji commented 3 years ago

Hi @Pythonsegmenter, Could you please respond to the above comment? Is your query resolved? Thanks!

sgowroji commented 3 years ago

Hi @Pythonsegmenter, Did you get a chance to go through the above comment? Thanks!

sgowroji commented 3 years ago

Assuming the above query is addressed, we are closing this issue now.

google-ml-butler[bot] commented 3 years ago

Are you satisfied with the resolution of your issue?

matanox commented 2 years ago

This has been the only real discussion I've seen about why weak perspective is used in mediapipe. @kostyaby I wonder if you could elaborate some more to elucidate how the model goes from the cropped face region plus the average face size value (which you have from some data distribution, or which best enables average accuracy across the test data), all the way to the weak perspective values that come out at the end of the pipeline:

My first guess would be that the Face Landmark NN predicts the (x, y, z) values for its cropped image region input, and then I'm not sure how it proceeds from there, other than an average z-depth being calculated in those local terms. Is the z value of each predicted mesh point then shifted to the computed average z-depth, as in weak projection, which would mean that the final projection to the pipeline output coordinates just assumes the face is completely locally flat?

I think it would also greatly help progress towards full clarity on the same aspect in the case of the mediapipe hands model. In the hands model, my first guess would be that the predicted z values also go on to become part of the separate "world landmarks" output set (which the face model seems to forgo), but I'm not too sure about that.

kostyaby commented 2 years ago

Hey @matanster,

Do you mean the Face Mesh module output or the Face Geometry module output?

The Face Mesh module outputs XY coordinates that match screen space + a Z coordinate which has the same scale as the X coordinate and is centered around 0 (avg Z ~ 0). The "weak perspective" term is used because of the following assumption: the change in Z coordinate within a single object (a face) is not significant relative to its distance from the camera.
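A quick way to check this convention empirically, sketched with the legacy Python solutions API (mediapipe.solutions.face_mesh); the face.jpg input is just a placeholder:

```python
import cv2
import mediapipe as mp
import numpy as np

image = cv2.imread("face.jpg")  # placeholder input image with one face
with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                     max_num_faces=1) as face_mesh:
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    z = np.array([p.z for p in results.multi_face_landmarks[0].landmark])
    print("avg z:", z.mean())               # close to 0 by construction
    print("z spread:", z.max() - z.min())   # same units as normalized x
```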

The Face Geometry module takes the Face Mesh module output, applies some heuristics, and outputs coordinates that satisfy a classic perspective camera model.

I think it would also greatly help progress towards full clarity on the same aspect (https://github.com/google/mediapipe/issues/742#issuecomment-639104199) in the case of the mediapipe hands model. In the hands model, my first guess would be that the predicted z values also go on to become part of the separate "world landmarks" output set (which the face model seems to forgo), but I'm not too sure about that.

The screen Z coordinate of the hand landmark model has the same nature as for the face landmark model (Z scaled as X, centered around 0). The "world landmarks" output is tricky as it's generally not guaranteed to be projectable back into the screen coordinates - that was also confirmed during experiments I ran internally. Those landmarks are good for deriving some semantics (like hand gesture detection, something like that) - however, I don't find them particularly useful for AR and/or other situations where you need to find correspondence between 2D screen coords and 3D virtual space coords.
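As a sketch of that kind of semantic use (the pinch definition and the 3cm threshold are arbitrary choices of mine; only the landmark indices follow the documented MediaPipe Hands topology):

```python
import numpy as np

THUMB_TIP, INDEX_TIP = 4, 8  # MediaPipe Hands landmark indices

def is_pinching(world_landmarks, threshold_m=0.03):
    """Return True if the thumb tip and index fingertip of a hand's world
    landmarks (roughly metric, centered on the hand) are within ~3 cm."""
    pts = np.array([[p.x, p.y, p.z] for p in world_landmarks])
    return np.linalg.norm(pts[THUMB_TIP] - pts[INDEX_TIP]) < threshold_m
```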


Not sure this answers your questions. Feel free to ask more clarifying questions and/or use concrete examples

matanox commented 2 years ago

Thank you @kostyaby, your earlier comments helped in connecting some obvious dots (pun not intended). I am still not sure where the weak projection kicks in, so perhaps you could comment on it.

the goal of the Face Landmark NN is to robustly detect local XYZ coordinates. Of course, there's no such thing as a local perspective Z coordinate, so we have to go with the weak perspective Z coordinate (which, given that the Face Landmark NN always runs on faces of the same size, is really more of a local orthographic projection Z coordinate)

I guess that we have a flat (no z information) image as the input to the landmarks NN, and we want to project the locally predicted landmarks back to the overall input image, of which that image is just a sub-box, so that they fall in their x and y positions right where they are seen on the input image. And then, if the landmarks NN provides its outputs as local 3D coordinates, we should keep the predicted z values from taking part in that projection, as they would skew the resulting XY positions if we used them.

If that were the case however, why should the weak orthographic projection use a real z-average of the local z values rather than just making a plain XY scaled projection to translate back to the original image plane?


I'm not totally sure whether the z values of the world landmarks are just those local z values from the local prediction of the landmarks NN (hand model and face model alike), or whether they are further modified before coming out on the world landmarks output set. For example, maybe they are adjusted according to the placement of the objects in the camera field of view. I am not 100% sure from experimentation, because I haven't made arrangements for moving my hand over precise traces and angles, and also the predictions are context-less (frame by frame) and hence a little noisy, in ways that make exploring such matters less definitive.

kostyaby commented 2 years ago

I guess that we have a flat (no z information) image as the input to the landmarks NN, and we want to project the locally predicted landmarks back to the overall input image, of which that image is just a sub-box, so that they fall in their x and y positions right where they are seen on the input image. And then, if the landmarks NN provides its outputs as local 3D coordinates, we should keep the predicted z values from taking part in that projection, as they would skew the resulting XY positions if we used them.

This process is happening as a part of the Face Landmark module; the local XYZ coordinates predicted in a "sub-box" are just flatly rotated / scaled / translated back into the overall input image "box". When I say "flatly" I mean the 2D "sub-box" is stretched back into the overall image 2D "box". XY coordinates will match the intended locations; the only thing that happens to the Z coordinate at this step is that it is scaled by the same factor as the X coordinate was scaled during the "sub-box to overall box" stretching procedure. Code refs: 1, 2.
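A rough Python paraphrase of what those referenced calculators do (my own restatement rather than the actual code; the rect_* arguments stand for the normalized ROI rectangle of the face crop):

```python
import math

def project_landmark(lm_x, lm_y, lm_z, rect_x_center, rect_y_center,
                     rect_width, rect_height, rect_rotation):
    """Map a landmark from the normalized face 'sub-box' back into the
    normalized coordinates of the whole input image."""
    # Rotate around the crop center, then stretch and translate flatly.
    x, y = lm_x - 0.5, lm_y - 0.5
    cos_r, sin_r = math.cos(rect_rotation), math.sin(rect_rotation)
    rx = cos_r * x - sin_r * y
    ry = sin_r * x + cos_r * y
    out_x = rx * rect_width + rect_x_center
    out_y = ry * rect_height + rect_y_center
    # Z is only rescaled, with the same factor as X (the crop width).
    out_z = lm_z * rect_width
    return out_x, out_y, out_z
```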

If that were the case however, why should the weak orthographic projection use a real z-average of the local z values rather than just making a plain XY scaled projection to translate back to the original image plane?

I'll admit I'm getting confused with all these types of projections you're mentioning in your comments; it makes me wonder if I was using the right one to describe what's going on with our MediaPipe math 😅 If you could expand on what both types of projections mean to you in terms of their logic, I should be able to tell you whether we have one, the other, or something in between.

When I say "weak perspective is used", what I mean is that we care that the XY coordinates match the intended locations + the Z coordinate (A) has an avg value of ~0 and (B) is scaled the same way as X. Maybe it's not the best wording, so I'm open to recommendations regarding what would be a better way to put it concisely 🙂

I'm not totally sure whether the z values of the world landmarks are just those local z values from the local prediction of the landmarks NN (hand model and face model alike), or whether they are further modified before coming out on the world landmarks output set. For example, maybe they are adjusted according to the placement of the objects in the camera field of view. I am not 100% sure from experimentation, because I haven't made arrangements for moving my hand over precise traces and angles, and also the predictions are context-less (frame by frame) and hence a little noisy, in ways that make exploring such matters less definitive.

Our ML engineers claim that those "world landmarks" are agnostic to the placement in the camera FOV + the camera intrinsics, but they capture the objects' rotation (minus camera extrinsics). This makes it hard to establish a mapping between 3D world coordinates and 2D screen coordinates (I recently saw that for the Pose Landmark NN, world and screen coordinates are generally not alignable, i.e. the PnP algo fitting delta between the two sets is quite high). However, I had more luck extracting semantic signals out of world coordinates (like detecting hand gestures using world landmarks produced by the Hand Landmark NN). As always with CV applications, algos/models are rarely perfect, so I recommend giving them a try to see if it can be useful for your application - it's quite hard to tell beforehand.
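For reference, a small sketch of that kind of alignment check using OpenCV's solvePnP (my own, not from MediaPipe; it assumes the screen landmarks were already converted to pixels and that a camera intrinsics matrix is available):

```python
import cv2
import numpy as np

def reprojection_error(world_pts, screen_pts_px, camera_matrix):
    """Fit a rigid pose with PnP, then report how well the (roughly metric)
    world landmarks reproject onto the observed screen landmarks, in pixels."""
    ok, rvec, tvec = cv2.solvePnP(world_pts.astype(np.float64),
                                  screen_pts_px.astype(np.float64),
                                  camera_matrix, None,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    projected, _ = cv2.projectPoints(world_pts.astype(np.float64),
                                     rvec, tvec, camera_matrix, None)
    return float(np.linalg.norm(projected.reshape(-1, 2) - screen_pts_px,
                                axis=1).mean())
```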

matanox commented 1 year ago

It looks like the Z values can be used to approximate distance when the hand rotation is stable (if you're good enough at extrapolating a good function from the raw signal scale), but they become highly unstable when the hand rotation changes. So it looks like it would be good to redo that part of the model in future incarnations of the pipeline (if an average hand size approach has any prospects at all), as the current Z value seems useless enough to consider either that or, more wisely, dropping it from the pipeline output altogether.

matanox commented 1 year ago

The mentioned code piece from before seems to have survived without changes to its calculation. The same logic is more or less replicated here.

That piece seems to only undo the effect of the bounding box provided to the landmarks prediction model being, in general, not aligned with the image axes, by applying a rotation in the image plane by the bounding box's angle.

So, assuming that's right, I am looking to find where the previously mentioned extra handling, that best-effort heuristic for yielding a reasonably useful/stable projection, actually happens in the code:

Now, let's say you still want to attach virtual objects / draw facepaint to detected faces to support the fun AR effects use-case - that's where the Face Geometry module kicks in. It takes face landmarks in the format they arrive from Face Landmark pipeline and tries to establish a metric 3D space + place the landmarks into that space in a plausible way. It doesn't really have a goal to provide exact metric 3D coordinates - although it definitely does provide some approximation of the metric 3D coordinates to enable somewhat believable AR experiences

While I'm currently interested in the hands pipeline, I assume that the geometry logic applies similarly in both pipelines, or at least that it's good to understand one before the other, while thinking about the various pathological world-landmark prediction sets which I'm trying to analyze.