Hey @Pythonsegmenter,
You are right that the FaceGeometry module is currently available only in C++, not in Python. @jiuqiant might be able to give you a better sense of the timeline there
As for your use-case, I sense that the precision of Face Geometry (in terms of metric units) might not be good enough for you
We have a canonical 3D face model, and during runtime we make sure that the scale of the detected face is set to the scale of the canonical face model. With perspective cameras, there's a non-trivial challenge in distinguishing between a face's size and its distance from the camera: in terms of screen coordinates, a face could seem large either because it is large on its own OR because it is closer to the camera. In a way, we sacrifice the scale (by making it a constant) for the sake of better translation estimation (so that two faces of roughly the same size can be positioned correctly with respect to the camera)
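To make that ambiguity concrete, here is a minimal sketch using generic pinhole-camera math; the focal length and sizes are made-up values, not MediaPipe internals:

```python
# Projecting a point (X, Z) onto the image plane with focal length f gives
# u = f * X / Z, so scaling both the face size and its distance by the same
# factor k produces exactly the same screen coordinate.

f = 1000.0           # focal length in pixels (assumed)
X, Z = 0.07, 0.50    # half-width of a face (m) at 0.5 m from the camera (assumed)
k = 2.0              # a face twice as large, twice as far away

u_near = f * X / Z
u_far = f * (k * X) / (k * Z)

print(u_near, u_far)  # identical projections -> size/distance ambiguity
```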
If that's something you are fine with, then please hang on until MediaPipe has a Python API for the FaceGeometry module; otherwise, I'd advise you to seek other approaches that try to differentiate between the face size and the distance from the camera. Those approaches will likely be NN-based, as they'd take other pieces of information about the face into account (for example, how facial features are located relative to each other) to try to make a better guess about its size, beyond just assuming it's some constant for every detected face
Hi There,
Thanks for the elaborate answer. So if I understand correctly, MediaPipe makes every face just as large as the canonical face model. The scale thus remains actually unknown, is that correct?
But if this is the case, what is the use case of the FaceGeometry module? Is it limited to having a slightly more accurate 3D form because it no longer makes use of the weak perspective model?
Hey @Pythonsegmenter,
But if this is the case, what is the use case of the FaceGeometry module? Is it limited to having a slightly more accurate 3D form because it no longer makes use of the weak perspective model?
What the Face Mesh module gives as output are landmarks with XY projected into screen coordinates and a Z coordinate processed in the spirit of a weak perspective camera model. Face Geometry then turns those screen XY + weak-perspective Z (offset so that mean(Z) = 0) coordinates into an approximation of metric XYZ with respect to a perspective projection matrix given as an initialization parameter.
In other words, first there was some real face in real-world 3D metric coordinates, which the Face Mesh module predicted in terms of camera-plane (or screen) projected coordinates - thus losing the 3D information; what the Face Geometry module gives you is a viable un-projection of those projected coordinates predicted by the Face Mesh module. As a result, you now have a virtual 3D space where the detected face landmarks co-exist with virtual objects - say, a pair of virtual glasses or a virtual hat - so you can project that Face Geometry Metric space back into camera-plane (or screen) coordinates once more, only this time with an AR asset perfectly aligned with the detected face.
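As a rough illustration of what "un-projection" means here, below is a minimal pinhole-model sketch; the intrinsics and the per-landmark depth are assumptions, and this is not the Face Geometry module's actual implementation (which derives its own scale by pinning every face to the canonical face model size):

```python
import numpy as np

def unproject(u, v, Z, fx, fy, cx, cy):
    """Screen pixel (u, v) at depth Z -> metric-like (X, Y, Z) under a pinhole model."""
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.array([X, Y, Z])

# Example: a landmark at pixel (700, 400), assumed to be 0.45 m from the camera,
# with a hypothetical 1280x720 camera (fx = fy = 900, principal point centered).
print(unproject(700, 400, 0.45, fx=900, fy=900, cx=640, cy=360))
```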
Additionally, in that Metric space you have some things to play with too. Like I mentioned earlier, all faces are assumed to have the same size - however, based on that assumption we can estimate how far from the camera each face is - so if in the real world there are two faces of similar size located in the camera view, they will be correctly ordered in the Metric space, which can then aid AR rendering as well
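A sketch of how a constant assumed face size turns apparent size into an approximate distance; the face width and focal length below are illustrative assumptions, not MediaPipe constants:

```python
ASSUMED_FACE_WIDTH_M = 0.14   # assumption: every face is ~14 cm wide
FOCAL_LENGTH_PX = 900.0       # assumption: camera focal length in pixels

def approx_distance(face_width_px):
    # pinhole relation: width_px = f * width_m / Z  ->  Z = f * width_m / width_px
    return FOCAL_LENGTH_PX * ASSUMED_FACE_WIDTH_M / face_width_px

# Two detected faces: the one that appears smaller is estimated to be farther away.
print(approx_distance(260.0))  # ~0.48 m
print(approx_distance(130.0))  # ~0.97 m
```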
The scale thus remains actually unknown, is that correct?
One way or another, with a Computer Vision approach you never get the actual value, only some approximation. We only approximate each face's scale as the scale of a constant abstract face, which is probably the easiest way and lacks any sophistication. If you go a bit further, you'll see that human faces in general have some distribution of sizes, so you might look for a learning-based approach that tries to capture that distribution. However, that'd be a challenging task too, as making a camera-invariant, RGB-only face size prediction is generally fairly non-trivial. Apple's ARKit is not RGB-only; they use hardware for accurate depth estimation, which makes their life a bit easier. In general, I think there are probably some good NN-based approaches out there, I just wouldn't expect them to give me the actual face size, only some approximation - probably better than what the Face Geometry module gives you, but very likely worse than approaches based on depth sensors
Thanks for the elaborate answer once again.
Canonical face model
I re-read the documentation and I now realize that the Face Geometry module only does a rigid mapping of the canonical face model. The goal of this is to find the face pose transformation matrix; this matrix is needed to put virtual models in the correct pose. Furthermore, it gives 'a scale' to the model.
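For reference, a rigid mapping of this kind can be sketched with a generic Procrustes/Kabsch-style solve; this is only an illustration of the idea, not the Face Geometry module's actual (weighted) solver:

```python
import numpy as np

def estimate_pose(canonical, detected):
    """canonical, detected: (N, 3) arrays of corresponding 3D points.
    Returns a 4x4 matrix mapping canonical-model space onto the detected landmarks."""
    mu_c, mu_d = canonical.mean(axis=0), detected.mean(axis=0)
    C, D = canonical - mu_c, detected - mu_d

    U, S, Vt = np.linalg.svd(C.T @ D)          # 3x3 cross-covariance SVD
    d = np.ones(3)
    if np.linalg.det(Vt.T @ U.T) < 0:          # guard against reflections
        d[-1] = -1.0
    R = (Vt.T * d) @ U.T                       # rotation (Kabsch)
    scale = (S * d).sum() / (C ** 2).sum()     # uniform scale (Umeyama)
    t = mu_d - scale * R @ mu_c                # translation

    pose = np.eye(4)                           # 4x4 face pose transformation matrix
    pose[:3, :3] = scale * R
    pose[:3, 3] = t
    return pose

# e.g. with matching (468, 3) arrays of canonical vertices and detected landmarks:
# pose = estimate_pose(canonical_vertices, detected_landmarks)
```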
Quick questions: The documentation states that the triangular topology is 'inherited' from the canonical face model. This is not an actual real-time calculation, because the same 468 vertices are always present, so you don't need the canonical face model at runtime to calculate this mesh. Is this correct?
Vertex texture coordinates are inherited from the canonical face model. What's the advantage of this? Why wouldn't you use the metric coordinates of the landmarks? If you use the canonical face model coordinates then you can never have a deforming texture. (As the canonical model is rigid)
Estimating actual size
So I understand that it will be very hard to get a good idea of the size. (Maybe with such an NN-based approach, but this would take me too long.) My current approach would be to get the size by holding up an ID card with known dimensions. I guess this should give me an estimate.
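A minimal sketch of that ID-card calibration idea, assuming the card is held at roughly the same depth as the face and you can measure its width in the same arbitrary units as the landmarks; the helper names are illustrative, not a MediaPipe API:

```python
CARD_WIDTH_MM = 85.6                  # ISO/IEC 7810 ID-1 card width

def mm_per_unit(card_width_in_landmark_units):
    # how many millimetres one landmark unit corresponds to
    return CARD_WIDTH_MM / card_width_in_landmark_units

def to_mm(distance_in_landmark_units, card_width_in_landmark_units):
    return distance_in_landmark_units * mm_per_unit(card_width_in_landmark_units)

# e.g. nose-tip-to-chin distance measured as 0.21 landmark units,
# card width measured as 0.28 of the same units:
print(to_mm(0.21, 0.28))  # ~64 mm
```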
Quick questions: Could I enhance accuracy by taking multiple pictures in multiple poses? Or is MediaPipe really optimized for a frontal picture? I could, for example, imagine that it is a lot easier to estimate the depth of the nose from a side view than from a frontal view.
The documentation states that the triangular topology is 'inherited' from the canonical face model. This is not an actual real-time calculation, because the same 468 vertices are always present, so you don't need the canonical face model at runtime to calculate this mesh. Is this correct?
Yes, correct. Semantically, the same 468 face landmarks are received as an input, so a pre-made triangular topology is used to "connect" them into a mesh
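For example, the Python Face Mesh solution ships a fixed edge list; assuming your MediaPipe version exposes FACEMESH_TESSELATION, connecting any frame's 468 landmarks into a mesh is just an index lookup:

```python
import mediapipe as mp

# The edge list never changes between frames, so no canonical model is needed
# at runtime to build the mesh topology.
edges = mp.solutions.face_mesh.FACEMESH_TESSELATION
print(len(edges))            # fixed number of edges, same for every frame
print(sorted(edges)[:5])     # each edge is a (start_index, end_index) landmark pair
```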
Vertex texture coordinates are inherited from the canonical face model. What's the advantage of this? Why wouldn't you use the metric coordinates of the landmarks? If you use the canonical face model coordinates then you can never have a deforming texture. (As the canonical model is rigid)
Each vertex has 5 coordinates: XYZ are metric coordinates that change frame-to-frame, while UV are texture coordinates that are static. It's a common practice in 3D modeling to have static UV coordinates serve as a medium for transferring material properties (like RGB color, reflectivity, normal maps, ...) from a static texture onto a dynamic 3D model (the dynamism might come from rigging, morph targets or - like in our case - having the entire set of XYZ vertex coordinates estimated by a NN on every frame). You can learn more about UV mapping on Wikipedia
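A toy illustration of why static UVs are convenient (the shapes and data below are made up): the same texture lookup stays valid on every frame, regardless of how the per-frame XYZ coordinates move.

```python
import numpy as np

texture = np.zeros((512, 512, 3), dtype=np.uint8)   # static face texture (H, W, RGB)
uv = np.random.rand(468, 2)                          # static per-vertex UVs in [0, 1]

def vertex_colors(uv_coords, tex):
    h, w = tex.shape[:2]
    cols = (uv_coords[:, 0] * (w - 1)).astype(int)   # u -> texture column
    rows = (uv_coords[:, 1] * (h - 1)).astype(int)   # v -> texture row
    return tex[rows, cols]                           # one RGB value per vertex

# xyz changes every frame; the texture lookup does not.
xyz_frame_t = np.random.rand(468, 3)
colors = vertex_colors(uv, texture)                  # reusable across all frames
```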
Quick questions: Could I enhance accuracy by taking multiple pictures in multiple poses. Or is Mediapipe really optimized for a frontal picture? I could for example imagine that it is a lot easier to estimate the depth of the nose from a sideview then from a frontal view.
For face tracking, I'd say that MediaPipe doesn't even try to aggregate predictions over multiple views to come up with a better metric. Under such circumstances, your best shot is to capture the most accurate single shot (probably the frontal view, yeah), use your ID card trick, and try transferring its known size onto the face landmark scale
Hi @Pythonsegmenter, Did you get a chance to go through the above comments? Thanks!
Wouldn't it be possible to use iris detection to estimate face size in the real world?
Hey @murilotimo,
Theoretically, yes, that's an approach worth exploring (please see #1891 for some additional context). However, there are a few problems with that approach which I'd like to highlight:
Potentially, I see benefits in using iris landmark prediction to enhance face depth estimation, provided the problems I mentioned are resolved. Unfortunately, my team & I didn't have enough time / priority to explore this direction, so the thoughts I'm sharing in this reply are pretty much as far as we went with the idea. If you reach the state of a nice working prototype, then please don't hesitate to share it here! It'll surely help MediaPipe users out & could later be ported into MediaPipe for everyone to share :)
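For anyone exploring this, a rough sketch of the iris idea: the human iris has a fairly constant horizontal diameter (~11.7 mm), so its apparent size in pixels plus the camera focal length gives an approximate depth. The landmark indices implied here (refine_landmarks=True adds iris points at 468-477) and the focal length are assumptions to verify for your own setup:

```python
IRIS_DIAMETER_MM = 11.7   # commonly used average horizontal iris diameter

def depth_from_iris(iris_left_px, iris_right_px, focal_length_px):
    """iris_*_px: apparent iris diameter in pixels for each eye."""
    iris_px = 0.5 * (iris_left_px + iris_right_px)
    return focal_length_px * IRIS_DIAMETER_MM / iris_px   # approximate depth in mm

print(depth_from_iris(35.0, 36.0, 900.0))  # ~300 mm for this made-up input
```

Once an approximate depth is known, other landmark distances could in principle be rescaled to millimetres; that is exactly the prototyping work that hasn't been done here yet.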
Thanks!
We are closing this issue as of now, due to lack of activity.
Hi, I was wondering if there is some way to translate the obtained face landmarks into a metric scale. I would, for example, need to know the distance from the tip of the nose to the chin.
If I read the docs, the face_geometry module seems to do the trick. However, I can't find an example anywhere of how to do this in Python.
Is there a way to get real metric data about face landmarks and, if so, is there an example for Python?
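For context, here is roughly how far the current Python Face Mesh solution gets you: you can read landmarks and measure distances, but only in normalized (non-metric) units, so an external scale reference is still needed. The landmark indices below (1 for the nose tip, 152 for the chin) are commonly cited and should be verified against the canonical mesh diagram:

```python
import cv2
import mediapipe as mp
import numpy as np

# Landmarks come back in normalized image coordinates (x, y in [0, 1], z on a
# roughly similar scale), so the distance below is NOT in metric units.
image = cv2.imread("face.jpg")                       # assumed input image
with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                     max_num_faces=1) as face_mesh:
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    lm = results.multi_face_landmarks[0].landmark
    nose = np.array([lm[1].x, lm[1].y, lm[1].z])
    chin = np.array([lm[152].x, lm[152].y, lm[152].z])
    print(np.linalg.norm(nose - chin))               # relative units only
```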