How does MediaPipe calculate the Z world coordinate?

linahourieh commented 5 months ago

Description of issue (what needs changing)

Can you please specify how MediaPipe computes the Z coordinates that the hand detection module outputs?

Clear description

This is essential information for developers who want to use MediaPipe to build applications for medical purposes.

kuaashish commented 5 months ago

Hi @linahourieh,

The hand model uses a method called "scaled orthographic projection", also known as weak perspective, which assumes a constant average depth (Z avg) for the whole hand.

Weak-perspective projection combines orthographic projection with scaling to simulate perspective projection, assuming uniform distance from the camera for all points on a 3D object.

We opt for weak perspective because it closely approximates perspective in many scenarios, particularly when the average variation in object depth (delta Z) along the line of sight is small compared to the fixed average depth (Z avg). This approach prevents distant objects from distorting due to perspective, instead uniformly scaling them up or down.
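To make the approximation concrete, here is a minimal sketch in plain Python that compares full perspective projection with weak perspective on a hand-sized set of points. The function names, the focal length of 1.0, and the sample coordinates are all hypothetical values chosen for illustration; this is not MediaPipe's internal code.

```python
def perspective_project(point, focal_length=1.0):
    """Full perspective: each point is divided by its own depth Z."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def weak_perspective_project(point, z_avg, focal_length=1.0):
    """Weak perspective: every point is scaled by the same factor f / Z avg."""
    x, y, _ = point
    scale = focal_length / z_avg
    return (scale * x, scale * y)


# Hypothetical hand-sized object: about 0.5 m from the camera,
# with a depth spread (delta Z) of only a few centimetres.
points = [(0.05, 0.02, 0.48), (-0.03, 0.04, 0.50), (0.01, -0.06, 0.52)]
z_avg = sum(p[2] for p in points) / len(points)

# Largest difference between the two projections over all coordinates.
max_error = max(
    abs(a - b)
    for p in points
    for a, b in zip(perspective_project(p), weak_perspective_project(p, z_avg))
)
print(f"Z avg = {z_avg:.2f}, max projection error = {max_error:.4f}")
```

With delta Z small relative to Z avg, the two projections stay close; increasing the depth spread of the sample points makes the gap grow.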

The model predicts a relative depth (z) with respect to the Z avg of a typical hand distance, such as when holding a phone with one hand while the other is tracked, or when both hands are near the phone. The z range is unbounded, but z scales proportionally with the x and y dimensions through the weak projection and shares the same units as x and y.

A primary landmark point (the wrist) serves as the depth reference for the other landmarks, and the depths are normalized via the weak projection with respect to the x and y coordinates.
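As a rough illustration of what a wrist-relative z means in practice, the sketch below converts a landmark's z into approximate pixel units. It assumes landmarks are given as (x, y, z) tuples normalized the way x is (by image width), which matches the statement above that z shares units with x and y. The helper name `relative_depth_px` and the sample values are hypothetical; the indices 0 (wrist) and 8 (index fingertip) follow MediaPipe's hand landmark numbering.

```python
WRIST = 0             # wrist index in the MediaPipe hand landmark list
INDEX_FINGER_TIP = 8  # index fingertip in the same list


def relative_depth_px(landmarks, index, image_width):
    """Depth of one landmark relative to the wrist, in approximate pixels.

    Assumption: z is normalized on the same scale as x, so multiplying by
    the image width gives a value comparable to pixel offsets in x.
    This is a relative quantity, not a metric distance to the camera.
    """
    z_rel = landmarks[index][2] - landmarks[WRIST][2]
    return z_rel * image_width


# Hypothetical detection result: 21 landmarks, wrist at z = 0.
landmarks = [(0.5, 0.8, 0.0)] * 21
landmarks[INDEX_FINGER_TIP] = (0.55, 0.4, -0.05)

depth_px = relative_depth_px(landmarks, INDEX_FINGER_TIP, image_width=640)
print(f"fingertip depth relative to wrist: {depth_px:.1f} px")
```

A negative value here means the fingertip is estimated to be closer to the camera than the wrist, under the convention that smaller z is nearer.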

For a deeper understanding, we recommend reviewing this thread.

Thank you!!

linahourieh commented 5 months ago

Thank you for the feedback!

So, as far as I understand, the Z-axis origin would be the wrist, and a world landmark Z would be an estimate based on scaled orthographic projection.

If I am interested in finding the trajectory of a landmark, such as a fingertip, moving in space, would I then need a camera that measures the third dimension?

google-ai-edge / mediapipe