google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Hand tracking landmarks - Z value range #742

Closed Tectu closed 3 years ago

Tectu commented 4 years ago

I am failing to find any kind of documentation or example that would explain the exact definition/behavior of the estimated Z coordinates returned by the hand tracking graph.

We're able to successfully extract the landmark data as X, Y and Z coordinates. The X and Y coordinates are clearly normalized, but the Z coordinates take values for which I have no reference: they are not normalized, they are sometimes negative, sometimes positive, and don't appear to adhere to any coherent scale. What is clear is that they are most likely relative to each other.

Could somebody shine some light on the estimated Z coordinates - especially the scale they adhere to?

brianm-sra commented 4 years ago

I am wondering about this also.

mgyong commented 4 years ago

Normalized X gives 0 to 1, where the x-origin is the origin of the image x-coordinate. Normalized Y gives 0 to 1, where the y-origin is the origin of the image y-coordinate. Normalized Z has its z-origin relative to the wrist z-origin: if Z is positive, the z-landmark coordinate is out of the page with respect to the wrist; if Z is negative, the z-landmark coordinate is into the page with respect to the wrist.

brianm-sra commented 4 years ago

Thanks for responding. Can you clarify "out of the page" and "into the page" in the case of a mobile phone (Android/iOS)? Does "into" mean closer to the device/camera?

mgyong commented 4 years ago

@brianm-sra Take a piece of paper facing you. Into the page means moving away from you, away from your face. Out of the page means moving closer to your face.

azahreba commented 4 years ago

@mgyong, did I understand correctly that:

  1. All the coordinates are normalized
  2. Z-coordinate is relative to Z-coordinate of 0-indexed output landmark (which is the "wrist")

And, if 1. is correct, could you elaborate on the normalization formula or share a link to the code? (linking to #739)

Tectu commented 4 years ago

Well, they are normalized values (range [0..1]). Simply scale by the frame dimensions to determine the pixel-based location:

int x = landmark_normal_x * image.width();
int y = landmark_normal_y * image.height();

brianm-sra commented 4 years ago

@mgyong You wrote that "Normalized Z gives 0 to 1", but that is NOT what I am seeing on Android as output from NormalizedLandmark.getZ(). Instead I have seen larger and smaller values, ranging from approximately -80.0 to +80.0 in different tests. Here are a couple of examples from a recent test with my app based on the 3D version of the hand tracker graph. Note the Z values are outside of the 0 to 1 range. I am also curious how to determine how far the "paper" is from the Android phone camera.

MediaPipeHandTracker: Landmark[0]: (0.7117575, 0.5807782, 1.283884E-4)
MediaPipeHandTracker: Landmark[1]: (0.7133818, 0.5926434, 8.609375)
MediaPipeHandTracker: Landmark[2]: (0.71597075, 0.60017556, 12.796875)
MediaPipeHandTracker: Landmark[3]: (0.7186818, 0.6051307, 16.171875)
MediaPipeHandTracker: Landmark[4]: (0.7203635, 0.6082897, 18.5625)
MediaPipeHandTracker: Landmark[5]: (0.7297122, 0.60433185, 4.828125)
MediaPipeHandTracker: Landmark[6]: (0.73730975, 0.608294, 10.921875)
MediaPipeHandTracker: Landmark[7]: (0.7415061, 0.6100682, 14.1640625)
MediaPipeHandTracker: Landmark[8]: (0.74460053, 0.6116357, 15.7890625)
MediaPipeHandTracker: Landmark[9]: (0.735119, 0.60136366, 4.2734375)
MediaPipeHandTracker: Landmark[10]: (0.7428533, 0.6047355, 10.25)
MediaPipeHandTracker: Landmark[11]: (0.7464073, 0.6058627, 14.0234375)
MediaPipeHandTracker: Landmark[12]: (0.74882674, 0.60691535, 16.6875)
MediaPipeHandTracker: Landmark[13]: (0.73774654, 0.59717864, 5.4140625)
MediaPipeHandTracker: Landmark[14]: (0.7438597, 0.59961534, 10.8671875)
MediaPipeHandTracker: Landmark[15]: (0.7469474, 0.60099816, 13.1171875)
MediaPipeHandTracker: Landmark[16]: (0.7484096, 0.6021778, 14.765625)
MediaPipeHandTracker: Landmark[17]: (0.7387837, 0.5926967, 7.7421875)
MediaPipeHandTracker: Landmark[18]: (0.74255306, 0.594779, 11.65625)
MediaPipeHandTracker: Landmark[19]: (0.7440435, 0.5965584, 13.4921875)
MediaPipeHandTracker: Landmark[20]: (0.7447367, 0.59839743, 15.46875)

MediaPipeHandTracker: Landmark[0]: (0.58928907, 0.61223286, -5.173683E-4)
MediaPipeHandTracker: Landmark[1]: (0.5756706, 0.6165022, 9.328125)
MediaPipeHandTracker: Landmark[2]: (0.56418365, 0.6183474, 10.390625)
MediaPipeHandTracker: Landmark[3]: (0.5544628, 0.6195656, 11.25)
MediaPipeHandTracker: Landmark[4]: (0.54804426, 0.6200474, 12.1953125)
MediaPipeHandTracker: Landmark[5]: (0.5571172, 0.6139561, -11.5390625)
MediaPipeHandTracker: Landmark[6]: (0.5400312, 0.6170851, -12.6015625)
MediaPipeHandTracker: Landmark[7]: (0.53433615, 0.620509, -6.7421875)
MediaPipeHandTracker: Landmark[8]: (0.531901, 0.6224792, -1.4365234)
MediaPipeHandTracker: Landmark[9]: (0.556367, 0.61244273, -14.953125)
MediaPipeHandTracker: Landmark[10]: (0.537187, 0.6163406, -19.078125)
MediaPipeHandTracker: Landmark[11]: (0.53178513, 0.6196333, -13.3203125)
MediaPipeHandTracker: Landmark[12]: (0.5302227, 0.62104964, -7.3554688)
MediaPipeHandTracker: Landmark[13]: (0.5572119, 0.6120599, -15.9609375)
MediaPipeHandTracker: Landmark[14]: (0.5396851, 0.6155686, -17.671875)
MediaPipeHandTracker: Landmark[15]: (0.5339748, 0.6182261, -10.6484375)
MediaPipeHandTracker: Landmark[16]: (0.5321559, 0.61939704, -4.359375)
MediaPipeHandTracker: Landmark[17]: (0.5594273, 0.61234933, -15.6015625)
MediaPipeHandTracker: Landmark[18]: (0.54581386, 0.6152981, -15.5390625)
MediaPipeHandTracker: Landmark[19]: (0.54055375, 0.6172483, -11.0625)
MediaPipeHandTracker: Landmark[20]: (0.5383292, 0.6179763, -6.3984375)

mcclanahoochie commented 4 years ago

The hand model uses "scaled orthographic projection" (or, weak perspective), with some fixed average depth (Z avg).

Weak-perspective projection is an orthographic projection plus a scaling, which serves to approximate perspective projection by assuming that all points on a 3D object are at roughly the same distance from the camera.

The justification for using weak-perspective is that in many cases it approximates perspective closely. In particular for situations when the average variation of the depth of the object (delta Z) along the line of sight is small, compared to the fixed average depth (Z avg). This also allows objects at a distance not to distort due to perspective, but to only uniformly scale up/down.

The z predicted by the model is relative depth, based on the Z avg of "typical hand depth" (in the case of holding a phone with one hand while the other hand is tracked, or being close to the phone and showing both hands). Also, the range of z is unconstrained, but it is scaled proportionally along with x and y (via weak projection), and expressed in the same units as x & y.

There is a root landmark point (wrist) that all the other landmark depths are relative to (again normalized via weak projection w.r.t. x & y).
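
As a concrete illustration of the weak-perspective idea, here is a minimal numpy sketch (not MediaPipe's actual implementation; the function name, f and z_avg are placeholder assumptions) showing how a single shared scale f / z_avg replaces the per-point perspective division:

import numpy as np

def weak_perspective_project(points_3d, f, z_avg):
  # points_3d: (N, 3) camera-space (X, Y, Z); f: focal length; z_avg: fixed average depth
  s = f / z_avg                                      # a single scale shared by all points
  projected = np.asarray(points_3d, dtype=float) * s # orthographic projection plus uniform scaling of x, y and z
  projected[:, 2] -= projected[0, 2]                 # express depth relative to a root point, e.g. the wrist
  return projected

Under full perspective each point would instead be divided by its own Z, so this only approximates perspective when the hand's depth variation is small compared to z_avg.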

brianm-sra commented 4 years ago

I added some code in my Android app to keep track of the minimum and maximum Z values observed, and tested while moving my hand near/far, at different angles, etc. At the end of the test, the minimum Z observed was -198.0 and the maximum Z observed was 168.0. Does this make sense? Are these in line with the expected minimum and maximum Z values for the 3D hand tracking graph? These are coming from NormalizedLandmark.getZ().

chensisi0730 commented 4 years ago

How can I judge whether the palm is turning or not?

LeDuySon commented 4 years ago

@brianm-sra Did you figure out how far the "paper" is? Can anyone explain clearly how to convert z to 3D camera coordinates?

rajan8garg commented 4 years ago

Hi @jiuqiant, @mgyong, @Tectu, @mcclanahoochie, please help.

After reading all of the above, I still don't know what scale the z value depends on, or how the z coordinate changes. x and y depend on which pixel of the screen the landmark lies on. It seems that z should be used as the depth, i.e. the distance of the landmark from the device camera, but please clarify how its value changes and what it depends on. Please help.

lbouis commented 4 years ago

Any update on this? I am also not clear on how to use the z value to get 3D coordinates. Is there a way to get the value of Z avg at least?

Monika-Saleeb commented 4 years ago

[Quotes @brianm-sra's earlier comment and landmark logs in full.]

Can you please share how you were able to get these landmark locations? I've been trying but nothing's working for me.

ozgurshn commented 3 years ago

As I tried on iOS, the wrist landmark's (landmark 0) z position is relative to the camera, and it changed between -0.0003 (nearest) and -0.0001 (farthest) in my tests. The farthest value depends on my arm length; it may be different in your case. And it seems the other landmarks' z estimations are relative to the wrist landmark. From the camera's perspective, if the other landmark is beyond the wrist the value is positive, and if it is closer than the wrist it is negative. Thanks for the explanation @mcclanahoochie.

ozgurshn commented 3 years ago

On iOS, I see landmark[0]'s Z value change a lot when I rotate my hand, even if I keep my wrist stable. I guess this could be because of bias in the image dataset or the Z normalization. I recorded some values to get an idea of how z is normalized: wristZ (abs(landmark[0]'s Z) * image_width), the Y distance from the wrist to the top of the middle finger (wristTotop), the square of wristTotop, and the difference between the wrist Z and the middle finger top Z (zDist), and drew a chart. It seems there is an inverse proportion between the wrist Z and the square of wristTotop. I'm planning to use the square of wristTotop to denormalize Z, in order to prevent hand rotation from affecting the wrist's Z. While recording this, I only recorded my left hand, keeping my wrist stable and rotating my hand up to 30 degrees to the right. [chart: leftHand]

pani0815 commented 3 years ago

Could someone please explain clearly what the value of the Z axis represents? I also found that my Z-axis values change a lot even if I move my hand only a little.

fanzhanggoogle commented 3 years ago

Hi, the z-coordinate is centered at the wrist and we use an arbitrary scaling factor to normalize the output from the model. You can consider it more like 2.5D than true 3D in world coordinates. The idea is to provide the relative position of the finger joints, e.g. which joint is closer. We are working on a new version of the hand model that should provide a much better z-coordinate. The previous version is very limited, as we only used a projected synthetic dataset.
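
As an illustration of using the values only for relative ordering, here is a minimal sketch with the Python solution API (assumptions: mediapipe and opencv-python are installed, hand.jpg is a hypothetical input image; the solution docs describe a smaller z as being closer to the camera):

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
  image = cv2.imread("hand.jpg")  # hypothetical input image
  results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
  if results.multi_hand_landmarks:
    lm = results.multi_hand_landmarks[0].landmark
    wrist_z = lm[mp_hands.HandLandmark.WRIST].z            # ~0 by construction
    tip_z = lm[mp_hands.HandLandmark.INDEX_FINGER_TIP].z
    # only the ordering is meaningful here, not the magnitude
    print("index fingertip closer to camera than wrist:", tip_z < wrist_z)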

Enny13 commented 3 years ago

Just to be precise, am I correct in the following assumptions about calculating z and the coordinates in general? Let l be the 21x3 landmarks of one hand in absolute coordinates with the origin at the camera, s the side of the square picture, f the focal length of the camera (i.e. the absolute z coordinate of the projection plane), and z_avg the average z-coordinate. Then:

import numpy as np

def absolute_to_mediapipe(l, s, z_avg, f):
  # l: (21, 3) numpy array of absolute landmarks; s: side of the square image;
  # z_avg: average depth; f: focal length
  l_normalized = l / s
  z_avg_normalized = z_avg / s

  # weak orthographic projection
  l_weakly_projected = np.empty((21, 3))
  l_weakly_projected[:, [0, 1]] = l_normalized[:, [0, 1]] / z_avg * f
  l_weakly_projected[:, 2] = l_normalized[:, 2] / z_avg_normalized  # == l[:, 2] / z_avg

  # move the z origin to the wrist (landmark 0)
  l_wrist_relative = l_weakly_projected.copy()
  l_wrist_relative[:, 2] -= l_weakly_projected[0, 2]

  # move the x, y origin to the top-left corner (we assume the y axis is already looking down)
  l_decentered = l_wrist_relative.copy()
  l_decentered[:, [0, 1]] += 0.5

  return l_decentered

zhongyi-zhou commented 3 years ago

[Quotes @fanzhanggoogle's comment above in full.]

Does it mean that the current version cannot provide exact 3D coordinate values, and the current z value can only be considered as a reference to tell which joint is closer to the wrist?

gitam2869 commented 3 years ago

Well, they are normalized values (range [0..1]). Simply scale by the frame dimensions to determine the pixel-based location:

int x = landmark_normal_x * image.width();
int y = landmark_normal_y * image.height();

It's working on some devices, but on other devices it's not showing the correct output. Do you have any other solution for it? If yes, please let us know. Thank you.

matanox commented 3 years ago

Interestingly, version 0.8.6 claims 3D positioning for the Pose and Holistic models, where the Holistic model, as I understand it, includes the Hands model in its pipeline.

It would be nice if someone in the know could say whether this new version implies any major change to the semantics or accuracy of the hands model.

sgowroji commented 3 years ago

We are closing this issue now as we see the main query is answered in this thread at https://github.com/google/mediapipe/issues/742#issuecomment-637162399. Please open new issues for new questions rather than adding them to this thread. Thanks!

matanox commented 3 years ago

Thanks. So I gather from that comment that version 0.8.6's claim of 3D positioning for the Pose and Holistic models does not imply any changes to the hands model, which is helpful to know. Thank you.

buaacyw commented 3 years ago

[Quotes @mcclanahoochie's weak-perspective explanation above in full.]

Excuse me, what's the meaning of "scaled proportionally along with x and y (via weak projection)"? I haven't found any information about how to scale the Z axis like below. Thanks!

int x = landmark_normal_x * image.width();
int y = landmark_normal_y * image.height();

FathiMahdi commented 3 years ago

[Quotes @brianm-sra's earlier comment and landmark logs in full.]

@Monika-Saleeb how did you get the landmarks for the individual points?

matanox commented 2 years ago

Sorry for re-opening. I believe the matter can be re-addressed for the better, now that world landmarks are a feature in the newest pipeline releases. The terse documentation for world landmarks vs. the weak-projection landmarks could probably be elaborated in a clarifying way, if needed from the point of view of the coordinate system of the network's training data (the model card doesn't help much). Since we don't have the calibration details of the original training data, nor the loss function used to train the latest network to produce both types of output, a more canonical definition is strongly called for.

I'm not sure whether still-ambiguous statements like this one constitute a real definition:

Real-world 3D coordinates in meters with the origin at the hand’s approximate geometric center

It would probably be better to replicate in the documentation something like the information below, which is repeated in the comments of all hands-related pipeline definition files (assuming it still fits the current version of the model and pipeline), as it is less open to ambiguous interpretation:

# World landmarks are real-world 3D coordinates in meters with the origin in the
# center of the hand bounding box calculated from the landmarks.
#
# WORLD_LANDMARKS shares the same landmark topology as LANDMARKS. However,
# LANDMARKS provides coordinates (in pixels) of a 3D object projected onto the
# 2D image surface, while WORLD_LANDMARKS provides coordinates (in meters) of
# the 3D object itself.

But even the above does not remove several ambiguities: for example, in what sense are they real-world coordinates? Digging inside the pipeline, it looks like the neural network itself outputs both types of landmarks, rather than one being a downstream MediaPipe transformation of the other.
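
For what it's worth, both output types are exposed side by side in the Python solution API; a minimal sketch (assuming a mediapipe release recent enough to provide multi_hand_world_landmarks, and a hypothetical hand.jpg input):

import cv2
import mediapipe as mp

with mp.solutions.hands.Hands(static_image_mode=True) as hands:
  img = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)
  res = hands.process(img)
  if res.multi_hand_landmarks:
    lm = res.multi_hand_landmarks[0].landmark[0]          # wrist, normalized image-space coords
    wlm = res.multi_hand_world_landmarks[0].landmark[0]   # wrist, metric hand-centered coords
    print("image-space wrist:", lm.x, lm.y, lm.z)
    print("world-space wrist (m):", wlm.x, wlm.y, wlm.z)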

matanox commented 2 years ago

It becomes clear to me now that the world landmarks are just a set of meter-valued predictions of the distances of the landmarks from the middle of the hand as detected, roughly from the middle finger MCP. They are world landmarks not in terms of any distance from the camera or position in the scene, but in terms of the geometry of the hand itself (the hand geometry as it is in the world, regardless of its location in the scene), rather than in image-capture geometry. Of course, the predictions will be noisy, since they are only as accurate as the trained model gets on real-world images.

So the world landmarks describe the hand object in terms of the distances of its landmarks from one another, regardless of the position in the captured image, which the non-world landmarks predict. And of course it's noisy, and somewhat meaningless in being based on a fuzzy population average, given that the model was not trained to explicitly learn sizes. It can't really predict the distances between the landmarks other than in some "average" sense over the distribution of hand sizes it has seen during training. But this is a meaningful signal in terms of hand pose; it corresponds to what a pose is: the relationship of the body parts to one another.

At the level of detail, the origin of this coordinate system is derived from "the center of the hand bounding box calculated from the landmarks" (implied in a code comment), a calculation implemented in a somewhat heuristic calculator. It must be a fair heuristic, in that it smooths the error magnitude that would otherwise arise on average if relying on just a single predicted landmark (such as the index finger MCP alone) for the origin.

So the Z value of the world landmarks is an estimated distance from that calculated origin, purely describing the hand geometry (along the image axes), and it can be used for hand pose calculations (and obviously not for distance from the camera and the like). To escape the image axes, you'd get the hand geometry by calculating the angles between these world-landmark coordinates.
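
For example, a small sketch of that kind of pose calculation (the joint_angle helper and the use of indices 5/6/7 for the index MCP/PIP/DIP are illustrative assumptions; world stands for a list of 21 world-landmark points):

import numpy as np

def joint_angle(a, b, c):
  # angle in degrees at point b, formed by the segments b->a and b->c
  a, b, c = np.asarray(a), np.asarray(b), np.asarray(c)
  v1, v2 = a - b, c - b
  cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
  return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# world[i] = (x, y, z) in meters for landmark i, e.g. taken from multi_hand_world_landmarks
# index_pip_bend = joint_angle(world[5], world[6], world[7])  # index MCP, PIP, DIP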

matanox commented 2 years ago

Regarding the non-world landmarks, their x, y part is simply the viewport (image) position as predicted by the pipeline, no complications there. They range from 0 to 1 so as to fully span the image width and height.

Unlike the case with the "world landmarks", the z value of the non-world landmarks is always zero-ish for the wrist landmark, which serves as the z-axis origin for the z values of all other hand landmarks.

So the z value of the non-world landmarks could be used for comparing z-axis order between landmarks, and between landmarks across frames, if you're not too concerned about weak projection. The z values vary with the landmarks' proximity to the camera along its z axis, due to the scaling mentioned in earlier comments. This is unlike the z values of the "world landmarks", which are consistent (up to prediction error) regardless of their position along the camera/image z axis.

So as is, it may be a little ambiguous to rely on the z values of the non-world landmarks as a consistent depth measure.

MehdiGol commented 2 years ago

Based on what I've read and understood, the Z value does not act as a world Z axis, so it is not possible to estimate the arm angle using normal math formulas. Is that right?

gianluca-amprimo commented 2 years ago

Hello, can somebody explain to me whether the world coordinates are inferred from the normalized landmarks plus some technique, or whether they are estimated by the neural network directly, like the normalized coordinates? Is there any official documentation I can cite in my paper?

MiroslavPetrik commented 2 years ago

Reading through the issue, I still can't find the correct formula to denormalise the z coordinate.

To denormalise x, y of landmark i, I simply multiply by screen size:

const x = landmarks[i].x * video.width;
const y = landmarks[i].y * video.height;

For z it should be something like

// index 0 is wrist, which is root for the rest of landmarks
const wristZ = landmarks[0].z * video.width;

// index i is the rest of landmarks
const z = wristZ + landmarks[i].z;

But this does not work; a few problems:

  • magnitude - the wrist z value has an exponent of e-7, which is a tiny number, while the rest of the landmarks are around 0.002
  • unclear whether the multiplication by the screen/camera size should happen for each landmark or for the wrist only?

Here is a Python implementation which does a 3D projection from the normalised landmark coordinates (image mode): https://github.com/geaxgx/depthai_hand_tracker/tree/main/examples/3d_visualization And here is the line where the denormalisation happens: https://github.com/geaxgx/depthai_hand_tracker/blob/main/examples/3d_visualization/demo.py#L76 But the landmarks there appear to be post-processed points, not raw data from MediaPipe, so I still can't get it working. Also note the magic number 0.4, which is the scaling factor mentioned above in https://github.com/google/mediapipe/issues/742#issuecomment-865663115; it seems important, given that the Python implementation uses it and works properly. Maybe the author @geaxgx could help with the formula.

So z denormalisation is needed in order to get an approximate wrist depth, so that we have all 3 coordinates and can unproject into the 3D world.
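
Putting the comments in this thread together, here is a speculative sketch of such a denormalisation, not an official formula: it assumes z can be scaled by the image width like x, and treats the extra factor (the thread mentions 0.4 in the linked code) as an empirical constant:

def denormalize(landmarks, width, height, z_scale=1.0):
  # landmarks: objects with normalized .x, .y, .z fields from the hand tracker
  # z_scale: empirical fudge factor; the linked code reportedly uses about 0.4
  points = []
  for lm in landmarks:
    x = lm.x * width
    y = lm.y * height
    z = lm.z * width * z_scale  # z is wrist-relative; scale it on the same axis as x
    points.append((x, y, z))
  return points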

Stephenfang51 commented 2 years ago

[Quotes @MiroslavPetrik's comment above in full.]

Hi, did you figure out how depthai denormalises the z-coordinate? Thanks!

matanox commented 2 years ago

mcclanahoochie thanks for having taken the time to draft that. I have revisited this every few months while learning about projective geometry. Maybe this discussion fits better outside a ticket, scoped more broadly and patiently as "model output semantics". Perhaps you might be able to comment below and help wrap up how things stand:

As I understand it, the model was designed to receive as its input (during training and during inference) (a) regular consumer-grade camera full-perspective images, where depth information is likely not included, and (b) synthetic data drawn from hand models, where relative depth is included. Or perhaps during training the depth information was predicted from the synthetic data by a proxy model, to augment the first dataset.

As it relates to the pipeline outputs, my understanding is that it breaks down to two sets of output:

  1. A weak orthographic projection output set ('landmarks' in mediapipe api parlance).

  2. A second output set depicting the relative positions of the hand's landmarks to some hand middle ('world landmarks' in the documentation parlance).

As I understand it, those relative positions carry little credibility in terms of absolute real-world sizes, other than lending themselves to segment-proportion and angle calculations, as the model's notion of size draws from the distribution of sizes in its training data, and its power to predict the size of a hand in a picture would be weak (it would be surprising if it could accurately predict size from the visual features of a hand alone).

I think that explaining the role of "typical hand depth" in the z average would bring in greater clarity.

Maybe an explanation that closely follows how the model's training inputs were prepared, and/or the training loss, would be the most conducive way to introduce clarity. It would be wonderful to have that kind of delineation, and also to motivate the choice of this projection model (as hands are not objects of small depth variation, other than in a small subset of hand poses). It would be nice to understand why this model was chosen for the pipeline, and how the (x, y, z) landmark prediction trickles down to those two output sets.

I think the quoted comment has been the only online discussion of the output semantics beyond the rightfully terse wording in the documentation, so it would be just wonderful to have a more rigorous explanation. Apologies for the edits.

koegl commented 2 years ago

To get a more intuitive understanding of how the model coordinates work I recommend taking a look at this web-implementation: https://codepen.io/mediapipe/pen/RwGWYJw

You can see that the output model is never translated; only its orientation changes.

wearitar commented 2 years ago

[Quotes @mcclanahoochie's weak-perspective explanation above in full.]

Where can we find the details about the "typical hand depth"? For instance, MediaPipe Face provides a canonical face model together with weights that can be used to calculate scale and project coordinates. Is it possible to calculate scale with Hands at all?

matanox commented 1 year ago

Unlike the Z values of the "world landmarks", the Z values of the "landmarks" are not proportional to their sibling X and Y axes; they are only artificially normalized to roughly the same range (and even that not exactly).

These Z values are clearly relative to the predicted distance from the camera, so extracting movement along the Z axis from these "landmarks" Z values is very approachable, whereas extracting an actual function mapping their values to a predicted distance would require more thorough work.