google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

HAND WORLD LANDMARKS collapse for the back of the hand #5156

Open DehTop opened 6 months ago

DehTop commented 6 months ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

None

OS Platform and Distribution

Linux Ubuntu 20

MediaPipe Tasks SDK version

0.10.9

Task name (e.g. Image classification, Gesture recognition etc.)

Hand landmark detection

Programming Language and version (e.g. C++, Python, Java)

Python

Describe the actual behavior

In MediaPipe v0.10.9, when detecting and visualizing the back of the hand, the 3D world landmarks of the palm collapse, especially at the finger MCPs, producing distorted, unusable results. This happens consistently across different lighting conditions, poses, and hands.

Describe the expected behaviour

The 3D landmarks should be consistent even when the back of the hand is shown. At the very least, they should not collapse.

Standalone code/steps you may have used to try to get what you need

None -- simply run the hand landmarker as described in https://developers.google.com/mediapipe/solutions/vision/hand_landmarker/python and visualize the 3D output (world landmarks).
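For reference, a minimal repro sketch following the documented Tasks API linked above. The image path and model path are placeholders; `hand_landmarker.task` must be downloaded separately as described in the docs.

```python
import numpy as np

def world_landmarks_to_array(landmarks):
    """Stack the (x, y, z) of each world landmark into an (N, 3) metric array."""
    return np.array([[lm.x, lm.y, lm.z] for lm in landmarks])

def detect_world_landmarks(image_path, model_path="hand_landmarker.task"):
    """Run the Hand Landmarker on one image and return its world landmarks."""
    import mediapipe as mp
    from mediapipe.tasks import python as mp_python
    from mediapipe.tasks.python import vision

    options = vision.HandLandmarkerOptions(
        base_options=mp_python.BaseOptions(model_asset_path=model_path),
        num_hands=1)
    detector = vision.HandLandmarker.create_from_options(options)
    result = detector.detect(mp.Image.create_from_file(image_path))
    return world_landmarks_to_array(result.hand_world_landmarks[0])
```

Plotting the returned (21, 3) array with any 3D scatter tool reproduces the collapsed-palm shapes shown in the screenshots below.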

Other info / Complete Logs

This issue is the same as https://github.com/google/mediapipe/issues/3994, reproduced with the new release (MediaPipe v0.10.9).
DehTop commented 6 months ago

attaching here a couple of examples, also see https://github.com/google/mediapipe/issues/3994 for more (same issue)

[Screenshot from 2024-02-20 10-44-28] [Screenshot from 2024-02-20 10-46-35]

wondering if anyone has a workaround/solution for this!

many thanks

schmidt-sebastian commented 6 months ago

Thanks for raising this. This looks like an issue with the model, but unfortunately we currently do not have plans to update our models.

DehTop commented 5 months ago

Hello @schmidt-sebastian! I've been digging into this topic a bit more and wanted to share some interesting findings.

[Fig_1_mp_issue] [Fig_2_mp_issue]

  1. The model generally struggles with world landmarks for the back of the hand.
  2. However, the model successfully detects hand world landmarks when the fingers are mostly pointing up (poses in Figure 1).
  3. In other poses (see Figure 2), the world landmarks undergo a strange collapse.
  4. I wonder: how can the model succeed with the poses in Figure 1 but fail with those in Figure 2? The inputs are conceptually the same; most computer vision models should be able to handle the same input when it is rotated.
  5. With this intuition in mind, I tried rotating the input frames in Fig. 2 so that the hand would point in the same direction as in Fig. 1. This produced almost exactly the same result as the non-rotated version (no improvement!).
  6. Surprised by this behavior, I dug into the code and realized that MediaPipe already rotates the hand internally according to a similar logic here -- this is why 5) produces no improvement: there is already a rotation/alignment step in the MediaPipe graph.
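To make point 6 concrete: the internal alignment rotates the crop so that the wrist-to-middle-MCP vector points straight up, which is why pre-rotating the input changes nothing (the graph re-aligns it anyway). Below is a simplified sketch of that kind of alignment in pure Python; it mirrors the idea, not MediaPipe's exact graph code.

```python
import math

def roi_rotation(wrist_xy, middle_mcp_xy, target_angle=math.pi / 2):
    """Rotation (radians) that would make the wrist -> middle-MCP vector
    point straight up. Simplified sketch of MediaPipe's internal ROI
    alignment, not the exact graph implementation."""
    dx = middle_mcp_xy[0] - wrist_xy[0]
    # Image y grows downward, hence the sign flip.
    dy = -(middle_mcp_xy[1] - wrist_xy[1])
    rotation = target_angle - math.atan2(dy, dx)
    # Normalize to (-pi, pi].
    return rotation - 2 * math.pi * math.floor((rotation + math.pi) / (2 * math.pi))
```

Since this rotation is recomputed from the detected landmarks on every frame, any rotation applied to the input image is cancelled out before the landmark model ever sees the crop, consistent with the observation in point 5.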

At this point, I suspect that there could be a bug in how this internal rotation/alignment is performed when the back of the hand is shown to the model! In my mind, this is the only explanation for points 4) and 5).

Wondering if there is a relatively easy fix at inference time for this issue, instead of having to re-train the model.
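Pending a proper fix, one hypothetical inference-time mitigation is to at least detect and discard collapsed frames. The sketch below uses the standard hand landmark indices (5, 9, 13, 17 are the index/middle/ring/pinky MCPs) and a hypothetical threshold on the index-to-pinky MCP span; the threshold value is an assumption, not a measured constant.

```python
import numpy as np

# Finger MCP indices in the 21-point MediaPipe hand model.
MCP_IDS = (5, 9, 13, 17)

def mcps_collapsed(world_landmarks, min_span_m=0.02):
    """Heuristic collapse detector (sketch). world_landmarks: (21, 3) array
    of world landmarks in meters. Returns True when the index-to-pinky MCP
    span is implausibly small, suggesting the collapse described here."""
    pts = np.asarray(world_landmarks)[list(MCP_IDS)]
    span = np.linalg.norm(pts[0] - pts[-1])  # index MCP to pinky MCP
    return bool(span < min_span_m)
```

This only flags bad frames (e.g. so a downstream consumer can fall back to the last good pose); it does not recover correct landmarks.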

many thanks for your work, hope this is useful and can ring a bell in someone's mind for a quick inference fix!