google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Holistic solution `hand_detections_from_pose_to_rects_calculator` edge cases #5373

Open AmitMY opened 6 months ago

AmitMY commented 6 months ago

Solution

Holistic

Describe the actual behavior

When using the holistic solution, it first estimates the body pose and three points for each hand (wrist, index MCP, and pinky MCP). From these points, the calculator estimates a rectangle for the hand area of interest, which should cover the full hand, and this rectangle is sent on for hand keypoint estimation. https://github.com/google/mediapipe/blob/master/mediapipe/modules/holistic_landmark/calculators/hand_detections_from_pose_to_rects_calculator.cc#L110-L121
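
To make the discussion concrete, here is a minimal sketch of this kind of rectangle estimation as I understand it. The function name, the 2:1 weighting toward the index MCP, and the `scale` factor are illustrative assumptions and have not been checked line by line against the linked calculator.

```python
import math


def hand_roi_from_pose(wrist, index_mcp, pinky_mcp, scale=2.0):
    """Return (center_x, center_y, size, rotation) of a square hand ROI.

    Inputs are (x, y) pixel coordinates. The weighting and `scale` are
    illustrative assumptions, not values confirmed from the calculator.
    """
    # Approximate the middle-finger knuckle between the index and pinky knuckles.
    mid_x = (2.0 * index_mcp[0] + pinky_mcp[0]) / 3.0
    mid_y = (2.0 * index_mcp[1] + pinky_mcp[1]) / 3.0

    # The box size grows with the 2D wrist-to-knuckle distance.
    dx = mid_x - wrist[0]
    dy = mid_y - wrist[1]
    size = scale * math.hypot(dx, dy)

    # Rotation is zero when the wrist-to-knuckle direction points straight up
    # in the image (image y grows downward, hence the sign flip on dy).
    rotation = math.pi / 2 - math.atan2(-dy, dx)
    return mid_x, mid_y, size, rotation
```

Because `size` depends only on a 2D distance, a hand pointing toward the camera (short projected wrist-to-knuckle distance) yields a very small box under this scheme.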

There are some edge cases that, in my view, do not produce correct hand rects, so hand estimation fails. This happens when the hand estimation from the pose is off, or when the plane of the hand (the triangle formed by the three points of interest) is perpendicular to the camera, i.e. seen edge-on. When this happens, the "area" of interest is tiny, so the crop is wrong.
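
The edge-on case can be illustrated with a toy example. This sketch assumes an orthographic camera looking along the z-axis and uses made-up 3D coordinates for the three hand points. It only shows how the projected triangle shrinks as the hand plane tilts away from the camera.

```python
import math


def projected_triangle_area(points_3d, tilt_deg):
    """Area of the image-plane projection of a 3D triangle after tilting it
    about the x-axis by `tilt_deg` (orthographic camera looking along z)."""
    t = math.radians(tilt_deg)
    # Rotate about the x-axis, then drop z to project onto the image plane.
    pts = [(x, y * math.cos(t) - z * math.sin(t)) for x, y, z in points_3d]
    (x0, y0), (x1, y1), (x2, y2) = pts
    # Cross-product (shoelace) formula for the area of a 2D triangle.
    return abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)) / 2.0


# Hypothetical wrist, index MCP, and pinky MCP positions (arbitrary units),
# lying in the z = 0 plane, i.e. facing the camera before any tilt.
hand = [(0.0, 0.0, 0.0), (-2.0, 9.0, 0.0), (3.0, 8.0, 0.0)]

for tilt in (0, 45, 80, 90):
    print(tilt, round(projected_triangle_area(hand, tilt), 2))
```

In this setup the projected area falls off with the cosine of the tilt and is effectively zero at 90 degrees, which is the situation where a crop derived from the projected points becomes unreliable.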

Hands model

If we look at the hands model that uses hand detection, it works well enough:

https://github.com/google/mediapipe/assets/5757359/7183a017-1256-48fa-a20d-083b0807cf47

Pose model

The pose model also works well enough, correctly predicting the general hand position:

https://github.com/google/mediapipe/assets/5757359/a06b31a6-8a11-43d5-9bc3-e77ca73e80fc

Holistic model (pose + area + hands)

The holistic model works really well when the hand is parallel to the camera, but not when it is parallel to the floor.

https://github.com/google/mediapipe/assets/5757359/8ac50e97-d50b-4e23-b6d5-41b52fbdf3cf

If I recreate the holistic ROI cropping behavior (without rotation, and with rotation), I get the following. Note how the crop goes crazy.

[Images: crops, crops (1)]

Describe the expected behavior

I expect the area of interest to always be correct when the hand estimation is correct. When it is not, the hand model should be activated using the hand detection model as a backup.

Possible solution

Before the recropping model

  1. If the hand pose estimation is bad, use the hands detection model.
  2. In the area-of-interest calculation, separate two cases based on the hand plane (calculating the normal is trivial): a. if the hand plane is parallel to the camera, do as you do now; b. if the hand plane is perpendicular to the camera, use a different calculation.
  3. A generic solution would be to learn a tiny regression model, from the hand points (4 points in total) to the hand crop as predicted by the hand detection model. This is generic since it can just train on data, and it is probably the best area of interest because it will predict bounding boxes similar to those the keypoints model was trained on.
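
For option 2, a minimal sketch of the plane test could look like the following. It assumes 3D coordinates (x, y, z) are available for the three hand points and that the camera looks along the z-axis. The function names and the `min_facing` threshold are hypothetical and not part of MediaPipe.

```python
import math


def hand_plane_normal(wrist, index_mcp, pinky_mcp):
    """Unit normal of the plane through three 3D points (x, y, z)."""
    u = [b - a for a, b in zip(wrist, index_mcp)]
    v = [b - a for a, b in zip(wrist, pinky_mcp)]
    # Cross product u x v is perpendicular to the hand plane.
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0.0:
        raise ValueError("points are collinear; the hand plane is undefined")
    return [c / norm for c in n]


def is_edge_on(wrist, index_mcp, pinky_mcp, min_facing=0.5):
    """True when the hand plane is close to edge-on to the camera.

    Assumes the camera looks along z. `min_facing` is a hypothetical
    threshold on |n_z|: 1.0 means the hand fully faces the camera,
    0.0 means it is seen exactly edge-on.
    """
    n = hand_plane_normal(wrist, index_mcp, pinky_mcp)
    return abs(n[2]) < min_facing
```

When `is_edge_on` returns `False`, the current calculation could be kept (case a). When it returns `True`, a different rule would be needed (case b), for example one based on 3D distances rather than projected ones.
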
AmitMY commented 6 months ago

Alright, here is a solution

Optimizing_Hand_Area_Detection_in_MediaPipe_Holistic.pdf

https://github.com/sign-language-processing/mediapipe-hand-crop-fix

I would have preferred to write a mathematical solution, so it would be easy to contribute in a PR, but I ended up with a model that is extremely lightweight.

I would be happy to know what you think, and whether you could optimize this (and train it on a larger dataset).