Holistic solution `hand_detections_from_pose_to_rects_calculator` edge cases

Solution

Holistic

Describe the actual behavior

When using the holistic solution, the pipeline first estimates the body pose, including three points for each hand (wrist, index MCP, and pinky MCP). The calculator then estimates a rectangle for the hand area of interest, which should cover the full hand, and that crop is sent on for hand keypoint estimation. https://github.com/google/mediapipe/blob/master/mediapipe/modules/holistic_landmark/calculators/hand_detections_from_pose_to_rects_calculator.cc#L110-L121

There are some edge cases that, in my view, do not produce correct hand rects, so the hand keypoint estimation fails. This happens when the pose-based hand estimation is off, or when the plane of the hand (the triangle formed by the three points of interest) lies directly perpendicular to the camera, that is, the hand is seen edge-on. In that case the three points nearly collapse onto each other in the image, the "area" of interest is tiny, and the crop is wrong.
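To make the failure mode concrete, here is a minimal sketch of the rect estimation as I understand it from the linked lines. It assumes the crop is a square centered on an estimated middle-finger base (a weighted mean of the index and pinky MCPs) with a side of twice the 2D distance from that point to the wrist. The function name and the sample coordinates are hypothetical, and the real calculator may differ in details such as rotation and later scaling.

```python
import math


def hand_roi_from_pose(wrist, index_mcp, pinky_mcp):
    """Return (center_x, center_y, box_size) from three 2D pose points.

    Assumption: mirrors the logic in the linked calculator lines; this is
    an illustration, not the actual implementation.
    """
    # Estimate the middle-finger base as a weighted mean of index and pinky.
    x_middle = (2.0 * index_mcp[0] + pinky_mcp[0]) / 3.0
    y_middle = (2.0 * index_mcp[1] + pinky_mcp[1]) / 3.0
    # Box side: twice the 2D distance from that point to the wrist.
    box_size = 2.0 * math.hypot(x_middle - wrist[0], y_middle - wrist[1])
    return x_middle, y_middle, box_size


# Hand facing the camera: the three points are well spread in the image.
facing = hand_roi_from_pose((100, 200), (90, 120), (130, 125))
# Hand seen edge-on: the knuckles project almost onto the wrist.
edge_on = hand_roi_from_pose((100, 200), (98, 195), (104, 196))

print(f"facing box size:  {facing[2]:.1f} px")
print(f"edge-on box size: {edge_on[2]:.1f} px")
```

With the made-up pixel coordinates above, the edge-on box is more than ten times smaller than the facing one, even though the physical hand size is the same. Because the box size depends only on projected 2D distances, any foreshortening shrinks the crop.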

Hands model

If we look at the hands model that uses hand detection, it works well enough:

https://github.com/google/mediapipe/assets/5757359/7183a017-1256-48fa-a20d-083b0807cf47

Pose model

The pose model also works well enough, correctly predicting the general hand position:

https://github.com/google/mediapipe/assets/5757359/a06b31a6-8a11-43d5-9bc3-e77ca73e80fc

Holistic model (pose + area + hands)

The holistic model works really well when the hand is parallel to the camera, but not when it is parallel to the floor.

https://github.com/google/mediapipe/assets/5757359/8ac50e97-d50b-4e23-b6d5-41b52fbdf3cf

If I recreate the holistic ROI cropping behavior (without rotation, and with rotation), I get the following. Note how the crop becomes erratic.

[Images: "crops" and "crops (1)"]

Describe the expected behavior

I expect the area of interest to always be correct when the hand estimation is correct. When it is not, the hand model should still be activated, using the hand detection model as a backup.

Possible solution

Before the recropping model:

1. If the hand pose estimation is bad, use the hands detection model.
2. In the area-of-interest calculation, separate two cases based on the hand plane (calculating the normal is trivial):
   a. if the hand plane is parallel to the camera, do as you do now;
   b. if the hand plane is perpendicular to the camera, use a different calculation.
3. A more generic solution would be to train a tiny regression model that maps the hand points (4 points in total) to the hand crop predicted by the hand detection model. This is generic because it only needs training data, and it is probably the best area of interest because it would predict bounding boxes similar to the ones the keypoints model was trained on.
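The plane check in the second option could be sketched roughly as follows. This assumes the three pose points are available as 3D coordinates with z on a scale comparable to x and y, which may not hold for the raw pose landmarks. The function name and the 60 degree threshold are hypothetical and would need tuning on real data.

```python
import math


def hand_plane_tilt_deg(wrist, index_mcp, pinky_mcp):
    """Angle in degrees between the hand-plane normal and the camera z axis.

    0 means the hand plane faces the camera, 90 means it is seen edge-on.
    Returns None if the three points are (nearly) collinear.
    """
    # Two edge vectors of the triangle, both starting at the wrist.
    ax, ay, az = (index_mcp[k] - wrist[k] for k in range(3))
    bx, by, bz = (pinky_mcp[k] - wrist[k] for k in range(3))
    # Plane normal as the cross product of the two edges.
    nx = ay * bz - az * by
    ny = az * bx - ax * bz
    nz = ax * by - ay * bx
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length < 1e-9:
        return None
    # Use |nz| so the result does not depend on which way the palm faces.
    cos_tilt = min(1.0, abs(nz) / length)
    return math.degrees(math.acos(cos_tilt))


TILT_THRESHOLD_DEG = 60.0  # hypothetical cutoff, not taken from MediaPipe

facing_tilt = hand_plane_tilt_deg((0, 0, 0), (1, 0, 0), (0, 1, 0))
edge_on_tilt = hand_plane_tilt_deg((0, 0, 0), (1, 0, 0), (0, 0, 1))

for name, tilt in (("facing", facing_tilt), ("edge-on", edge_on_tilt)):
    use_alternative = tilt is None or tilt > TILT_THRESHOLD_DEG
    print(f"{name}: tilt={tilt:.0f} deg, alternative calculation: {use_alternative}")
```

The same tilt value could also serve as the trigger for the first option: above the threshold, skip the pose-based rect and fall back to the hand detection model.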
