google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

Hand tracking model not working #428

Closed lanreolokoba closed 4 years ago

lanreolokoba commented 4 years ago

Hi all. We've been trying to implement the hand tracking model from MediaPipe in our project, which uses TensorFlow Lite on iOS and Android. We use TF Lite directly instead of going through MediaPipe. We're feeding camera frames from our app to the model, performing downsampling and center cropping before handing the image over to the TFLite model. For some reason, we can't get the tracking to work.
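
For reference, here is a minimal sketch of that preprocessing path, assuming the hand_landmark model takes a 256x256 RGB float input; the normalization range ([0, 1] here) and the ordering of the output tensors are assumptions, not something confirmed in this thread:

```python
# Sketch: center-crop a camera frame, resize to 256x256, and run the
# hand_landmark TFLite model. File name and tensor ordering are assumptions.
import numpy as np
import tensorflow as tf
from PIL import Image

def center_crop_square(img: Image.Image) -> Image.Image:
    # Crop the largest centered square from the camera frame.
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side))

frame = Image.open("hand_frame.jpg").convert("RGB")          # hypothetical input frame
square = center_crop_square(frame).resize((256, 256), Image.BILINEAR)
input_data = np.asarray(square, dtype=np.float32)[None, ...] / 255.0  # assumed [0, 1] range

interpreter = tf.lite.Interpreter(model_path="hand_landmark.tflite")
interpreter.allocate_tensors()
interpreter.set_tensor(interpreter.get_input_details()[0]["index"], input_data)
interpreter.invoke()

# Output ordering (landmarks vs. hand-presence score) is an assumption here.
outputs = interpreter.get_output_details()
landmarks = interpreter.get_tensor(outputs[0]["index"])
hand_flag = interpreter.get_tensor(outputs[1]["index"])
print("hand presence score:", hand_flag)
```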

To check whether detection was successful, we check the value in the success tensor and threshold it based on the value from the pbtxt (see this). For some reason, the confidence score is always extremely low, on the order of 1e-10 to 1e-16. Has anyone faced a similar issue? Any pointers?

fanzhanggoogle commented 4 years ago

Could you provide more detail about the issue? From what I understand, the detector doesn't detect anything? The output of the model has 2944 anchors, and we use TfLiteTensorsToDetectionsCalculator to decode the anchors and apply a sigmoid function on the score; maybe that is the issue?
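
For illustration, a minimal sketch of the score handling described above, assuming the raw score tensor holds logits that still need a sigmoid before thresholding; the tensor shape and the 0.7 threshold are illustrative assumptions, not values taken from the graph:

```python
# Sketch: apply a sigmoid to raw per-anchor score logits before thresholding.
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-in for the (2944,) raw score output of the palm detector;
# a single hand-presence logit from the landmark model would be handled the same way.
raw_scores = np.random.randn(2944).astype(np.float32)

probs = sigmoid(raw_scores)
detections = probs > 0.7        # threshold value here is illustrative only
print("anchors above threshold:", int(detections.sum()))
```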

lanreolokoba commented 4 years ago

This is the graph I'm referring to: https://github.com/google/mediapipe/blob/d144e564d8f48737f1bf684ee741c9ccf6a5909d/mediapipe/graphs/hand_tracking/subgraphs/hand_landmark_gpu.pbtxt#L49-L84

We are using the hand_landmark model as-is, feeding it image data to get the 21 landmark vectors. The second output tensor, which represents the success of the detection operation, is always a number very close to zero, no matter what. We're not sure what we could be missing. I appreciate any help you can provide.

fanzhanggoogle commented 4 years ago

Can you post an example image of your input? The input image should be a hand rotated to a canonical angle to get a correct result. We use the keypoints (more specifically the wrist and the MCP of the middle finger) from the hand detection to rotate the hand.

KetakiS14 commented 4 years ago

Yes, here is an example image we have tried (attached: TFLite_input). @fanzhanggoogle thank you for your help!

lanreolokoba commented 4 years ago

Can you post an example image of your input? The input image should be a hand rotated to a canonical angle to get a correct result.

I'm a bit confused. It sounds like the hand_tracking.tflite doesn't detect hands in an image unless the hand is rotated to a canonical position. Is this correct? If so, how do you compute this rotation given an image?

We use the keypoints (more specifically the wrist and the MCP of the middle finger) from the hand detection to rotate the hand.

We don't have key points; we're hoping to detect key points using the model itself (chicken and egg problem?). I guess on a broader level, what are the specific requirements of the model? The canonical rotation detail isn't mentioned in the model card or anywhere easily accessible. We're trying to build a minimal hand detection pipeline that takes in an image and spits out hand key points.

mcclanahoochie commented 4 years ago

Have you seen the blog post about the MediaPipe hand models? https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html That will give you a better idea of how the two models are used for hand tracking.
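
For illustration, a rough sketch of the two-stage flow the blog post describes; the helper names here (run_palm_detector, rotate_and_crop, run_landmark_model) are hypothetical stand-ins for the actual TFLite inference code, not MediaPipe APIs:

```python
# Sketch: palm detection on the full frame, then the landmark model on an
# aligned crop. The helpers are stubs marking where each model runs.
import numpy as np

def run_palm_detector(frame: np.ndarray):
    """Stage 1: palm detection on the full frame -> palm box + a few keypoints."""
    raise NotImplementedError  # TFLite palm_detection inference goes here

def rotate_and_crop(frame: np.ndarray, box, keypoints) -> np.ndarray:
    """Rotate the hand to the canonical orientation and crop to 256x256."""
    raise NotImplementedError

def run_landmark_model(hand_crop: np.ndarray):
    """Stage 2: hand_landmark model on the aligned crop -> 21 landmarks + presence score."""
    raise NotImplementedError

def track_hand(frame: np.ndarray):
    box, keypoints = run_palm_detector(frame)
    crop = rotate_and_crop(frame, box, keypoints)
    return run_landmark_model(crop)
```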

lanreolokoba commented 4 years ago

Have you seen the blog post about the MediaPipe hand models? https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html That will give you a better idea of how the two models are used for hand tracking.

Yes I have. Apparently the palm detection model does sparse detection so that the hand tracking model can be given a hand image that is axis-aligned. In the image @KetakiS14 posted above, the hand is already axis-aligned (it's perfectly vertical). We still get almost zero confidence:

(screenshot attached: Screen Shot 2020-02-06 at 11 17 45 AM)

fanzhanggoogle commented 4 years ago

Hi, I'm a bit confused here. If you are only running the hand landmark model, the input should be a 256x256 crop of a single hand, e.g. something like:

(screenshot attached: Screen Shot 2020-02-06 at 8 28 55 PM)

lanreolokoba commented 4 years ago

Hi, I'm a bit confused here. If you are only running the hand landmark model, the input should be a 256x256 crop of a single hand, e.g. something like:

(screenshot attached: Screen Shot 2020-02-06 at 8 28 55 PM)

Yes, we already did this. We've tested on our own hands and on images of hands found online, and still get near-zero confidence.

fanzhanggoogle commented 4 years ago

Our landmark model is trained on a unique canonical pose of the hand, so I would definitely recommend using the detector to rotate the hand to a good position where the landmark model can work correctly. The following image is a correctly rotated input from the detector that should work. (attached: hand-rotated)
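
For illustration, a minimal sketch of that rotation step, assuming the wrist and middle-finger MCP keypoints are already available from the palm detector; the target orientation (fingers pointing up), the keypoint values, and the file name are assumptions for the example:

```python
# Sketch: compute the rotation that aligns the wrist->middle-MCP vector
# with "fingers up", then rotate and resize the hand crop to 256x256.
import math
from PIL import Image

def rotation_to_canonical_deg(wrist_xy, middle_mcp_xy) -> float:
    """CCW rotation (degrees) that makes the wrist->MCP vector point straight up."""
    dx = middle_mcp_xy[0] - wrist_xy[0]
    dy = middle_mcp_xy[1] - wrist_xy[1]
    angle = math.degrees(math.atan2(-dy, dx))  # image y axis points down
    return 90.0 - angle                        # 90 degrees == fingers up

wrist, middle_mcp = (120.0, 200.0), (135.0, 90.0)   # hypothetical detector keypoints
crop = Image.open("hand_crop.jpg")                   # hypothetical crop around the hand
aligned = crop.rotate(rotation_to_canonical_deg(wrist, middle_mcp),
                      resample=Image.BILINEAR, expand=True).resize((256, 256))
aligned.save("hand_canonical.png")
```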