google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

How to process the output of pose detection and pose landmark models in a standalone c++ project #3532

Closed: UtsaChattopadhyay closed this issue 1 year ago

UtsaChattopadhyay commented 2 years ago

I am working on building a standalone C++ project for detecting pose. In the project, I am using two of the MediaPipe TFLite models (pose_detection and pose_landmark); the output dimensions of the models are attached below.

For pose detection, we have two outputs, (2254, 12) and (2254, 1). What do these values correspond to, and how do we do the post-processing on them? The MediaPipe webpage says that the output of the pose detector is similar to the face detector plus the human body center, radius, and rotation.

Similarly, for pose landmarks, we have 5 outputs: (1, 195), (1, 1), (1, 256, 256, 1), (1, 64, 64, 39), and (1, 117). We have understood that (1, 1) is a classifier score and (1, 256, 256, 1) is a segmentation mask; however, the other 3 outputs are not clear. Here, it says that the pose landmark model detects 33 landmarks in pixel and world space, where each landmark has 4 values (x, y, z, visibility). I am assuming the shapes should correspond to a total of 4 (x, y, z, visibility) × 33 (landmarks) × 2 (pixel and world space). Can you please let me know how to make sense of these two model outputs, and also the post-processing related to them?

Pose Detection Output: [attached image: pose_detection]

Pose Landmark Output: [attached image: pose_landmark]
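
For reference, here is a minimal C++ sketch of the SSD-style decode that MediaPipe's TensorsToDetectionsCalculator applies to this pair of tensors: each of the 2254 rows corresponds to one anchor, the 12 values per row are a box (center x, center y, width, height) followed by 4 keypoints × 2 coordinates, and the single value per row is a score logit. The input size of 224 and the 0.5 score threshold below are assumptions taken from the pose_detection graph options; verify them against your model.

```cpp
// Sketch of decoding pose_detection outputs, modeled on MediaPipe's
// TensorsToDetectionsCalculator. The constants (input size 224, 4 keypoints,
// threshold 0.5) are assumptions taken from the pose_detection graph config.
#include <cmath>
#include <cstddef>
#include <vector>

struct Anchor { float x_center, y_center, w, h; };  // normalized [0, 1]

struct Detection {
  float score;
  float xmin, ymin, width, height;  // normalized box
  float keypoints[4][2];            // 4 keypoints as (x, y), normalized
};

// raw_boxes:  num_anchors * 12 floats (cx, cy, w, h, then 4 keypoints x 2)
// raw_scores: num_anchors * 1 float  (logits; a sigmoid is applied below)
std::vector<Detection> DecodePoseDetections(const float* raw_boxes,
                                            const float* raw_scores,
                                            const std::vector<Anchor>& anchors,
                                            float input_size = 224.f,
                                            float score_threshold = 0.5f) {
  std::vector<Detection> out;
  for (std::size_t i = 0; i < anchors.size(); ++i) {
    const float score = 1.f / (1.f + std::exp(-raw_scores[i]));
    if (score < score_threshold) continue;
    const float* b = raw_boxes + i * 12;
    const Anchor& a = anchors[i];
    // Offsets are scaled by the input size and taken relative to the anchor.
    const float cx = b[0] / input_size * a.w + a.x_center;
    const float cy = b[1] / input_size * a.h + a.y_center;
    const float w = b[2] / input_size * a.w;
    const float h = b[3] / input_size * a.h;
    Detection d;
    d.score = score;
    d.xmin = cx - w / 2.f;
    d.ymin = cy - h / 2.f;
    d.width = w;
    d.height = h;
    for (int k = 0; k < 4; ++k) {
      d.keypoints[k][0] = b[4 + 2 * k] / input_size * a.w + a.x_center;
      d.keypoints[k][1] = b[5 + 2 * k] / input_size * a.h + a.y_center;
    }
    out.push_back(d);
  }
  // A (weighted) non-max-suppression pass still has to run on `out`.
  return out;
}
```

The anchors themselves are generated by MediaPipe's SsdAnchorsCalculator; strides of [8, 16, 32, 32, 32] over a 224×224 input produce exactly 2254 anchors. The four keypoints are understood to be the hip center, a full-body scale/rotation point, the shoulder center, and an upper-body scale point, which the graph uses to derive the landmark model's ROI.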

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 2 years ago

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 2 years ago

Are you satisfied with the resolution of your issue?

JesperStenberg commented 2 years ago

Hi @UtsaChattopadhyay, did you figure this out?

UtsaChattopadhyay commented 2 years ago

Yes, we did.

JesperStenberg commented 2 years ago

Do you mind sharing it regarding the pose_landmark model? We get a [1, 195] output from "Identity". I've come to understand that this represents 39 arrays of [x, y, z, visibility, presence] (33 points on the body + 6 extra points used for next-frame tracking).
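
Assuming that layout, a minimal sketch of the unpacking (normalizing by a 256×256 input crop and applying a sigmoid to visibility/presence are assumptions based on the full pose_landmark model):

```cpp
// Sketch of unpacking the [1, 195] "Identity" tensor under the assumed
// layout: 39 landmarks x (x, y, z, visibility, presence). Raw x/y/z are in
// pixels of the 256x256 input crop; visibility/presence are raw logits.
#include <array>
#include <cmath>

struct Landmark { float x, y, z, visibility, presence; };

std::array<Landmark, 39> UnpackLandmarks(const float* data /* 195 floats */) {
  std::array<Landmark, 39> lms;
  for (int i = 0; i < 39; ++i) {
    const float* p = data + i * 5;
    lms[i] = {
        p[0] / 256.f,                   // x, normalized to the crop
        p[1] / 256.f,                   // y, normalized to the crop
        p[2] / 256.f,                   // z, roughly on the same scale as x
        1.f / (1.f + std::exp(-p[3])),  // visibility: sigmoid of the logit
        1.f / (1.f + std::exp(-p[4])),  // presence: sigmoid of the logit
    };
  }
  return lms;  // indices 0-32: body landmarks; 33-38: auxiliary ROI points
}
```

The (1, 117) tensor from the same model presumably follows the same 39-landmark order with 3 values per landmark, i.e. the hip-centered world coordinates in meters.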

This works very well when the input scene is easy; x and y track perfectly to the image. But if the person is partially off screen, the tracking fails completely, which it doesn't do in the MediaPipe examples.

Do you have any insights?

EinePriseCode commented 1 year ago

@JesperStenberg or @UtsaChattopadhyay , did one of you figure it out for the pose_landmark model? Where did you find the information about 33 + 6 landmarks? Are those 6 at the end or the beginning of the array?

JesperStenberg commented 1 year ago

It was a while ago and I don't have it available, but I'm pretty sure that those 6 are at the end of the array. The thing that threw me off was that the person needs to be centred in the image for the model to work.

If you haven't checked this link yet, it might have some good info.
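
The off-center failures are expected if the raw frame is fed straight to the landmark model: in the MediaPipe graph, the detector's keypoints are first turned into a rotated, expanded square ROI so the person arrives centred and upright. A rough sketch of that step, mirroring what AlignmentPointsRectsCalculator plus RectTransformationCalculator do (the 1.25 expansion factor and the 90° target angle are assumptions taken from the pose tracking graph):

```cpp
// Sketch of deriving the landmark model's input ROI from two detector
// keypoints: the hip center and the full-body scale/rotation point.
// Assumes both keypoints are in a square (letterboxed) normalized space;
// the 1.25 expansion and 90-degree target angle are assumed graph options.
#include <cmath>

struct Roi { float cx, cy, size, rotation_rad; };  // normalized square ROI

Roi ComputePoseRoi(float hip_x, float hip_y,      // detector keypoint 0
                   float scale_x, float scale_y,  // detector keypoint 1
                   float expansion = 1.25f) {
  const float dx = scale_x - hip_x;
  const float dy = scale_y - hip_y;
  Roi roi;
  roi.cx = hip_x;
  roi.cy = hip_y;
  // The square side is twice the keypoint distance, expanded for margin.
  roi.size = 2.f * std::sqrt(dx * dx + dy * dy) * expansion;
  // Rotate so the hip -> scale-point direction points straight up; the
  // landmark model expects an upright, centred person in its crop.
  const float kTargetAngle = 1.5707963f;  // 90 degrees in radians
  roi.rotation_rad = kTargetAngle - std::atan2(-dy, dx);
  return roi;
}
```

The frame is then cropped and rotated by this ROI before the landmark model runs, and the resulting landmarks have to be mapped back through the inverse transform; skipping either step reproduces exactly the "fails when the person is off-center" behaviour.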

EinePriseCode commented 1 year ago

Thanks @JesperStenberg, that was an important hint. Unfortunately, I can't find any documentation that explains the output in more detail, which makes the implementation harder and less clean.

Lakshyadevelops commented 1 year ago

If we have a high enough frame rate, could we find the acceleration and velocity of not only the center but also the limbs? Random 3 a.m. idea; hoping to get some feedback and a minimum frame rate for usable results.
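
Nothing pose-specific is needed for that: with landmarks sampled at a steady frame rate, per-joint velocity and acceleration fall out of finite differences, as in the sketch below. In practice, smoothing tends to matter more than the frame rate, since differentiation amplifies landmark jitter badly; 30 fps with filtered landmarks is usually more usable than 60 fps raw.

```cpp
// Sketch: central finite differences over three consecutive (smoothed)
// landmark positions to estimate per-joint velocity and acceleration.
// Assumes a steady frame rate; jittery landmarks should be filtered first.
struct Vec3 { float x, y, z; };

struct Kinematics { Vec3 velocity, acceleration; };

Kinematics EstimateKinematics(const Vec3& prev, const Vec3& curr,
                              const Vec3& next, float fps) {
  const float dt = 1.f / fps;
  Kinematics k;
  // Central difference: v ~ (p[t+1] - p[t-1]) / (2 * dt)
  k.velocity = {(next.x - prev.x) / (2 * dt),
                (next.y - prev.y) / (2 * dt),
                (next.z - prev.z) / (2 * dt)};
  // Second difference: a ~ (p[t+1] - 2 * p[t] + p[t-1]) / (dt * dt)
  k.acceleration = {(next.x - 2 * curr.x + prev.x) / (dt * dt),
                    (next.y - 2 * curr.y + prev.y) / (dt * dt),
                    (next.z - 2 * curr.z + prev.z) / (dt * dt)};
  return k;
}
```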

kuaashish commented 1 year ago

Hello @UtsaChattopadhyay, We are upgrading the MediaPipe Legacy Solutions to the new MediaPipe Solutions. However, the libraries, documentation, and source code for all the MediaPipe Legacy Solutions will continue to be available in our GitHub repository and through library distribution services such as Maven and NPM.

You can continue to use those legacy solutions in your applications if you choose. However, we would encourage you to check out the new MediaPipe Solutions, which can help you more easily build and customize ML solutions for your applications. These new solutions will provide a superset of the capabilities available in the legacy solutions. Thank you

github-actions[bot] commented 1 year ago

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed due to lack of activity after being marked stale for the past 7 days.

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?