google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

Any way to get all landmarks in one coordinate system? #4135

Open Dillxn opened 1 year ago

Dillxn commented 1 year ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

Yes

OS Platform and Distribution

Windows 10, Unity Engine

MediaPipe version

0.9.1

Bazel version

No response

Solution

Holistic

Programming Language and version

C#

Describe the actual behavior

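When polling the graph for hand_world_landmarks and pose_world_landmarks, they are returned in two different coordinate systems. The pose world landmarks are centered on the body (the midpoint between the hips), while each set of hand world landmarks is centered on its own hand (the hand's approximate geometric center).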

Describe the expected behaviour

I would like to get all the landmarks in the same coordinate system so I can rig a 3D model cohesively.
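One common client-side workaround, shown here as a minimal sketch rather than a MediaPipe calculator or API, is to translate each hand's world landmarks so that the hand's wrist (hand landmark 0) lands on the matching wrist in the pose world landmarks (pose landmarks 15 and 16). It assumes both landmark sets are in meters with the same axis orientation, so only their origins differ; any rotation or scale mismatch between the two models is ignored. Pairing the left hand with pose index 15 is an assumption based on how the Holistic graph derives hand ROIs from the pose. `Point3` and `attach_hand_to_pose` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# Indices from the MediaPipe Pose (33-point) and Hands (21-point) landmark maps.
POSE_LEFT_WRIST = 15
POSE_RIGHT_WRIST = 16
HAND_WRIST = 0


@dataclass(frozen=True)
class Point3:
    """Hypothetical plain container for one world landmark (meters)."""
    x: float
    y: float
    z: float


def attach_hand_to_pose(
    pose_world: List[Point3],
    hand_world: List[Point3],
    pose_wrist_index: int,
) -> List[Point3]:
    """Shift hand world landmarks so the hand wrist coincides with the pose wrist.

    ASSUMPTION: both sets share meter scale and axis orientation, so a pure
    translation is enough to express the hand in the pose's hip-centered frame.
    """
    anchor = pose_world[pose_wrist_index]
    wrist = hand_world[HAND_WRIST]
    # Offset that moves the hand-centered origin onto the pose's frame.
    dx, dy, dz = anchor.x - wrist.x, anchor.y - wrist.y, anchor.z - wrist.z
    return [Point3(p.x + dx, p.y + dy, p.z + dz) for p in hand_world]
```

In Unity the same translation could be applied in C# to the landmark lists before driving the rig; if the hand orientation also needs to match, a rotation step (for example, aligning the hand's wrist-to-middle-finger direction with the pose's forearm) would have to be added.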

Standalone code/steps you may have used to try to get what you need

No response

Other info / Complete Logs

No response

dantonioangela commented 1 year ago

Hi @Dillxn,

Can you share how you got the hand_world_landmarks in the Holistic solution? I would like to try it too, but I don't know where to start. I'm using the MediaPipeUnityPlugin to run the solution in C#.

eldog commented 1 year ago

You can use OpenCV's solvePnP to get the coordinates in world space; see my comment here for a working Python version: https://github.com/google/mediapipe/issues/2199#issuecomment-1443493391

It takes the "landmarks", which are in image coordinates, and the "world landmarks", which are 3D but not in world space, and finds the camera pose. You can then flip (invert) the camera pose it finds to place the world landmarks into a 3D space that lines up with the camera.

I don't know how to get solvePnP into Unity, but this C# port of OpenCV (Emgu CV) might help: https://www.emgu.com/wiki/index.php/Main_Page

kuaashish commented 1 year ago

@Dillxn, from the description it appears you are using C#, and currently MediaPipe does not provide any C# solution. Could you please share complete details about the issue and the platform you are using, such as Android, C++, Python, or another supported solution? Thank you!

Dillxn commented 1 year ago

Hi @kuaashish! I am using a plugin that wraps MediaPipe in a C# layer so it can be used in Unity. The plugin is found here. I'm mainly curious whether there is any calculator (or combination of calculators) I can use in my config file to get all the world landmarks in the same coordinate system, so that they all share the same origin (0, 0, 0) and the same world space.

Here is my current graph config:

# The following file is modified by ASL XR Team, WKU
# Copyright 2019 The MediaPipe Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Copied from mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt
#
# CHANGES:
#   - Add ImageTransformationCalculator and rotate the input
#   - Remove AnnotationOverlayCalculator

# Tracks and renders pose + hands + face landmarks.

# CPU image. (ImageFrame)
input_stream: "input_video"

output_stream: "pose_landmarks"
output_stream: "pose_world_landmarks"
output_stream: "segmentation_mask"
output_stream: "pose_roi"
output_stream: "pose_detection"
output_stream: "face_landmarks"
output_stream: "left_hand_landmarks"
output_stream: "right_hand_landmarks"
output_stream: "left_hand_world_landmarks"
output_stream: "right_hand_world_landmarks"

# Throttles the images flowing downstream for flow control. It passes through
# the very first incoming image unaltered, and waits for downstream nodes
# (calculators and subgraphs) in the graph to finish their tasks before it
# passes through another image. All images that come in while waiting are
# dropped, limiting the number of in-flight images in most part of the graph to
# 1. This prevents the downstream nodes from queuing up incoming images and data
# excessively, which leads to increased latency and memory usage, unwanted in
# real-time mobile applications. It also eliminates unnecessary computation,
# e.g., the output produced by a node may get dropped downstream if the
# subsequent nodes are still busy processing previous inputs.
node {
  calculator: "FlowLimiterCalculator"
  input_stream: "input_video"
  input_stream: "FINISHED:face_landmarks"
  input_stream_info: {
    tag_index: "FINISHED"
    back_edge: true
  }
  output_stream: "throttled_input_video"
  node_options: {
    [type.googleapis.com/mediapipe.FlowLimiterCalculatorOptions] {
      max_in_flight: 1
      max_in_queue: 1
      # Timeout is disabled (set to 0) as first frame processing can take more
      # than 1 second.
      in_flight_timeout: 0
    }
  }
}

node: {
  calculator: "ImageTransformationCalculator"
  input_stream: "IMAGE:throttled_input_video"
  input_side_packet: "ROTATION_DEGREES:input_rotation"
  input_side_packet: "FLIP_HORIZONTALLY:input_horizontally_flipped"
  input_side_packet: "FLIP_VERTICALLY:input_vertically_flipped"
  output_stream: "IMAGE:transformed_input_video"
}

node {
  calculator: "HolisticLandmarkCpu"
  input_stream: "IMAGE:transformed_input_video"
  input_side_packet: "MODEL_COMPLEXITY:model_complexity"
  input_side_packet: "SMOOTH_LANDMARKS:smooth_landmarks"
  input_side_packet: "REFINE_FACE_LANDMARKS:refine_face_landmarks"
  input_side_packet: "ENABLE_SEGMENTATION:enable_segmentation"
  input_side_packet: "SMOOTH_SEGMENTATION:smooth_segmentation"
  output_stream: "POSE_LANDMARKS:pose_landmarks"
  output_stream: "WORLD_LANDMARKS:pose_world_landmarks"
  output_stream: "SEGMENTATION_MASK:segmentation_mask_rotated"
  output_stream: "POSE_ROI:pose_roi"
  output_stream: "POSE_DETECTION:pose_detection"
  output_stream: "FACE_LANDMARKS:face_landmarks"
}

# Predicts left and right hand landmarks based on the initial pose landmarks.
node {
  calculator: "HandLandmarksLeftAndRightCpu"
  input_stream: "IMAGE:transformed_input_video"
  input_stream: "POSE_LANDMARKS:pose_landmarks"
  output_stream: "LEFT_HAND_LANDMARKS:left_hand_landmarks"
  output_stream: "RIGHT_HAND_LANDMARKS:right_hand_landmarks"
  output_stream: "LEFT_HAND_ROI_FROM_POSE:left_hand_roi_from_pose"
  output_stream: "RIGHT_HAND_ROI_FROM_POSE:right_hand_roi_from_pose"
}

# left hand world landmarks
node {
  calculator: "HandLandmarkCpu"
  input_stream: "IMAGE:transformed_input_video"
  input_stream: "ROI:left_hand_roi_from_pose"
  input_side_packet: "MODEL_COMPLEXITY:model_complexity"
  output_stream: "WORLD_LANDMARKS:left_hand_world_landmarks"
}

# right hand world landmarks
node {
  calculator: "HandLandmarkCpu"
  input_stream: "IMAGE:transformed_input_video"
  input_stream: "ROI:right_hand_roi_from_pose"
  input_side_packet: "MODEL_COMPLEXITY:model_complexity"
  output_stream: "WORLD_LANDMARKS:right_hand_world_landmarks"
}

node: {
  calculator: "ImageTransformationCalculator"
  input_stream: "IMAGE:segmentation_mask_rotated"
  input_side_packet: "ROTATION_DEGREES:output_rotation"
  input_side_packet: "FLIP_HORIZONTALLY:output_horizontally_flipped"
  input_side_packet: "FLIP_VERTICALLY:output_vertically_flipped"
  output_stream: "IMAGE:segmentation_mask"
}
kuaashish commented 1 year ago

Hi @ivan-grishchenko, could you please look into this issue? Thank you!