google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0

online web demo and pre 0.10.x api yield better hand detection than their python tasks api equivalent #5334

Closed matanox closed 4 months ago

matanox commented 5 months ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

Yes

OS Platform and Distribution

Ubuntu 22.04

MediaPipe Tasks SDK version

0.10.11

Task name and url

hand_landmarker, https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/1/hand_landmarker.task

Programming Language and version (e.g. C++, Python, Java)

python

Describe the actual behavior

Inference quality is not on par with the online demo

Describe the expected behaviour

Inference quality should ideally be as good as in the online demo


Description

Using the hand landmarker task in Python code, with the same threshold values and other task initialization values as the online demo's defaults at https://mediapipe-studio.webapps.google.com/demo/hand_landmarker, I am not getting the same success in hand detection in my Python code as in that online demo. (Switching between CPU and GPU maintains the same observed gap between the two.)
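
For reference, a minimal sketch of how I initialize the task; the confidence values mirror the studio demo's defaults, and the model path and video mode are from my setup:

import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

options = vision.HandLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(
        model_asset_path='hand_landmarker.task'),
    running_mode=vision.RunningMode.VIDEO,
    num_hands=2,
    # the studio demo's default confidence values
    min_hand_detection_confidence=0.5,
    min_hand_presence_confidence=0.5,
    min_tracking_confidence=0.5)

landmarker = vision.HandLandmarker.create_from_options(options)
# per frame: wrap the RGB array and pass a monotonically increasing timestamp
# mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
# result = landmarker.detect_for_video(mp_image, timestamp_ms)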

I haven't adapted the online web example to run on recorded video and store its predictions, which would create a reproducible example juxtaposing the predictions of the web example and my Python code on the same input. Yet it seems others have noted a similar experience as well.

So while I don't expect much, let's see if this is a recurring issue or what comes up.

One difference between the web example and my Python code is that the web example uses a somewhat cropped/zoomed region of my webcam's viewport, whereas in my code the full viewport provided by my camera (90-degree field of view) is the input. This difference in viewport size (angle) is, I'd say, only about 15%. It's also possible that the web example configures other webcam parameters differently, though nothing I can notice in how the images look.
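
To test whether the narrower viewport explains part of the gap, I could center-crop my frames before inference with something like this hypothetical helper (the 15% figure is my rough estimate):

import numpy as np

def center_crop(frame: np.ndarray, keep: float = 0.85) -> np.ndarray:
    # keep the central `keep` fraction of width and height, roughly
    # emulating the demo's narrower viewport
    h, w = frame.shape[:2]
    ch, cw = int(h * keep), int(w * keep)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    return frame[y0:y0 + ch, x0:x0 + cw]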

Happy for any advice; a link to the online demo's source might also be helpful.

kuaashish commented 4 months ago

Hi @matanox,

Both platforms, JavaScript and Python, use different types of graphs. In Python, the CPU graph is used by default, but switching to the GPU might give better performance. On the other hand, JavaScript uses the GPU graph, which usually leads to better performance.
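
For example, a sketch of selecting the GPU delegate in the Python Tasks API (the model path here is an assumption):

from mediapipe.tasks import python as mp_tasks

# request the GPU delegate instead of the default CPU one
base_options = mp_tasks.BaseOptions(
    model_asset_path='hand_landmarker.task',
    delegate=mp_tasks.BaseOptions.Delegate.GPU)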

Nevertheless, we acknowledge this is anticipated behavior. At present, our ability to address this matter is limited.

Thank you!!

matanox commented 4 months ago

Hi @kuaashish,

I'm afraid that's both incorrect (you can also select and switch to CPU on the JavaScript-based web demo) and does not go any way toward explaining the observed difference. (Sorry.)

matanox commented 4 months ago

Do you mean that the GPU graph is more accurate, or just that it's a bit faster? I had originally tested this proposition (only manually) by switching the web demo between CPU and GPU and could not see any difference in detection. Nor does the CPU option of the web demo demonstrate the same lack of detection power as the CPU option on Linux, when using the same thresholds.

Maybe as a first step, it might be worth checking which model and which computation graph the web demo is actually using, versus the Linux distribution of the project.

(It may be something also demonstrated in the other linked issue; it implies that the Python distribution might be broken in some way in v0.10.x relative to the quality of detection in previous versions, if the Python version is any priority for the project right now.)

The proposed underlying motivation is that in the new v0.10.x Tasks API, hand landmark detection is embarrassingly less potent than both the web demo and the pre-v0.10.x api, when using the same thresholds ...

kuaashish commented 4 months ago

Hi @matanox,

Following internal discussions with our team, we anticipate this behavior, and unfortunately, there is limited action we can take regarding this issue.

Thank you!!

matanox commented 4 months ago

@kuaashish thank you, it is very much appreciated that you circled back to this despite its possibly inconvenient nature.

I would just like to add that it would be highly appreciated, or just nice, if the ability to continue using the 'legacy' mediapipe api alongside version 0.10.x were preserved in the next releases, just as the 'legacy' api still works in the 0.10.x versions up to at least 0.10.11.

I.e. 0.10.x can be seen as dual-headed: the api predating 0.10.x still works when importing 0.10.x, using code like this:

# Copyright 2020 The MediaPipe Authors.
# this is the source of the mediapipe.solutions.hands.Hands class ―
# modified by us to return more information for each frame:
# the various detection and region of interest rectangles

import numpy as np
from typing import NamedTuple
from mediapipe.python.solution_base import SolutionBase

compiled_pipeline_graph = 'mediapipe/modules/hand_landmark/hand_landmark_tracking_cpu.binarypb'

class HandsInference(SolutionBase):

  def __init__(self,
               static_image_mode=False,
               max_num_hands=2,
               model_complexity=1,
               min_detection_confidence=0.5,
               min_tracking_confidence=0.5):

    """ Initializes the MediaPipe Hands Pipeline.

    Args:
      static_image_mode: Whether to treat the input images as a batch of static
        and possibly unrelated images, or a video stream. See details in
        https://solutions.mediapipe.dev/hands#static_image_mode.
      max_num_hands: Maximum number of hands to detect. See details in
        https://solutions.mediapipe.dev/hands#max_num_hands.
      model_complexity: Complexity of the hand landmark model: 0 or 1.
        Landmark accuracy as well as inference latency generally go up with the
        model complexity. See details in
        https://solutions.mediapipe.dev/hands#model_complexity.
      min_detection_confidence: Minimum confidence value ([0.0, 1.0]) for hand
        detection to be considered successful. See details in
        https://solutions.mediapipe.dev/hands#min_detection_confidence.
      min_tracking_confidence: Minimum confidence value ([0.0, 1.0]) for the
        hand landmarks to be considered tracked successfully. See details in
        https://solutions.mediapipe.dev/hands#min_tracking_confidence.
    """

    super().__init__(
        binary_graph_path=compiled_pipeline_graph,
        side_inputs={
            'model_complexity': model_complexity,
            'num_hands': max_num_hands,
            'use_prev_landmarks': not static_image_mode,
        },
        calculator_params={
            'palmdetectioncpu__TensorsToDetectionsCalculator.min_score_thresh':
                min_detection_confidence,
            'handlandmarkcpu__ThresholdingCalculator.threshold':
                min_tracking_confidence,
        },

        # Here below we have added the last three elements, making the pipeline
        # return those extra elements from each call to `process()` for more
        # visibility into the tracking process.
        # Has the naming or meaning changed? https://github.com/google/mediapipe/commit/b1f93b3b2785b5e056bc31b11342b660659688f6
        outputs=[
            'multi_hand_landmarks',
            'multi_hand_world_landmarks',
            'multi_handedness',
        ])

  def process(self, image: np.ndarray) -> NamedTuple:
    return super().process(input_data={'image': image})
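
And a minimal usage sketch of the wrapper above (the OpenCV webcam loop is illustrative, not part of the original class):

import cv2

hands = HandsInference(max_num_hands=2,
                       min_detection_confidence=0.5,
                       min_tracking_confidence=0.5)
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # the legacy graph expects RGB input; OpenCV captures BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        print(f'{len(results.multi_hand_landmarks)} hand(s) detected')
cap.release()
hands.close()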

I can only guess that 0.10 took a somewhat diverging path in its hand landmarks pipeline, perhaps optimizing toward slightly different goals than the pre-0.10 pipeline, at the cost of lesser performance specifically in Python; using the old api restores the same good performance.

It would be nice to keep this dual-headedness in follow-up releases, as it seems to keep the detection performance from Python on par with the JavaScript api.

kuaashish commented 4 months ago

Hi @matanox,

You can now utilize the legacy solutions, since we have addressed them in the latest release, version 0.10.13. Could you please give them a try and inform us of the status?

Thank you!!

matanox commented 4 months ago

Yes, they are working in 0.10.13, same as they did in all 0.10.x versions. The so-called legacy api (and the online web demo) yields better detection results than the new Tasks API, which might be optimized toward different goals.
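
(Verified with a quick smoke test along these lines; just an import-and-run check, not an accuracy comparison:)

import mediapipe as mp

print(mp.__version__)  # 0.10.13
# the legacy solutions path still imports and runs
with mp.solutions.hands.Hands(max_num_hands=2) as hands:
    pass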

kuaashish commented 4 months ago

Hi @matanox,

Yes, we acknowledge your observations. However, we no longer maintain the legacy solutions, as our current priority is to continuously enhance our Tasks API.

Thank you!!

matanox commented 4 months ago

Yes, that's acknowledged. It would only be nice to have this regression in the Task API hand tracking from Python, relative to the legacy hands solution, alleviated. Either way, thank you.

kuaashish commented 4 months ago

Hi @matanox,

Can we mark this issue as resolved internally and close it? There is not much more we can do about it at this stage, but we will continue to improve our Task API over time.

Thank you!!

matanox commented 4 months ago

Well, I'm not sure, as it is still open: the Task API for hand detection is inferior to (or catering to different optimization goals than) the old one and the online demo alike. Let's close it just because it's a "won't fix" situation, if I understand correctly, though it leaves me a little uncomfortable in case the option to use the legacy hands computation graph is removed from future releases while things still stand as they currently do.

kuaashish commented 4 months ago

Hi @matanox,

You are correct. This issue is unlikely to be fixed. I will mark it as such internally; as already communicated, our focus is on improving our Tasks APIs and we cannot do much about the legacy solutions.

Thank you!!

google-ml-butler[bot] commented 4 months ago

Are you satisfied with the resolution of your issue?