google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

Accuracy in new Tasks is much lower than older Solutions #5559

Closed: duncanhall closed this issue 1 month ago

duncanhall commented 1 month ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

Yes

OS Platform and Distribution

macOS Ventura 13.2.1

MediaPipe Tasks SDK version

0.10.14

Task name (e.g. Image classification, Gesture recognition etc.)

Hand Landmarker

Programming Language and version (e.g. C++, Python, Java)

Python

Describe the actual behavior

Using HandLandmarker.detect() gives poor results compared to mediapipe.solutions.hands.process()

Describe the expected behaviour

The task detect / detect_async methods should return equal or better results than the old solutions

Standalone code/steps you may have used to try to get what you need

The first example below uses mediapipe.solutions.hands to detect multiple hands at once and draw the landmarks on top of the video frame. The results are very good: multiple hands are recognized and the landmarks are placed with consistently high accuracy.

import mediapipe as mp
import cv2

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
capture = cv2.VideoCapture(0)

with mp_hands.Hands(min_detection_confidence=0.8, min_tracking_confidence=0.5) as hands:
  while capture.isOpened():
      ret, frame = capture.read()
      if not ret:
          # Stop if the camera returns no frame.
          break
      frame = cv2.flip(frame, 1)
      image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
      detected_image = hands.process(image)
      image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

      if detected_image.multi_hand_landmarks:
          for hand_lms in detected_image.multi_hand_landmarks:
              mp_drawing.draw_landmarks(image, hand_lms,
                                        mp_hands.HAND_CONNECTIONS,
                                        landmark_drawing_spec=mp.solutions.drawing_utils.DrawingSpec(
                                            color=(255, 0, 255), thickness=4, circle_radius=2),
                                        connection_drawing_spec=mp.solutions.drawing_utils.DrawingSpec(
                                            color=(20, 180, 90), thickness=2, circle_radius=2)
                                        )

      cv2.imshow('Webcam', image)

      if cv2.waitKey(1) & 0xFF == ord('q'):
          break

capture.release()
cv2.destroyAllWindows()

The second example below uses HandLandmarker.detect to get the landmarker results before drawing them. The drawing method is slightly different in this example, but the actual results are very poor. HandLandmarker.detect very often detects no hands at all, mostly showing just a flicker of the landmarks before they disappear again. I have never seen this method detect more than one hand.


import mediapipe as mp
from mediapipe import solutions
from mediapipe.framework.formats import landmark_pb2
import numpy as np
import cv2 

MARGIN = 10  # pixels
FONT_SIZE = 1
FONT_THICKNESS = 1
HANDEDNESS_TEXT_COLOR = (88, 205, 54) # vibrant green

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

def draw_hand_landmarks_on_image(rgb_image, detection_result):
  hand_landmarks_list = detection_result.hand_landmarks
  handedness_list = detection_result.handedness
  annotated_image = np.copy(rgb_image)

  # Loop through the detected hands to visualize.
  for idx in range(len(hand_landmarks_list)):
    hand_landmarks = hand_landmarks_list[idx]
    handedness = handedness_list[idx]

    # Draw the hand landmarks.
    hand_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
    hand_landmarks_proto.landmark.extend([
      landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z) for landmark in hand_landmarks
    ])
    solutions.drawing_utils.draw_landmarks(
      annotated_image,
      hand_landmarks_proto,
      solutions.hands.HAND_CONNECTIONS,
      solutions.drawing_styles.get_default_hand_landmarks_style(),
      solutions.drawing_styles.get_default_hand_connections_style())

    # Get the top left corner of the detected hand's bounding box.
    height, width, _ = annotated_image.shape
    x_coordinates = [landmark.x for landmark in hand_landmarks]
    y_coordinates = [landmark.y for landmark in hand_landmarks]
    text_x = int(min(x_coordinates) * width)
    text_y = int(min(y_coordinates) * height) - MARGIN

    # Draw handedness (left or right hand) on the image.
    cv2.putText(annotated_image, f"{handedness[0].category_name}",
                (text_x, text_y), cv2.FONT_HERSHEY_DUPLEX,
                FONT_SIZE, HANDEDNESS_TEXT_COLOR, FONT_THICKNESS, cv2.LINE_AA)

  return annotated_image

options = HandLandmarkerOptions(
  base_options=BaseOptions(model_asset_path='tasks/hand_landmarker.task'),
  num_hands = 2,
  min_hand_detection_confidence = 0.8,
  min_tracking_confidence = 0.5,
  min_hand_presence_confidence = 0.5
)

video = cv2.VideoCapture(0)

with HandLandmarker.create_from_options(options) as hand_landmarker:
  while video.isOpened():
    ret, frame = video.read() 
    frame = cv2.flip(frame, 1)

    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
    result = hand_landmarker.detect(mp_image)
    a_img = draw_hand_landmarks_on_image(mp_image.numpy_view(), result)

    cv2.imshow('Webcam', a_img)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

video.release()
cv2.destroyAllWindows()

Not only would I expect HandLandmarker.detect to produce better results, but the poor detection also makes it very hard to use mp.tasks.vision.RunningMode.LIVE_STREAM as suggested in the latest docs.
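For context, here is a minimal sketch of how LIVE_STREAM mode is usually wired up: results arrive through a result_callback, and each frame is submitted with detect_async together with a strictly increasing timestamp in milliseconds. The helper names (TimestampClock, LatestResult, run_live_stream) are hypothetical and not part of MediaPipe. The model path is the one used in the example above, and the sketch assumes the frame is converted to RGB before it is wrapped in mp.Image.

```python
import threading
import time


class TimestampClock:
    """Hypothetical helper: hands out strictly increasing millisecond timestamps."""

    def __init__(self):
        self._start = time.monotonic()
        self._last = -1

    def next_ms(self):
        now = int((time.monotonic() - self._start) * 1000)
        # Two frames can fall into the same millisecond; bump by one so the
        # sequence stays strictly increasing.
        self._last = max(now, self._last + 1)
        return self._last


class LatestResult:
    """Hypothetical helper: thread-safe holder for the most recent result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def on_result(self, result, output_image, timestamp_ms):
        # Called by the landmarker from its own thread.
        with self._lock:
            self._value = result

    def get(self):
        with self._lock:
            return self._value


def run_live_stream(model_path='tasks/hand_landmarker.task'):
    # Imported lazily so the helpers above work without a camera or model.
    import cv2
    import mediapipe as mp

    latest = LatestResult()
    clock = TimestampClock()
    options = mp.tasks.vision.HandLandmarkerOptions(
        base_options=mp.tasks.BaseOptions(model_asset_path=model_path),
        running_mode=mp.tasks.vision.RunningMode.LIVE_STREAM,
        num_hands=2,
        result_callback=latest.on_result,
    )
    video = cv2.VideoCapture(0)
    try:
        with mp.tasks.vision.HandLandmarker.create_from_options(options) as landmarker:
            while video.isOpened():
                ok, frame = video.read()
                if not ok:
                    break
                frame = cv2.flip(frame, 1)
                # OpenCV delivers BGR; the landmarker expects RGB.
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
                landmarker.detect_async(mp_image, clock.next_ms())

                # The stored result may lag the current frame, or be None at first.
                result = latest.get()
                hand_count = len(result.hand_landmarks) if result else 0
                cv2.putText(frame, f"hands: {hand_count}", (10, 30),
                            cv2.FONT_HERSHEY_DUPLEX, 1, (88, 205, 54), 1)
                cv2.imshow('Webcam', frame)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
    finally:
        video.release()
        cv2.destroyAllWindows()
```

Calling run_live_stream() starts the webcam loop. Because the callback fires asynchronously, the displayed hand count can trail the frame on screen by one or more frames.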

Some of this may be due to the docs, source code and guides not being kept in sync with each other, but so far I'm unable to match the quality of results from the old solutions using the latest suggested methods.

duncanhall commented 1 month ago

In the second example I had forgotten to convert the frame from OpenCV's BGR to RGB before processing it 🤦

With the correct color space and some tweaking of the confidence levels, I'm able to get the expected results with HandLandmarker.detect.
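A minimal sketch of the corrected loop, assuming the fix amounts to converting each frame to RGB before building the mp.Image and converting back to BGR for display. The helper names bgr_to_rgb and run_detect_loop are hypothetical, draw_fn stands for draw_hand_landmarks_on_image from the second example, and the confidence values are placeholders because the thread does not say which values worked.

```python
import numpy as np


def bgr_to_rgb(frame_bgr):
    """Reverse the channel order of an HxWx3 frame (BGR <-> RGB).

    For 3-channel frames this matches cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).
    ascontiguousarray returns a real copy instead of a strided view.
    """
    return np.ascontiguousarray(frame_bgr[:, :, ::-1])


def run_detect_loop(draw_fn, model_path='tasks/hand_landmarker.task'):
    # Imported lazily so bgr_to_rgb can be used without a camera or model.
    import cv2
    import mediapipe as mp

    options = mp.tasks.vision.HandLandmarkerOptions(
        base_options=mp.tasks.BaseOptions(model_asset_path=model_path),
        num_hands=2,
        # Placeholder thresholds: tune these for your camera and lighting.
        min_hand_detection_confidence=0.5,
        min_hand_presence_confidence=0.5,
    )
    video = cv2.VideoCapture(0)
    try:
        with mp.tasks.vision.HandLandmarker.create_from_options(options) as hand_landmarker:
            while video.isOpened():
                ok, frame = video.read()
                if not ok:
                    break
                # The missing step in the original example: BGR -> RGB.
                rgb = bgr_to_rgb(cv2.flip(frame, 1))
                mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
                result = hand_landmarker.detect(mp_image)

                # draw_fn returns an RGB image; imshow expects BGR.
                annotated_rgb = draw_fn(rgb, result)
                cv2.imshow('Webcam', bgr_to_rgb(annotated_rgb))
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
    finally:
        video.release()
        cv2.destroyAllWindows()
```

With the second example's drawing function in scope, run_detect_loop(draw_hand_landmarks_on_image) starts the loop. Note that detect() treats every frame as an independent image. As far as the documentation describes, min_tracking_confidence only takes effect in the video and live stream running modes, which is why it is left out here.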