google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://ai.google.dev/edge/mediapipe
Apache License 2.0
27.4k stars 5.15k forks

I can't correctly output the landmarks using the hand recognition module #5001

Closed CestbonWind closed 11 months ago

CestbonWind commented 11 months ago

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

None

OS Platform and Distribution

Linux Ubuntu 18.04 LTS

MediaPipe Tasks SDK version

0.10.8

Task name (e.g. Image classification, Gesture recognition etc.)

hand recognition

Programming Language and version (e.g. C++, Python, Java)

Python

Describe the actual behavior

Calling the hand landmarker with the GPU delegate does not return landmarks correctly

Describe the expected behaviour

Get hand landmarks

Standalone code/steps you may have used to try to get what you need

import time

import cv2
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision


def test(video_path):
    options = vision.HandLandmarkerOptions(
        base_options=python.BaseOptions(
            model_asset_path='PreTrainedModel/hand_landmarker.task',
            delegate=python.BaseOptions.Delegate.GPU),
        running_mode=vision.RunningMode.VIDEO,
        num_hands=2)
    landmarker = vision.HandLandmarker.create_from_options(options)

    cap = cv2.VideoCapture(video_path, cv2.CAP_ANY)
    pTime = 0
    while cap.isOpened():
        success, img = cap.read()
        if not success:
            break
        # OpenCV decodes frames as BGR; convert to RGB before wrapping in mp.Image.
        rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
        frame_timestamp_ms = int(cap.get(cv2.CAP_PROP_POS_MSEC))
        result = landmarker.detect_for_video(mp_image, frame_timestamp_ms)

        # Compute and draw the FPS value.
        cTime = time.time()
        fps = 1 / (cTime - pTime)
        pTime = cTime
        cv2.putText(img, str(int(fps)), (70, 50), cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 0), 3)

    # Release resources.
    cap.release()
    cv2.destroyAllWindows()
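One pitfall with the VIDEO running mode in the code above: detect_for_video expects monotonically increasing timestamps, and cv2.CAP_PROP_POS_MSEC can report duplicate or zero values for some containers and backends. A small pure-Python guard (a sketch, not part of the original code) keeps the timestamps strictly increasing:

```python
def monotonic_timestamps(raw_ms_values):
    """Yield strictly increasing integer timestamps (in ms).

    If the decoder reports a duplicate or backwards timestamp, bump it
    1 ms past the previous one so detect_for_video() does not reject it.
    """
    last = -1
    for raw in raw_ms_values:
        ts = max(int(raw), last + 1)
        last = ts
        yield ts
```

In the loop above, the values from cap.get(cv2.CAP_PROP_POS_MSEC) would be fed through this generator instead of being cast directly.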

Other info / Complete Logs

I used the code example in the official documentation for hand recognition, but encountered some problems. During debugging, I found that the landmarks were not recognized correctly. There is no such problem with the gesture recognition module example, and calling the legacy mp.solutions.hands.Hands.process() method also recognizes hands correctly, but it cannot use the GPU to accelerate processing.
CestbonWind commented 11 months ago

Here is my issue: I used the code example in the official documentation for hand recognition, but encountered some problems. During debugging, I found that the landmarks were not recognized correctly. There is no such problem with the gesture recognition module example, and calling the legacy mp.solutions.hands.Hands.process() method also recognizes hands correctly, but it cannot use the GPU to accelerate processing.

I'm sorry, I wrote the question in the log field by mistake.

CestbonWind commented 11 months ago

I have solved this problem. The error is due to a new parameter, min_hand_presence_confidence, introduced in the new version. Its default value is 0.5, which results in fewer recognitions. This parameter does not exist in the old version. If you change it to 0, you get the same recognition rate as the old version.

But I still don't know what this parameter actually means.

google-ml-butler[bot] commented 11 months ago

Are you satisfied with the resolution of your issue?

matanox commented 6 months ago

@CestbonWind have you methodically tested that a value of zero matches the detection performance of the previous versions? I have noticed that the new version (0.10.x) is less confident in detecting hands entering the scene. From the documentation, min_hand_detection_confidence seems to play the major part in that, since it governs the palm detection that runs before any landmarks have been found on a previous frame, or after tracking is lost.

A value of zero, however, seems to make the inference time of the pipeline very slow, so I believe that value should be avoided. When no hand was detected in the previous frame, palm detection seems to take 150 milliseconds (±20) rather than the ~33 milliseconds it would otherwise take. Setting it to 0.01 rather than absolute zero avoids that performance spike.
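To check that spike yourself, a tiny hypothetical helper (not part of MediaPipe) can time each detect_for_video call:

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return result, elapsed_ms
```

Wrapping landmarker.detect_for_video(mp_image, ts) with it and logging elapsed_ms per frame should show the spike on frames right after tracking loss.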

min_hand_presence_confidence, at least by its name, seems to have the same meaning as min_detection_confidence, which was also there before 0.10.x. So it's not really a new argument, unlike the other one, which probably is.

I have yet to file a minimal reproduction of that latency issue as a separate issue.

matanox commented 6 months ago

Anyway, you can still use the old API even with mediapipe 0.10.x; out of the box, it's more accurate for some use cases.