vietanhdev / tf-blazepose

BlazePose - Super fast human pose detection on Tensorflow 2.x

Inconsistent with the original Blazepose full model #3

Open · jizhu1023 opened this issue 3 years ago

jizhu1023 commented 3 years ago

Thanks for your great work! I have some questions about the network structure. Comparing blazepose_full.py with the visualization of the original tflite model, I found some differences. First, your implementation omits the "identity_1" output present in the original tflite model. Second, the "identity_2" output size is 156, i.e. 4 × (33 + 6), but the corresponding output size in your implementation is 99, i.e. 3 × 33. Why is your implementation inconsistent with the original model in these respects? And why is the joint output size 156 in the original model? Many thanks in advance.
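For reference, the output tensors of the original tflite model can be listed with tf.lite.Interpreter; a minimal sketch (the model path below is a placeholder, not the exact file name):

```python
import tensorflow as tf

# Placeholder path: point this at the official BlazePose full-body tflite file.
interpreter = tf.lite.Interpreter(model_path="pose_landmark_full_body.tflite")
interpreter.allocate_tensors()

# Prints the name and shape of each output tensor, e.g. the 156-value
# landmark vector ("identity_2") and the scalar score ("identity_1").
for detail in interpreter.get_output_details():
    print(detail["name"], detail["shape"])
```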

vietanhdev commented 3 years ago

Hello! Our implementation is a modified version of the original model. First, for identity_1: we don't know the exact purpose of this branch. The architecture is designed for tracking, so we guess that this branch predicts whether there is a person in the image. I verified this assumption by running the pre-trained model with the following code:

```python
import tensorflow as tf
import cv2
import numpy as np

# Load the pre-trained full-body model (39 keypoints).
model = tf.keras.models.load_model('saved_model_full_pose_landmark_39kp')
cap = cv2.VideoCapture(0)

while True:
    # Grab a webcam frame and preprocess it for the network.
    _, origin = cap.read()
    img = cv2.resize(origin, (256, 256))
    img = img.astype(float)
    img = (img - 127) / 255  # rough normalization
    img = np.array([img])    # add the batch dimension

    # The model has three outputs; `classify` is the suspected
    # person-presence branch ("identity_1").
    heatmap, classify, regress = model.predict(img)
    confidence = np.reshape(classify, (1,))[0]
    print(confidence)
```

For identity_2: as explained here, the original model outputs 4 values for each keypoint:

- x and y: Landmark coordinates normalized to [0.0, 1.0] by the image width and height, respectively.
- z: Should be discarded, as the model is currently not fully trained to predict depth, but this is something on the roadmap.
- visibility: A value in [0.0, 1.0] indicating the likelihood of the landmark being visible (present and not occluded) in the image.

That's why their output size is 4 × number_of_keypoints. In the pre-trained model we used to implement this repo, number_of_keypoints = 39, so there are 4 × 39 = 156 outputs. I removed the z dimension from the keypoints, so the shape of our output is 3 × number_of_keypoints.
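Concretely, the 156-value vector can be reshaped into one row per keypoint, and dropping z gives the 3-value layout used in this repo; a minimal sketch (`regress` stands in for the raw landmark output, e.g. from model.predict above):

```python
import numpy as np

# Placeholder for the raw landmark output of the original model, shape (1, 156).
regress = np.zeros((1, 156), dtype=np.float32)

keypoints = regress.reshape(39, 4)        # one row of (x, y, z, visibility) per keypoint
keypoints_no_z = keypoints[:, [0, 1, 3]]  # drop z -> shape (39, 3)
print(keypoints.shape, keypoints_no_z.shape)
```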

Another difference between our model and the original one is that our heatmap output has shape (128, 128, number_of_keypoints), while the original model's heatmap has shape (128, 128, 1). We use the heatmap output to obtain the keypoints. We will revisit this design in the future.
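For reference, a common way to decode keypoint locations from a per-keypoint heatmap is an argmax over each channel; a minimal sketch, assuming a heatmap of shape (128, 128, number_of_keypoints), as an illustration rather than the exact decoding code in this repo:

```python
import numpy as np

def decode_heatmap(heatmap):
    """Return a (num_keypoints, 2) array of (x, y) coordinates by taking
    the argmax of each (H, W) channel of the heatmap."""
    num_keypoints = heatmap.shape[-1]
    coords = np.zeros((num_keypoints, 2), dtype=np.int32)
    for k in range(num_keypoints):
        y, x = np.unravel_index(np.argmax(heatmap[..., k]), heatmap.shape[:2])
        coords[k] = (x, y)
    return coords
```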

jizhu1023 commented 3 years ago

@vietanhdev Thanks for your reply, which addresses my issues well! The other thing I am confused about is why there are 39 keypoints rather than the 33 or 35 keypoints mentioned in the paper. Looking into the MediaPipe code, I found that keypoints 34-35 are auxiliary landmarks used for ROI generation, and keypoints 36-39 are not used. I further visualized the locations of keypoints 36-39 and found that they coincide with some keypoints on the hands.
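For anyone who wants to reproduce this check, the extra landmarks can be drawn over the input frame; a rough sketch (assumes `regress` from the snippet above and a 256x256 input frame `img`; coordinates are taken as normalized to [0, 1] as described earlier, and indexing is 0-based, so keypoints 36-39 are rows 35-38):

```python
import cv2
import numpy as np

# Assumes `img` is the 256x256 BGR input frame and `regress` has shape (1, 156).
keypoints = np.reshape(regress, (39, 4))
for k in range(35, 39):  # 0-based rows 35-38 correspond to keypoints 36-39
    x = int(keypoints[k, 0] * img.shape[1])  # assumes x, y normalized to [0, 1]
    y = int(keypoints[k, 1] * img.shape[0])
    cv2.circle(img, (x, y), 3, (0, 0, 255), -1)
cv2.imwrite("landmarks_36_39.png", img)
```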