jizhu1023 opened this issue 3 years ago
Thanks for your great work! I have some questions about the network structure. By comparing blazepose_full.py with the visualization image of the original tflite model, I found some differences. First, your implementation omits the "identity_1" output present in the original tflite model. Second, the "identity_2" output size is 156, i.e. 4 × (33 + 6), but the corresponding output size in your implementation is 99, i.e. 3 × 33. Why is your implementation inconsistent with the original model in these aspects? And why is the joints output size 156 in the original model? Many thanks in advance.
vietanhdev commented:

Hello,

Our implementation is a modified version of the original model.

First, for `identity_1`: we don't know the exact purpose of this branch. Since this architecture is designed for tracking, we guess that it predicts whether there is a person in the image. I also verified this assumption by running the pre-trained model with the following code:
```python
import tensorflow as tf
import cv2
import numpy as np

# Load the pre-trained full-body model (39 keypoints).
model = tf.keras.models.load_model('saved_model_full_pose_landmark_39kp')

cap = cv2.VideoCapture(0)
while True:
    _, origin = cap.read()
    # Resize to the model's 256x256 input and normalize.
    img = cv2.resize(origin, (256, 256))
    img = img.astype(np.float32)
    img = (img - 127) / 255
    img = np.array([img])  # add batch dimension -> (1, 256, 256, 3)
    heatmap, classify, regress = model.predict(img)
    # identity_1 output: a single score we interpret as person-presence confidence.
    confidence = np.reshape(classify, (1,))[0]
    print(confidence)
```
For `identity_2`, as explained here, they have 4 outputs for each keypoint:

- x and y: landmark coordinates normalized to [0.0, 1.0] by the image width and height, respectively.
- z: should be discarded, as the model is currently not fully trained to predict depth, but this is something on the roadmap.
- visibility: a value in [0.0, 1.0] indicating the likelihood of the landmark being visible (present and not occluded) in the image.

That's why their output size is 4 × number_of_keypoints. In the pre-trained model we used to implement this repo, number_of_keypoints = 39, so there are 4 × 39 = 156 outputs. I removed the z dimension from the keypoints, so the shape of our output is 3 × number_of_keypoints.
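For illustration, here is a minimal sketch of unpacking the flat 156-value regression output, assuming a keypoint-major (x, y, z, visibility) layout per keypoint. The layout assumption and the function name are mine, based on the description above, not taken from the repo:

```python
import numpy as np

NUM_KEYPOINTS = 39  # value used by the pre-trained model discussed here

def unpack_regression(regress):
    """Unpack the flat regression output (156 = 4 * 39 values) into
    per-keypoint components, assuming a keypoint-major
    (x, y, z, visibility) layout."""
    values = np.reshape(regress, (NUM_KEYPOINTS, 4))
    xy = values[:, 0:2]        # normalized to [0.0, 1.0] by image width/height
    z = values[:, 2]           # depth; per the docs, not reliable yet
    visibility = values[:, 3]  # likelihood the landmark is visible
    # Dropping z leaves 3 values per keypoint, i.e. 3 * number_of_keypoints.
    return xy, z, visibility
```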
Another difference between our model and the original model is that the heatmap output of our model has shape (128, 128, number_of_keypoints), while the original model's heatmap has shape (128, 128, 1). We currently decode the keypoints from this heatmap output; we will revisit this design in the future.
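As a rough sketch of what decoding keypoints from such a per-keypoint heatmap could look like, one common scheme is to take the argmax of each channel independently; this is an illustration of the general technique, not necessarily the exact decoding used in this repo:

```python
import numpy as np

def keypoints_from_heatmap(heatmap):
    """Decode keypoint coordinates from a (128, 128, K) heatmap
    by taking the argmax of each channel independently."""
    h, w, num_kp = heatmap.shape
    coords = np.zeros((num_kp, 2), dtype=np.float32)
    for k in range(num_kp):
        # Location of the hottest pixel in channel k.
        y, x = np.unravel_index(np.argmax(heatmap[:, :, k]), (h, w))
        # Normalize to [0, 1] so the values are comparable to the
        # regression output.
        coords[k] = (x / w, y / h)
    return coords
```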
jizhu1023 commented:

@vietanhdev Thanks for your reply, which addresses my issues well! The other thing I am confused about is why there are 39 keypoints rather than the 33 or 35 keypoints mentioned in the paper. Looking into the MediaPipe code, I found that keypoints 34-35 are auxiliary_landmarks used for ROI generation, and keypoints 36-39 are not used. I also visualized the locations of keypoints 36-39 and found that they coincide with some keypoints on the hands.
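To make that grouping concrete, here is a small sketch splitting the 39 keypoints according to the 1-based indices above (converted to 0-based Python slices); the grouping follows my reading of this comment, not the MediaPipe source itself:

```python
import numpy as np

def split_keypoints(keypoints):
    """Split a (39, ...) keypoint array into the groups described above.
    1-based indices from the discussion, 0-based slices here."""
    body = keypoints[:33]         # keypoints 1-33: pose landmarks from the paper
    auxiliary = keypoints[33:35]  # keypoints 34-35: ROI generation only
    unused = keypoints[35:39]     # keypoints 36-39: reportedly overlap hand keypoints
    return body, auxiliary, unused
```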