Open luca992 opened 2 years ago
Okay, so I just tried a python implementation of tf lite and I still have the same output. So maybe I am just misunderstanding the format of the output? Any help or documentation would be appreciated.
This is the only description I can find online about the output:
https://medium.com/axinc-ai/blazepose-a-3d-pose-estimation-model-d8689d06b7c4
Architecture
The Detector is a Single-Shot Detector (SSD)-based architecture. Given an input image (1,224,224,3),
it outputs bounding boxes (1,2254,12) and confidence scores (1,2254,1). The 12 elements of each
bounding box are of the form (x,y,w,h,kp1x,kp1y,…,kp4x,kp4y), where kp1x to kp4y are additional
keypoints. Each of the 2254 rows has its own anchor; the anchor's scale and offset need to be applied.
There are two ways to use the Detector. In box mode, the bounding box is determined from its position
(x,y) and size (w,h). In alignment mode, the scale and angle are determined from (kp1x,kp1y) and
(kp2x,kp2y), and bounding box including rotation can be predicted.
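The anchor decoding the quote alludes to can be sketched in Python. The config values below (strides 8, 16, 32, 32, 32 on the 224x224 input, fixed unit-size anchors, and a divisor of 224 for the raw offsets) are assumptions taken from MediaPipe's published pose detection settings; treat this as a sketch, not a drop-in implementation:

```python
import numpy as np

def generate_anchors(input_size=224, strides=(8, 16, 32, 32, 32)):
    """Build the 2254 SSD anchors: 28*28*2 + 14*14*2 + 7*7*6.

    Assumes MediaPipe-style anchors: layers sharing a stride are merged
    and fixed_anchor_size makes every anchor 1.0 x 1.0 (normalized).
    """
    layers = []  # [stride, anchors_per_cell]
    for s in strides:
        if layers and layers[-1][0] == s:
            layers[-1][1] += 2
        else:
            layers.append([s, 2])
    anchors = []
    for stride, per_cell in layers:
        grid = input_size // stride
        for y in range(grid):
            for x in range(grid):
                cx, cy = (x + 0.5) / grid, (y + 0.5) / grid
                anchors += [(cx, cy, 1.0, 1.0)] * per_cell
    return np.array(anchors, dtype=np.float32)

def decode(raw_boxes, raw_scores, anchors, scale=224.0):
    """Map raw (2254, 12) offsets onto their anchors; returns normalized coords."""
    out = np.empty_like(raw_boxes)
    out[:, 2] = raw_boxes[:, 2] / scale * anchors[:, 2]    # width
    out[:, 3] = raw_boxes[:, 3] / scale * anchors[:, 3]    # height
    for k in [0] + list(range(4, 12, 2)):                  # x-center + keypoint x's
        out[:, k] = raw_boxes[:, k] / scale * anchors[:, 2] + anchors[:, 0]
    for k in [1] + list(range(5, 12, 2)):                  # y-center + keypoint y's
        out[:, k] = raw_boxes[:, k] / scale * anchors[:, 3] + anchors[:, 1]
    # Confidence is a logit: clip, then sigmoid.
    scores = 1.0 / (1.0 + np.exp(-np.clip(raw_scores, -100.0, 100.0)))
    return out, scores
```

With zero raw offsets, every decoded box center falls exactly on its anchor center, which is a quick way to sanity-check the anchor grid.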
I think I might be starting to understand and I have more specific questions now:
- For box mode, where I utilize the first 4 elements (x, y, w, h): what are x and y relative to? Are they relative to the center of the 224x224 input image?
- For alignment mode, can I safely ignore keypoints 3 and 4?
- Using Alignment mode to crop and rotate images so that poses are always fed to the Pose landmark model with the head up would result in more accurate inference, correct?
- Why are there 2254 different bounding boxes included in the output?
- Would it be correct to only use the bounding box with the highest confidence score?
- Which mode does the BlazePose model packaged in ML Kit use, box or alignment?
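On the last two questions above: each of the 2254 rows corresponds to one SSD anchor, so many anchors fire on the same person. Taking the argmax of the sigmoid-transformed scores is the simplest selection; note that MediaPipe itself applies non-max suppression and blends overlapping candidates rather than taking a single argmax. A minimal sketch of the argmax route (an assumption about usage, not MediaPipe's exact pipeline):

```python
import numpy as np

def best_detection(decoded_boxes, raw_scores, min_score=0.5):
    """Pick the single highest-scoring anchor-decoded box, or None if all are weak.

    decoded_boxes: (N, 12) boxes already decoded against their anchors.
    raw_scores: (N,) confidence logits straight from the model.
    """
    scores = 1.0 / (1.0 + np.exp(-np.clip(raw_scores, -100.0, 100.0)))
    i = int(np.argmax(scores))
    if scores[i] < min_score:
        return None  # no confident pose in the frame
    return decoded_boxes[i]
```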
Hi Luca, did you finally solve it? I'm also stuck on that post-processing.
@Oghost I think I figured out most of them... but I haven't looked at it in a while, so I don't remember the details. I do have working code (written in Kotlin) if you want to look at it; send me an email (address in my bio). I don't think I want to post it here publicly, it's a little rough haha.
Hello there, I am also looking for an explanation of the BlazePose heavy tflite model's outputs. Did anyone find any links or hints?
These are the model's input and output tensor details (from the Python tflite interpreter; all tensors are float32 with no quantization):

Input:
- input_1: shape (1, 256, 256, 3)

Outputs:
- Identity: shape (1, 195)
- Identity_1: shape (1, 1)
- Identity_2: shape (1, 256, 256, 1)
- Identity_3: shape (1, 64, 64, 39)
- Identity_4: shape (1, 117)
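For reference, this layout appears to match MediaPipe's pose landmark model card: Identity is 39 landmarks x 5 values (x, y, z, visibility, presence), Identity_1 is the pose-presence score, Identity_2 the segmentation mask, Identity_3 heatmaps, and Identity_4 the 39 world landmarks x 3. That interpretation is an assumption, not verified against the model; a sketch of unpacking the two flat vectors:

```python
import numpy as np

def split_landmark_outputs(identity, identity_4, input_size=256):
    """Unpack the flat (1, 195) and (1, 117) landmark outputs.

    Layout assumed from MediaPipe's model card: 39 landmarks with
    (x, y, z, visibility, presence) per landmark; x/y are in input-image
    pixels, visibility/presence are logits.
    """
    lm = np.asarray(identity).reshape(39, 5)
    xy = lm[:, :2] / input_size                    # normalize coords to [0, 1]
    visibility = 1.0 / (1.0 + np.exp(-lm[:, 3]))   # logit -> probability
    world = np.asarray(identity_4).reshape(39, 3)  # metric-scale world coords
    return xy, visibility, world
```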
Hello @luca992,
Are you still looking for a resolution on this, or has the issue been resolved? Thank you!
I mean, you guys did absolutely nothing to add better documentation on how the output should be parsed... so I would say no, this has not been resolved.
I am attempting to run inference with the BlazePose Detector TFLite model using the TFLite C API compiled for macOS x86_64.

As far as I can tell my code is working. I am copying an image (scaled down to fit within a 224x224 RGB image with float values between 0.0 and 1.0) into the input tensor, invoking the interpreter, and copying the output tensors of shapes [1, 2254, 12] and [1, 2254, 1] into float arrays. I receive kTfLiteOk for all operations.

However, the output data doesn't make much sense to me. Most of the values are negative. For example, this is the first row of data from the tensor with shape [1, 2254, 12]:

[-6.2709107, -5.0030107, -2.4196007, -9.151258, 4.2085056, 6.054624, -4.6846647, 7.785931, -9.291795, -7.1333103, -26.822824, 12.8004]

and the corresponding first confidence score from the [1, 2254, 1] shaped tensor is [-69.88174].

I don't understand how it is possible for the x and y values (-6.2709107, -5.0030107) from the first output tensor to be negative. Shouldn't they always be between 0 and 224.0, or 0 and 1.0? I expected the confidence score to be positive as well. Am I misunderstanding something?

Also, a general question: why are there 2254 different bounding boxes in the output? Would it be correct to only use the bounding box with the highest confidence score?
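Both observations are consistent with raw SSD outputs: the 12 values per row are offsets relative to that row's anchor (so negatives are normal before decoding), and the confidence output is a logit, not a probability. MediaPipe clips the logit and applies a sigmoid, so a raw score of -69.88 decodes to essentially zero, meaning "no pose at this anchor". A minimal sketch of that score transform (the clipping constant mirrors MediaPipe's score_clipping_thresh and is an assumption):

```python
import math

def logit_to_score(logit, clip=100.0):
    """Convert a raw detector confidence logit to a probability in [0, 1]."""
    logit = max(-clip, min(clip, logit))  # clip to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-logit))
```

Here logit_to_score(-69.88174) is on the order of 1e-31, i.e. that anchor sees no pose; only rows whose transformed score is high are worth decoding at all.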