google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

TfLite C-API BlazePose Body Detector Output #2945

Open luca992 opened 2 years ago

luca992 commented 2 years ago

I am attempting to run inference with the BlazePose Detector TFLite model using the TFLite C API, compiled for macOS x86_64.

As far as I can tell my code is working. I copy an image (scaled down to fit within a 224x224 RGB image, with float values between 0.0 and 1.0) into the input tensor, invoke the interpreter, and copy the output tensors of shapes [1, 2254, 12] and [1, 2254, 1] into float arrays. Every operation returns kTfLiteOk.

However, the output data doesn't make much sense to me. Most of the values are negative. For example, this is the first row of data from the tensor with shape [1, 2254, 12]: [-6.2709107, -5.0030107, -2.4196007, -9.151258, 4.2085056, 6.054624, -4.6846647, 7.785931, -9.291795, -7.1333103, -26.822824, 12.8004], and the corresponding first confidence score from the [1, 2254, 1] tensor is [-69.88174].

I don't understand how the x and y values -6.2709107, -5.0030107 from the first output tensor can be negative. Shouldn't they always be between 0 and 224.0, or 0 and 1.0? And I expected the confidence score to be positive as well. Am I misunderstanding something?

Also, a more general question: why are there 2254 different bounding boxes in the output? Would it be correct to use only the bounding box with the highest confidence score?

luca992 commented 2 years ago

Okay, so I just tried a Python implementation of TF Lite and I get the same output. So maybe I am just misunderstanding the format of the output? Any help or documentation would be appreciated.
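
For reference, this is roughly the Python version (a minimal sketch; the model path and image file are placeholders, and the output tensor order is whatever get_output_details() reports on your copy of the model):

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Load the BlazePose detector model (path is a placeholder).
interpreter = tf.lite.Interpreter(model_path="pose_detection.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize to the 224x224 input and scale pixels to [0, 1], as in my C code.
img = Image.open("person.jpg").convert("RGB").resize((224, 224))
x = np.asarray(img, dtype=np.float32)[None, ...] / 255.0

interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()

boxes = interpreter.get_tensor(output_details[0]["index"])   # (1, 2254, 12)
scores = interpreter.get_tensor(output_details[1]["index"])  # (1, 2254, 1)
print(boxes[0, 0], scores[0, 0])  # same raw negative values as via the C API
```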

luca992 commented 2 years ago

This is the only description I can find online about the output:

https://medium.com/axinc-ai/blazepose-a-3d-pose-estimation-model-d8689d06b7c4

> Architecture
>
> The Detector is a Single-Shot Detector (SSD) based architecture. Given an input image (1, 224, 224, 3), it outputs bounding boxes (1, 2254, 12) and confidence scores (1, 2254, 1). The 12 elements of each bounding box are of the form (x, y, w, h, kp1x, kp1y, …, kp4x, kp4y), where kp1x to kp4y are additional keypoints. Each one of the 2254 elements has its own anchor; the anchor scale and offset need to be applied.
>
> There are two ways to use the Detector. In box mode, the bounding box is determined from its position (x, y) and size (w, h). In alignment mode, the scale and angle are determined from (kp1x, kp1y) and (kp2x, kp2y), and a bounding box including rotation can be predicted.
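
Based on that description, here is my current attempt at decoding in Python. Fair warning: the anchor layout below (strides 8/16/32/32/32, two anchors per layer per cell, fixed anchor size, aspect ratio 1.0) is my reading of MediaPipe's pose detection graph config, not anything documented with the .tflite file itself, so treat it as an assumption:

```python
import numpy as np
from itertools import groupby

def build_anchors(input_size=224, strides=(8, 16, 32, 32, 32)):
    """SSD anchor centers. Assumes fixed_anchor_size and aspect ratio 1.0
    (my reading of MediaPipe's pose detection graph options), so every
    anchor has w = h = 1.0 and only the centers matter. Consecutive layers
    with the same stride share a grid, stacking their anchors per cell."""
    anchors = []
    for stride, layers in groupby(strides):
        per_cell = 2 * len(list(layers))  # 2 anchors per layer per cell
        cells = input_size // stride
        for y in range(cells):
            for x in range(cells):
                anchors.extend([((x + 0.5) / cells, (y + 0.5) / cells)] * per_cell)
    return np.array(anchors)  # (2254, 2) for a 224x224 input

def decode(raw_boxes, raw_scores, anchors, input_size=224):
    """raw_boxes: (2254, 12), raw_scores: (2254, 1), straight from the model.
    Box offsets are in input-pixel units relative to each anchor center,
    which is why raw values can be negative; scores are logits, which is
    why -69.88 shows up (sigmoid(-69.88) is effectively 0)."""
    scores = 1.0 / (1.0 + np.exp(-raw_scores[:, 0]))    # logits -> [0, 1]
    cx = raw_boxes[:, 0] / input_size + anchors[:, 0]   # normalized center
    cy = raw_boxes[:, 1] / input_size + anchors[:, 1]
    w = raw_boxes[:, 2] / input_size                    # normalized size
    h = raw_boxes[:, 3] / input_size
    # the 4 keypoints decode the same way as the box center
    kps = raw_boxes[:, 4:].reshape(-1, 4, 2) / input_size + anchors[:, None, :]
    return np.stack([cx, cy, w, h], axis=-1), kps, scores
```

If that's right, it would explain the negative values in my first post: they're anchor-relative offsets and raw logits, not pixel coordinates or probabilities.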

I think I might be starting to understand, and I have more specific questions now:

  1. For box mode, where I use the first 4 elements (x, y, w, h): what are x and y relative to? Are they relative to the center of the 224x224 input image?
  2. For alignment mode, can I safely ignore keypoints 3 and 4?
  3. Using alignment mode to crop and rotate images, so that poses are always fed to the pose landmark model head-up, should give more accurate inference, correct?
  4. Why are there 2254 different bounding boxes in the output?
  5. Would it be correct to use only the bounding box with the highest confidence score? (See the NMS sketch below.)
  6. Which mode does the BlazePose model packaged in ML Kit use, box or alignment?
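
Regarding questions 4 and 5: each of the 2254 rows is one anchor's prediction, and from what I can tell MediaPipe doesn't just take the argmax; it runs weighted non-max suppression over every candidate above a score threshold. A rough sketch of that idea, operating on the decoded (cx, cy, w, h) boxes and sigmoid scores from above (the thresholds here are made-up placeholders; the real algorithm lives in MediaPipe's NonMaxSuppressionCalculator):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (cx, cy, w, h) box against many, in normalized coords."""
    def to_xyxy(b):
        return np.stack([b[..., 0] - b[..., 2] / 2, b[..., 1] - b[..., 3] / 2,
                         b[..., 0] + b[..., 2] / 2, b[..., 1] + b[..., 3] / 2], axis=-1)
    a, bs = to_xyxy(box), to_xyxy(boxes)
    x1, y1 = np.maximum(a[0], bs[:, 0]), np.maximum(a[1], bs[:, 1])
    x2, y2 = np.minimum(a[2], bs[:, 2]), np.minimum(a[3], bs[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (bs[:, 2] - bs[:, 0]) * (bs[:, 3] - bs[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def weighted_nms(boxes, scores, min_score=0.5, iou_thresh=0.3):
    """Weighted NMS: overlapping candidates are averaged, weighted by score,
    instead of discarded outright. min_score/iou_thresh are guesses."""
    keep = scores > min_score
    boxes, scores = boxes[keep], scores[keep]
    merged, order = [], np.argsort(-scores)
    while order.size:
        overlap = iou(boxes[order[0]], boxes[order]) > iou_thresh
        group = order[overlap]
        w = scores[group][:, None]
        merged.append((boxes[group] * w).sum(axis=0) / w.sum())
        order = order[~overlap]
    return np.array(merged)  # usually a single pose box survives
```
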
Oghost commented 2 years ago

> This is the only description I can find online about the output: https://medium.com/axinc-ai/blazepose-a-3d-pose-estimation-model-d8689d06b7c4 […]

Hi Luca, did you finally solve it? I'm also stuck on that post-processing step.

luca992 commented 2 years ago

@Oghost I think I figured out most of them... but I haven't looked at it in a while, so I don't remember the details. I do have working code (written in Kotlin) if you want to look at it; send me an email at the address in my bio. I don't think I want to post it here publicly, it's a little rough haha.

isonull commented 2 years ago

Hello there, I am also looking for an explanation of the BlazePose heavy TFLite model's outputs. Does anyone have links or hints? Here is what the interpreter reports:

```
Input details:
  input_1      index 0    shape (1, 256, 256, 3)   float32
Output details:
  Identity     index 696  shape (1, 195)           float32
  Identity_1   index 701  shape (1, 1)             float32
  Identity_2   index 565  shape (1, 256, 256, 1)   float32
  Identity_3   index 641  shape (1, 64, 64, 39)    float32
  Identity_4   index 698  shape (1, 117)           float32

(all tensors unquantized: quantization (0.0, 0), no sparsity)
```
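
Not official documentation, but piecing together the model card and the shapes above, here's my guess at what the heavy landmark model's outputs mean. The 39 = 33 pose landmarks + 6 auxiliary split and the 5-values-per-landmark layout are assumptions on my part:

```python
import numpy as np

def parse_landmark_outputs(ld, score, seg, heatmap, world):
    """Hedged interpretation of the BlazePose heavy landmark outputs:
      ld      (1, 195): 39 landmarks x (x, y, z, visibility, presence),
                        x/y in pixels of the 256x256 input crop
      score   (1, 1):   pose-presence score for the whole crop
      seg     (1, 256, 256, 1): person segmentation mask
      heatmap (1, 64, 64, 39):  per-landmark heatmaps
      world   (1, 117): 39 landmarks x (x, y, z) in metric world space
    Only the first 33 landmarks map to the documented pose topology; the
    remaining 6 appear to be auxiliary points used for tracking."""
    landmarks = ld.reshape(39, 5)
    world_landmarks = world.reshape(39, 3)
    presence = 1.0 / (1.0 + np.exp(-float(score)))  # assuming it's a logit
    return landmarks[:33], world_landmarks[:33], presence, seg[0, ..., 0]
```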

kuaashish commented 1 year ago

Hello @luca992,

Are you still looking for a resolution, or has this issue been resolved? Thank you!!

luca992 commented 1 year ago

> Hello @luca992, are you still looking for a resolution, or has this issue been resolved? […]

I mean, you guys did absolutely nothing to add better documentation on how the output should be parsed... so I would say no, this has not been resolved.