maxbbraun / thermal-face

Fast face detection in thermal images
MIT License

Inference with uncompiled tflite model, format of output #15

Closed Michaelszeng closed 3 years ago

Michaelszeng commented 3 years ago

Hi,

I'm trying to perform inference using the uncompiled model thermal_face_automl_edge_fast.tflite and the TF Lite API as linked, since I need to run the code on my Windows computer. The output from interpreter.get_tensor(...) appears to be a 1 by 500 by 4 array of floats (and interpreter.get_output_details()[0]['shape'] returns [1 500 4]). How do I convert this (1, 500, 4) array into bounding boxes for face detection?

maxbbraun commented 3 years ago

Hi Michael!

Have a look at the model metadata in the release notes, particularly the outputTensorRepresentation. You can also look at the code for DetectionEngine, which parses this kind of tensor.

Michaelszeng commented 3 years ago

Thank you for the quick response!

I'm not too sure how to use the code in DetectionEngine. outputTensorRepresentation in the release notes looks like what I need; could you explain a little further what it means? I see "maxDetections": 500, but I'm not quite sure what that refers to. The output tensor simply contains floats, so I'm not sure how it relates to "bounding_boxes", "class_labels", "class_confidences", and "num_of_boxes", the labels under outputTensorRepresentation.

I hope I am not missing something simple--I am a student coder. I really appreciate the help!

maxbbraun commented 3 years ago

Sure! I was suggesting looking at DetectionEngine not necessarily to use that code but to see how they get the bounding boxes from the tensor. The model is one that supports detection as well as classification, so you can ignore the class label since it will always be the same. A quick reading of the code suggests that the coordinates of the bounding boxes are encoded as successive 4-tuples of floats where 1 means the full image width or height.

Michaelszeng commented 3 years ago

Thanks for the reply! Could you explain what you mean by "where 1 means the full image width or height"?

maxbbraun commented 3 years ago

It looked to me like the bounding box coordinates are relative to the image size, with [0, 0] being the top left and [1, 1] being the bottom right, so you'd have to translate them back into pixels by multiplying the x and y values by width and height, respectively.
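
For instance, a minimal sketch of that conversion (the box values and the [ymin, xmin, ymax, xmax] ordering here are assumptions to be checked against the metadata):

```python
# Hypothetical normalized box and image size, for illustration only.
box = [0.25, 0.40, 0.75, 0.60]  # assumed [ymin, xmin, ymax, xmax] order
height, width = 192, 192        # pixel size of the image you ran inference on

# Scale the relative coordinates back into pixels.
ymin, xmin = int(box[0] * height), int(box[1] * width)
ymax, xmax = int(box[2] * height), int(box[3] * width)
print((xmin, ymin), (xmax, ymax))  # top-left and bottom-right corners
```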

Michaelszeng commented 3 years ago

I see, thank you. Do you know why there are 500 4-tuples? I tried creating a bounding box from the first 4-tuple in the way you described, and it doesn't seem correct. The bounding box doesn't bound my face most of the time--it floats around in space.

maxbbraun commented 3 years ago

Could you post the raw tensor output you're seeing?

Michaelszeng commented 3 years ago

Yes, thanks for the reply. I made a real-time version of the code that attempts to draw a bounding box on a live video feed. As I played with the code a bit more, it occurred to me that the bounding box being drawn had some relationship to the position of my face, just not the correct one.

I tried a few more combinations of (X, Y) coordinates from the raw tensor output data, and realized that the bounding box is correct if I use this combination:

```python
h, w, ch = image.shape
# Each detection row is [ymin, xmin, ymax, xmax], normalized to [0, 1].
y1 = int(results[0][0] * h)
x1 = int(results[0][1] * w)
y2 = int(results[0][2] * h)
x2 = int(results[0][3] * w)
cv2.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 2)
```

I believe this means each row of the output array represents a detection (the first row being the highest-confidence detection), and the four columns are the upper-left Y, upper-left X, lower-right Y, and lower-right X, in that order.

That ended up being quite simple; I thought I had already tried this. Thank you for the help!

Michaelszeng commented 3 years ago

I do have one more question. The multi-person detection works really well. However, all 500 detections in the raw output tensor contain actual coordinates, and none of them come with confidence levels. If, for example, there are two people in frame, the first two detections in the array very accurately bound their faces, but the other 498 detections capture random background details. Is there any way to distinguish between detections of faces and "filler" detections of background details?

maxbbraun commented 3 years ago

This suggests that part of the output tensor contains the number of detections. I assume that only that many bounding boxes are valid and the rest are noise.
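
A sketch of that kind of parsing, assuming the flat [boxes | class IDs | scores | count] layout that the engine's index lookups imply (the offsets are illustrative, not taken from the actual engine):

```python
import numpy as np

def parse_detections(raw_result, max_detections=500):
    # Illustrative only: assumes the flat layout
    # [boxes (max_detections * 4) | class_ids | scores | count].
    n = max_detections
    boxes = np.asarray(raw_result[:4 * n]).reshape(n, 4)
    class_ids = raw_result[4 * n:5 * n]
    scores = raw_result[5 * n:6 * n]
    count = int(raw_result[6 * n])
    # Only the first `count` detections are valid; the rest are noise.
    return boxes[:count], class_ids[:count], scores[:count]
```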

Michaelszeng commented 3 years ago

Hi, thank you for the response! I see how it works with DetectionEngine. However, I'm using the TensorFlow Lite API; do you know how I can achieve the same thing with it?

For reference, this is my current code to run an image through the model and retrieve the output:

```python
import numpy as np
from PIL import Image
import tflite_runtime.interpreter as tflite  # or: from tensorflow import lite as tflite

interpreter = tflite.Interpreter(model_path=args.model_file)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape']
output_shape = output_details[0]['shape']

height = input_shape[1]
width = input_shape[2]
img = Image.open(args.image).convert('RGB').resize((width, height))

input_data = np.expand_dims(img, axis=0)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

output_data = interpreter.get_tensor(output_details[0]['index'])
```

output_data is an array with shape (1, 500, 4), so I'm not sure where to find the number of "candidates". Do you know how I could achieve the equivalent of num_candidates = raw_result[self._tensor_start_index[3]] using the TFLite API?
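
For example, to check whether the model actually exposes more than one output tensor (the count and scores might live in separate tensors rather than in the (1, 500, 4) one), I could enumerate them; a quick sketch with the Interpreter API:

```python
# List every output tensor the model exposes. A separate score or
# count tensor, if the model has one, would show up here alongside
# the (1, 500, 4) boxes tensor.
for i, detail in enumerate(interpreter.get_output_details()):
    print(i, detail['name'], detail['shape'])
    print(interpreter.get_tensor(detail['index']))
```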

Thanks.