di-jabil opened this issue 5 years ago
@di-jabil
Thanks for the summary! Great effort on root-causing.
There are two possible reasons I can think of: 1) platform-dependent behavior, since you are exporting on macOS and compiling on Ubuntu. Could you try exporting and compiling both on Ubuntu? 2) the graph definition is wrong, possibly due to a recent change in the TensorFlow Object Detection API. The output names are called concat and concat_1. Do you still have your TensorBoard logs?
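If the TensorBoard logs are gone, the output names can also be checked directly from the frozen graph. Below is a minimal sketch, assuming a TF 1.x-style frozen graph saved as frozen_inference_graph.pb (the path is an assumption; adjust it to your own export):

    import tensorflow as tf

    # Sketch: load the frozen graph and print every node name so the detection
    # outputs (expected to be "concat" and "concat_1") can be verified
    # without TensorBoard.
    graph_def = tf.compat.v1.GraphDef()
    with open('frozen_inference_graph.pb', 'rb') as f:  # path is an assumption
        graph_def.ParseFromString(f.read())
    for node in graph_def.node:
        print(node.op, node.name)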
@weiranzhao
Thanks for the reply. Please see my answers below.
1. I was going to try exporting on Ubuntu, and it is still on my TODO list. I may give it a try after the immediate project is done. I will keep you updated.
2. My graph from TensorBoard is attached here. Do you think it helps?
I much appreciate your involvement. Looking forward to your further instructions.
I just ran into this same issue, and I am using only Ubuntu 18.04. I thought something was strange, because everyone said that line 84:
assert len(logit_scores) == 4 * _NUM_ANCHORS
had to have the constant updated to the number of classes to be trained (5 in my case), but the script was asserting for box_encodings in the line that followed, so I updated that instead. It worked then, but my detections were all messed up (hundreds of bogus ones). I reverted the box_encodings line, set the logit_scores assertion back to 5 * _NUM_ANCHORS, and swapped the arguments in the call to _decode_detection_result like you did, and it started working perfectly! Can't thank you enough for this catch!
Hi All,
As you may have seen in issue #563, I have been struggling to load a retrained object detection model onto the AIY kit. After two days of digging, I think I have found something really interesting:
(Please refer to #563 for more background information. Here I would like to give a cleaned-up summary.)
First off: I am using a MacBook with macOS 10.12.6 for most of the operations described below, except for the compiling. The compiler runs on an Ubuntu 18.04.1 64-bit virtual machine. The AIY image is AIY Kits Release 2018-11-16.
Initial Scenario: I am following the tutorial on the AIY homepage to create a custom object detection project.
Some key points:
The training went well. The exporting went well too. I was able to run local evaluations with the exported graph (the local evaluations were on the Mac too).
I then moved the frozen graph to the Ubuntu virtual machine. The compiling had no problem.
Then I used scp to load the compiled binaryproto onto the AIY hardware. I first tried any_model_camera.py and had no problems.
Then I moved the binaryproto file to /opt/aiy/models/ and modified ~/AIY-projects-python/src/aiy/vision/models/object_detection.py to load this binaryproto and to reflect the new number of labels (2, instead of 4).
Please refer to #563 for more details on the code changes.
I was basically following the instructions in this blog: https://cogint.ai/custom-vision-training-on-the-aiy-vision-kit/
Then I ran the object detection demo and received the AssertionError reported in #563:

    Traceback (most recent call last):
      File "/home/pi/AIY-projects-python/src/examples/vision/object_detection.py", line 73, in <module>
        main()
      File "/home/pi/AIY-projects-python/src/examples/vision/object_detection.py", line 59, in main
        objects = object_detection.get_objects(result, args.threshold, offset)
      File "/opt/aiy/projects-python/src/aiy/vision/models/object_detection.py", line 269, in get_objects
        objs = _decode_detection_result(logit_scores, box_encodings, threshold, size, offset)
      File "/opt/aiy/projects-python/src/aiy/vision/models/object_detection.py", line 88, in _decode_detection_result
        assert len(logit_scores) == _NUM_LABELS * _NUM_ANCHORS
    AssertionError
Error: Basically, the number of anchors is always 1278 (loaded from the txt file). According to the blog https://cogint.ai/custom-vision-training-on-the-aiy-vision-kit/, the number of logit_scores is supposed to be the number of labels times the number of anchors, and the number of box_encodings should remain 4 times the number of anchors.
So supposedly, the numbers should have been:
number of anchors = 1278
number of logit_scores = 1278 * 2 (2 labels: target and background) = 2556
number of box_encodings = 1278 * 4 = 5112
But my print statements show that in my case the numbers of logit_scores and box_encodings got reversed:
number of anchors = 1278
number of logit_scores = 5112
number of box_encodings = 2556
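The print statements were nothing fancy; roughly the following sketch, added just before the failing assert in _decode_detection_result (the exact spot in your copy of the file may differ):

    # Debug prints (a sketch) added right before the failing assertion:
    print('anchors:', _NUM_ANCHORS)                         # 1278
    print('logit_scores:', len(logit_scores))               # 5112 in my case
    print('box_encodings:', len(box_encodings))             # 2556 in my case
    print('expected scores:', _NUM_LABELS * _NUM_ANCHORS)   # 2 * 1278 = 2556
    print('expected boxes:', 4 * _NUM_ANCHORS)              # 4 * 1278 = 5112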
What I Did Then: Because only a little was changed on the AIY side, I suspected that something was wrong with my model. I saw the whole process as train -> export -> compile -> run. Therefore I planned to try other models at various stages of the process and see if I could figure out where the broken link was.
I tried several other binaryprotos, frozen graphs, and checkpoints:
From my experiments, all the models that I exported myself had the problem, while all the models that I did NOT export myself had no problem. So it seems like the exporting is the culprit. But why and how? I couldn't figure it out. I was following this tutorial to export: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
Solution: I couldn't figure out what went wrong with the exporting. I suspect that it is related to the operating system, but that is beyond my capability.
Before giving up, I wanted to try one more thing. So I reversed the order of logit_scores and box_encodings at line 266 of aiyprojects/src/aiy/vision/models/object_detection.py:
    #objs = _decode_detection_result(logit_scores, box_encodings, threshold, size, offset)
    objs = _decode_detection_result(box_encodings, logit_scores, threshold, size, offset)
Then it worked. The demo returned "Object #0: kind=TARGET(1), score=0.926968, bbox=(223, 186, 113, 115)" (see the attached result).
And I tested a few more, and they all worked.
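A slightly more defensive variant of the same workaround is sketched below. It is only an illustration, not code from the AIY repo: the helper name is made up, and it assumes the _NUM_LABELS and _NUM_ANCHORS values already used by the decoder. Instead of hard-coding the swap, it picks the two buffers by length:

    # Sketch only: decide which buffer is which by length, so the decode call
    # works whether or not the exporter reversed the two outputs.
    def _order_outputs(first, second, num_labels, num_anchors):
        if len(first) == num_labels * num_anchors and len(second) == 4 * num_anchors:
            return first, second   # already (logit_scores, box_encodings)
        if len(first) == 4 * num_anchors and len(second) == num_labels * num_anchors:
            return second, first   # reversed by the export, so swap back
        raise ValueError('unexpected output sizes: %d and %d' % (len(first), len(second)))

    # With 2 labels and 1278 anchors this resolves the 5112/2556 mix-up above.
    # (If num_labels happened to be 4, the two lengths would be identical and
    # this check could not tell the buffers apart.)
    logit_scores, box_encodings = _order_outputs(
        logit_scores, box_encodings, _NUM_LABELS, _NUM_ANCHORS)
    objs = _decode_detection_result(logit_scores, box_encodings, threshold, size, offset)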
Thanks: As I said, figuring out an explanation for what happened is beyond me at this point, so I just wanted to share this experience with you all. Hopefully it can help someone a bit and raise attention to this issue.
Thank you for reading this far. Let me know if you have any questions.