google-coral / edgetpu

Coral issue tracker (and legacy Edge TPU API source)
https://coral.ai
Apache License 2.0

Internal compiler error. Aborting! #100

Closed Eashwar93 closed 4 years ago

Eashwar93 commented 4 years ago

System overview:

- Ubuntu 18.04
- TF-GPU 1.15 installed from binary

Problem:

I am trying to compile a quantized TFLite model that was converted from a frozen graph for pose estimation (OpenPose). I was able to generate the fully quantized tflite model, but I am unable to compile it with edgetpu_compiler. Since the TF model was generated with TF 1.x, I used the TFLiteConverter from TF 1.15. Below is the code I used for the conversion:

```python
import tensorflow as tf
import numpy as np

def representative_dataset_gen():
    for _ in range(100):
        fake_image = np.random.random((1, 432, 368, 3)).astype(np.float32)
        yield [fake_image]

graph_pb = 'graph_freeze.pb'
inp = ['image']
out = ['Openpose/concat_stage7']

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_pb, inp, out, input_shapes={"image": [1, 432, 368, 3]})
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

f = open("tflite_model/mobilenet_thin_openpose_opt_fullint_tf1.tflite", "wb")
f.write(tflite_model)
f.close()
print("conversion complete")
```

I then tried to compile this and got the following error:

```
$ edgetpu_compiler mobilenet_thin_openpose_opt_fullint_tf1.tflite
Edge TPU Compiler version 2.1.302470888
Internal compiler error. Aborting!
```

I went through similar issues but could not find a working solution. It would be great if I could get past this issue.

Namburger commented 4 years ago

@Eashwar93 Apologies, most compiler issues are reported to our internal teams for evaluation and fixes. Since the compiler is not open source at this time, fixes are usually delayed until the next release, which is why you may have found similar issues that are not yet fixed. Do you mind trying this solution in the meantime?

Eashwar93 commented 4 years ago

@Namburger Thanks. That actually feels like a long way around: I use the same TFLite converter, except that he converts from a Keras model while I convert directly from a frozen graph. In theory there should be no problem, since in the end I have a fully quantized tflite model to pass to the Edge TPU compiler.

Namburger commented 4 years ago

@Eashwar93 Could you share the cpu model? I can take a look at it. In general, I don't think there should be a difference between a frozen graph and a .h5 model, but we do have limited compatibility at the moment, something we are working to expand.

Eashwar93 commented 4 years ago

@Namburger Sure:

[Screenshot from 2020-04-24 15-23-44: CPU info]

Namburger commented 4 years ago

@Eashwar93 Sorry, I meant the CPU's tflite model, not the CPU info.

Eashwar93 commented 4 years ago

@Namburger Oops sorry my bad.

https://drive.google.com/file/d/1BZ0gVX00Vnqhthx_wxmMXQAbkisSAftN/view?usp=sharing

I hope you are able to download the model

Namburger commented 4 years ago

@Eashwar93 Taking a look at this now. Sorry that our compiler currently doesn't give a more detailed error message, I know it's frustrating. To confirm, you are using tf1.15 for PTQ, correct?

Eashwar93 commented 4 years ago

@Namburger it is completely fine :) Yes I use tf-1.15 for ptq

Namburger commented 4 years ago

@Eashwar93 Are you familiar with the visualize tool or Netron for inspecting your model? There is a layer named Openpose/MConv_Stage3_concat with 2 preceding layers that have mismatching quantization parameters (scale and zero point), and this causes the compiler to reject the entire model. I wonder if the CPU model gives you the expected results? This is most likely a bug in the tflite converter tool, could you open a bug here also?
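If it helps, here's a rough sketch (nothing official, just the standard TFLite Python interpreter, using the model filename from this thread) of how you could dump the per-tensor scale and zero point yourself and compare the tensors feeding that concat:

```python
# Sketch: dump per-tensor quantization params of the CPU tflite model
# so mismatching scales/zero points around a concat can be spotted.
import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path="mobilenet_thin_openpose_opt_fullint_tf1.tflite")
interpreter.allocate_tensors()

for t in interpreter.get_tensor_details():
    # 'quantization' is the legacy per-tensor (scale, zero_point) pair;
    # per-channel weight tensors are not fully described by it.
    scale, zero_point = t['quantization']
    if scale:  # skip tensors without quantization parameters
        print(t['index'], t['name'], scale, zero_point)
```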

On another note, maybe you could try using TF2 for PTQ (maybe the above bug is fixed there). PTQ with TF2 will cause your model's i/o tensors to be of type float. We are still at an experimental stage of supporting this, but the compiler should allow float i/o now; here is an example usage in the form of a test! If it still fails, please re-attach the model so I can check again. Apologies for this issue, and please also check out project-posenet; there are too many layers of dependencies here to give a straightforward fix :/
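A rough sketch of what that TF2 flow could look like, assuming the network is first exported as a SavedModel (TF2's converter has no from_frozen_graph; "saved_model_dir" below is a placeholder, not a path from this thread):

```python
# Rough sketch of TF2-style post-training quantization with float i/o.
import numpy as np
import tensorflow as tf

def representative_dataset_gen():
    for _ in range(100):
        yield [np.random.random((1, 432, 368, 3)).astype(np.float32)]

# Placeholder path: the frozen graph would need to be re-exported as a SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# inference_input_type / inference_output_type are left at their defaults,
# so the i/o tensors stay float, which the compiler should now accept.
tflite_model = converter.convert()

with open("model_tf2_ptq.tflite", "wb") as f:
    f.write(tflite_model)
```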

Eashwar93 commented 4 years ago

@Namburger Ah ok, thanks a ton. I will look into it and open a bug on the TensorFlow GitHub as well. I tried using Netron just now and could not find any quantization parameters there, so I am trying visualize.py now. Once I can pinpoint the issue in the TFLite converter I will start a thread with the TensorFlow team. I also tried to convert with TF 2.1; that failed, which I think is because this network uses tf.contrib, which is no longer supported in TF 2.x. I'm not sure that's the cause, but it's the most likely one I can think of. I also checked out your project-posenet. It's great; I wanted to compare it with OpenPose, which is why I embarked on this task. Thanks a lot for your help, I will keep you posted. :)

Eashwar93 commented 4 years ago

@Namburger I went through the model and I'm unable to spot what is actually wrong. Apologies if my question is too trivial, but I'm new to this entire framework. So you were referring to the tensor Openpose/MConv_Stage3_concat; I took a look at it.

[Screenshot from 2020-04-27 11-26-16]

I then went to the ops section to see the inputs for the output tensor [217] and found them to be:

```
387  Openpose/MConv_Stage2_L1_5_pointwise/BatchNorm/FusedBatchNorm_requantized  INT8  [1, 54, 46, 38]
388  Openpose/MConv_Stage2_L2_5_pointwise/BatchNorm/FusedBatchNorm_requantized  INT8  [1, 54, 46, 19]
389  feat_concat_requantized  INT8  [1, 54, 46, 864]
```

[Screenshot from 2020-04-27 11-31-28]

When I checked the quantization params of these input tensors, there was no mismatch, as shown below.

[Screenshot from 2020-04-27 11-36-14]

Apologies again, but I really want to solve this issue.

Thanks

Namburger commented 4 years ago

@Eashwar93 No worries, I find the visualize output not super intuitive to read :/ These are the offenders:

```
387 | Openpose/MConv_Stage2_L1_5_pointwise/BatchNorm/FusedBatchNorm_requantized | INT8 | [1, 54, 46, 38] | None | 0 | {'quantized_dimension': 0, 'scale': [0.12664233], 'min': None, 'max': None, 'details_type': 0, 'zero_point': [-128], 'details': None}
392 | feat_concat_requantized | INT8 | [1, 54, 46, 864] | None | 0 | {'quantized_dimension': 0, 'scale': [0.12680078], 'min': None, 'max': None, 'details_type': 0, 'zero_point': [-127], 'details': None}
```

The problem is that, AFAIK, there isn't a way to fix this on the user side. Did you check the model's accuracy after the tflite conversion?
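A quick way to sanity-check the CPU model would be something like the sketch below (it just runs a random uint8 input through the Python interpreter, the same way your representative dataset did; real images would be needed for an actual accuracy check):

```python
# Sketch: run the CPU tflite model with the TFLite Python interpreter
# to sanity-check its output before worrying about the Edge TPU compile.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path="mobilenet_thin_openpose_opt_fullint_tf1.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Input is uint8 because of inference_input_type in the conversion above.
fake_image = np.random.randint(
    0, 256, size=input_details[0]['shape'], dtype=np.uint8)
interpreter.set_tensor(input_details[0]['index'], fake_image)
interpreter.invoke()

output = interpreter.get_tensor(output_details[0]['index'])
print(output.shape, output.dtype)
```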

Eashwar93 commented 4 years ago

@Namburger Thank you. I guess with this info I can file a bug with the TF team. No, I haven't tested the model after conversion; I will do that as well, probably in Python, just to test. For now I'm exploring posenet in C++.

OmriTreidel commented 4 years ago

@Namburger I am running into similar issues with my model (trained and quantized with TF-2.4) and I am trying to understand how to locate such mismatches. I've inspected the model @Eashwar93 attached to this ticket using Netron but was not able to find the mismatch you found.

Inspection of the tensor Openpose/MConv_Stage3_concat inputs:

```
name: Openpose/MConv_Stage2_L1_5_pointwise/BatchNorm/FusedBatchNorm_requantized
type: int8[1,54,46,38]    quantization: 0.12664233148097992 * (q - -128)    location: 387

name: Openpose/MConv_Stage2_L2_5_pointwise/BatchNorm/FusedBatchNorm_requantized
type: int8[1,54,46,19]    quantization: 0.12664233148097992 * (q - -128)    location: 388

name: feat_concat_requantized
type: int8[1,54,46,864]   quantization: 0.12664233148097992 * (q - -128)    location: 389
```

This shows that the quantization is consistent and the locations are 387, 388 and 389.

The tensor at location 392 (which you mentioned is not compatible with 387) is used in Openpose/MConv_Stage2_concat together with 390 and 391:

```
name: Openpose/MConv_Stage1_L1_5_pointwise/BatchNorm/FusedBatchNorm_requantized
type: int8[1,54,46,38]    quantization: 0.1268007755279541 * (q - -127)    location: 390

name: Openpose/MConv_Stage1_L2_5_pointwise/BatchNorm/FusedBatchNorm_requantized
type: int8[1,54,46,19]    quantization: 0.1268007755279541 * (q - -127)    location: 391

name: feat_concat_requantized
type: int8[1,54,46,864]   quantization: 0.1268007755279541 * (q - -127)    location: 392
```

I was not able to find where the two tensors you've mentioned are being concatenated.

Can you please elaborate on how you found the issue?

Just for completeness, here is a link to the model I'm trying to compile, which fails with the same error:

https://drive.google.com/file/d/1yrTSm3A8u4_ORyqcs_yGD-jxQOXewTGR/view?usp=sharing