hhk7734 commented 4 years ago

@hhk7734 Very interesting to see v4 tiny on Edge TPU. I have two questions

What ops were not mapped to the TPU?
Did you quantize (post-training/training-aware) to INT8?

Thanks

Originally posted by @ankandrew in https://github.com/hhk7734/tensorflow-yolov4/issues/4#issuecomment-670947207

hhk7734 commented 4 years ago

Several tests are in progress on the TPU. I will comment 1~2 days later.

ankandrew commented 4 years ago

I should have opened a new issue =) Perfect, I am interested in the results.

hhk7734 commented 4 years ago

The yolov4-tiny full int8 model is shown below. The red area is the part that cannot be converted.

[image: yolov4-tiny-relu]

hhk7734 commented 4 years ago

There was a test script, but I forgot to mention it: https://github.com/hhk7734/tensorflow-yolov4/blob/master/test/make_edgetpu_tflite.ipynb

hhk7734 commented 4 years ago

❯ edgetpu_compiler yolov4-tiny-relu.tflite
Edge TPU Compiler version 14.1.317412892

Model compiled successfully in 764 ms.

Input model: yolov4-tiny-relu.tflite
Input size: 5.96MiB
Output model: yolov4-tiny-relu_edgetpu.tflite
Output size: 6.19MiB
On-chip memory used for caching model parameters: 5.92MiB
On-chip memory remaining for caching model parameters: 208.25KiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 150
Operation log: yolov4-tiny-relu_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 51
Number of operations that will run on CPU: 99
See the operation log file for individual operation details.

[image: yolov4-tiny-relu_edgetpu]
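
The share of operations that landed on the Edge TPU can be read off the two count lines in the compiler summary above. A minimal sketch, assuming the summary keeps exactly this wording; the helper name `mapped_fraction` is made up for illustration and is not part of any Coral tool:

```python
import re

# The two count lines copied from the compiler summary above.
SUMMARY = """\
Number of operations that will run on Edge TPU: 51
Number of operations that will run on CPU: 99
"""


def mapped_fraction(summary: str) -> float:
    """Return the fraction of operations the compiler placed on the Edge TPU."""
    tpu = int(re.search(r"run on Edge TPU: (\d+)", summary).group(1))
    cpu = int(re.search(r"run on CPU: (\d+)", summary).group(1))
    return tpu / (tpu + cpu)


print(f"{mapped_fraction(SUMMARY):.1%} of operations run on the Edge TPU")
```

For this run that is 51 of 150 operations, so about a third of the graph is on the TPU.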

hhk7734 commented 4 years ago

TF 2 is not yet stable. Depending on the version, it may or may not be converted. A high version doesn't mean it works.

ankandrew commented 4 years ago

Thanks for the detailed response =) Any possible reasons why the yolo heads don't get mapped to the TPU? I don't see any unsupported op here.

Also, a good thing to keep in mind is the TPU slow-down when there is Off-chip memory used (some experiments). Do you think this can happen once the heads (everything in red circle) get mapped to tpu?

hhk7734 commented 4 years ago

In my tests, Add, Sub, and Mul are each supported, but when more than three (or maybe four) of these operations appear consecutively, that part is not converted.

Model: "YOLOv4Tiny"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
CSPDarknet53Tiny (CSPDarknet ((None, 38, 38, 256), (No 3633632   
_________________________________________________________________
PANetTiny (PANetTiny)        ((None, 38, 38, 255), (No 2429182   
_________________________________________________________________
YOLOv3HeadTiny (YOLOv3HeadTi ((None, 38, 38, 255), (No 0         
=================================================================
Total params: 6,062,814
Trainable params: 6,056,606
Non-trainable params: 6,208
_________________________________________________________________

yolov4-tiny has 6M params. The head contains constant matrices and (I think) is therefore not included in the summary. I'll look at it :). This is really useful information. Thanks.

agjunyent commented 4 years ago

I've been doing lots of tests also with yolov4-tiny on Google Coral and the best solution is to split the model in two:

Darknet and Features
Features to Boxes

The first part is easily convertible to tflite->edgetpu, but the second part ALWAYS has problems (ops not mapped to the TPU, unsupported ops, etc.).

In the end I convert the last part (Features to Boxes) to pure vectorized numpy instead of Tensorflow or Tflite, and run that on CPU. I get an inference time of ~25ms, getting 40fps with this method.

I can share code if you want to take a look

hhk7734 commented 4 years ago

> I've been doing lots of tests also with yolov4-tiny on Google Coral and the best solution is to split the model in two:
>
> Darknet and Features
>
> Features to Boxes
>
> The first part is easily convertible to tflite->edgetpu, but the second part ALWAYS has problems (ops not mapped to the TPU, unsupported ops, etc.).
>
> In the end I convert the last part (Features to Boxes) to pure vectorized numpy instead of Tensorflow or Tflite, and run that on CPU. I get an inference time of ~25ms, getting 40fps with this method.
>
> I can share code if you want to take a look

WOW 😲 I tried that too, but the Darknet and Features part was 50ms~. What's your TF and edgetpu_compiler version????

Your results are really impressive. I want it!!!

tgx-lim commented 4 years ago

Hi, First of all, thank you for your great efforts. Can you share 'yolov4-tiny-relu.weights'? And, when creating this weight, what is the difference from the 'yolov4-tiny.weights'? How can I make a yolov4-tiny-relu.weights from yolov4-tiny.weights?

hhk7734 commented 4 years ago

@tgx-lim

yolov4-tiny-relu.weights

Since the model is still being trained, the mAP score is below expectations.

from yolov4.tf import YOLOv4

yolo = YOLOv4(tiny=True)

yolo.classes = "dataset/coco.names"

yolo.make_model(activation1="relu")
yolo.load_weights("yolov4-tiny-relu.weights", weights_type="yolo")

yolo.inference("image.png")

yolov4-tiny-relu uses relu instead of leaky-relu. If you want a weights file, you have to train the model.

ankandrew commented 4 years ago

@hhk7734

Did you also think about using relu6 instead of relu?

I'm not sure of the benefits of one over the other, though: relu6 looks better for later quantization, but on the other hand the EfficientNet-EdgeTPU models use plain ReLU.

albertfaromatics commented 4 years ago

> > I've been doing lots of tests also with yolov4-tiny on Google Coral and the best solution is to split the model in two:
> >
> > Darknet and Features
> >
> > Features to Boxes
> >
> > The first part is easily convertible to tflite->edgetpu, but the second part ALWAYS has problems (ops not mapped to the TPU, unsupported ops, etc.). In the end I convert the last part (Features to Boxes) to pure vectorized numpy instead of Tensorflow or Tflite, and run that on CPU. I get an inference time of ~25ms, getting 40fps with this method. I can share code if you want to take a look
>
> WOW 😲 I tried that too, but the Darknet and Features part was 50ms~. What's your TF and edgetpu_compiler version????
>
> Your results are really impressive. I want it!!!

I will share my code during the weekend!

TF is tf_nightly-2.2.0.dev20200422-cp36-cp36m-manylinux2010_x86_64

Edgetpu_compiler is the latest version

I use google colab to convert the model

hhk7734 commented 4 years ago

@ankandrew I'm not sure what's better. I'll try relu6 after finishing relu test. :)

IlkayW commented 4 years ago

I have experienced significant drops in mAP if relu-6 is used. However, this result was from TinyYOLOv3.

raz-SX commented 4 years ago

> I've been doing lots of tests also with yolov4-tiny on Google Coral and the best solution is to split the model in two:
>
> Darknet and Features
>
> Features to Boxes
>
> The first part is easily convertible to tflite->edgetpu, but the second part ALWAYS has problems (ops not mapped to the TPU, unsupported ops, etc.).
>
> In the end I convert the last part (Features to Boxes) to pure vectorized numpy instead of Tensorflow or Tflite, and run that on CPU. I get an inference time of ~25ms, getting 40fps with this method.
>
> I can share code if you want to take a look

Hi @agjunyent, amazing work! How many supported operations did you manage to get? Can I have a look at your code?

simondenhaene commented 4 years ago

Hi @agjunyent, I'm really impressed by the performances that you get. I would be very interested to have a look at your code.

JimBratsos commented 4 years ago

@agjunyent as everyone said here, likewise, I'm waiting for your code eagerly.

agjunyent commented 4 years ago

I'll organize and share the code during this week!!!

JimBratsos commented 4 years ago

Hey, I know no one likes bumps that much but is there any progress? I'm really looking forward to your code ! @agjunyent

agjunyent commented 4 years ago

Hey! Sorry for the delay. Been a busy week...

So I'll try to explain how I do it to get around 25ms inference time.

First of all, I train the model in Google Colab (just because my PC cannot do it) using the code from the .zip file I attach here: train_inference_yolo.zip

Inside you will see several files, but the most important are:

- train.py -> training. There are some parameters that you can play with. The most important and required ones are:
  - CLASSES_NAMES: a ".names" file with the names of your classes, one per line, WITHOUT a newline at the end
  - TRAIN/VAL_DATASET: .tfrecord file of your dataset, created using this repo: https://github.com/mwindowshz/YoloToTfRecords
  - LOAD_WEIGHTS: whether to resume training or start from scratch, hdf5 format

  All the other parameters are self-explanatory.
- convert.py -> convert TF to TFLite for the Coral compiler. Just change the parameters to match your model. Then use the compiler to get the _coral.tflite version of the weights.
- inference.py -> inference. Only demo code. You can adapt that code for your purpose. Just change:
  - WEIGHTS (l174): _coral.tflite format of the weights of the model
  - INPUT_IMAGE: path to the input image for inference
  - NUM_CLASSES: how many classes the model has been trained with

Try downloading the files and playing with them. I've tested them a bit, but not thoroughly, so expect some bugs. If you have any questions, just ask!

ownbee commented 4 years ago

@agjunyent Thank you for sharing! I'm trying to use train.py with pre-trained relu weights linked in this repo without success. Do you know the steps to do this?

agjunyent commented 4 years ago

Let me try today to use the weights linked in this repo. I modified the code quite a bit, both training and inference, to have 100% of operations mapped to the TPU, and to use numpy vectorization for the ones that could not be mapped, so plain Tensorflow is not used at all.

I'll get back when I have something

JimBratsos commented 4 years ago

Thanks @agjunyent for the files! I have a question. Is there a way to use only the convert.py, with yolov4-tiny weights so that I convert them to .tflite, or I need to do the entire process in order to retrain it? I tried using the convert.py but I could not convert the weights, so I suppose I need to run the training first right? I suppose you're using coco to retrain it?

Thanks for your help once again

ichakroun commented 4 years ago

> TF 2 is not yet stable. Depending on the version, it may or may not be converted. A high version doesn't mean it works.

Which version are you using? It doesn't convert with tf-gpu 2.2:

"RuntimeError: Unsupported output type INT8 for output tensor 'Identity' of type FLOAT32."

itsmasabdi commented 4 years ago

@agjunyent

Thanks for sharing your code here.

I tried compiling the model for the Edge TPU and ended up with the following log file; inference takes a massive 1500 ms on the Edge TPU. I was wondering if you have any pointers, or if you can share some more info about the environment you used to obtain your 25 ms performance. I tried multiple TF versions and was only able to successfully compile the model using tf-nightly-2.5.0.

Would appreciate any help here.

Edge TPU Compiler version 14.1.317412892
Input: converted_model.tflite
Output: converted_model_edgetpu.tflite
Operator                       Count      Status
CONV_2D                        19         Mapped to Edge TPU
CONV_2D                        2          More than one subgraph is not supported
QUANTIZE                       9          Mapped to Edge TPU
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_NEAREST_NEIGHBOR        1          Operation version not supported
CONCATENATION                  6          Mapped to Edge TPU
CONCATENATION                  1          More than one subgraph is not supported
PAD                            2          Mapped to Edge TPU
MAX_POOL_2D                    3          Mapped to Edge TPU
SPLIT                          3          Mapped to Edge TPU
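
To see at a glance how much of such a log is actually mapped, the rows can be tallied by status. This is a rough sketch that assumes the three-column layout shown above (operator, count, status); `tally_by_status` is a hypothetical helper, not something shipped with the compiler:

```python
import re
from collections import Counter

# Rows copied from the operation log above.
LOG = """\
CONV_2D                        19         Mapped to Edge TPU
CONV_2D                        2          More than one subgraph is not supported
QUANTIZE                       9          Mapped to Edge TPU
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_NEAREST_NEIGHBOR        1          Operation version not supported
CONCATENATION                  6          Mapped to Edge TPU
CONCATENATION                  1          More than one subgraph is not supported
PAD                            2          Mapped to Edge TPU
MAX_POOL_2D                    3          Mapped to Edge TPU
SPLIT                          3          Mapped to Edge TPU
"""


def tally_by_status(log: str) -> Counter:
    """Sum the operation counts for each status string in the log."""
    counts = Counter()
    for line in log.splitlines():
        match = re.match(r"(\S+)\s+(\d+)\s+(.+)", line)
        if match:  # header and blank lines do not match and are skipped
            _, count, status = match.groups()
            counts[status.strip()] += int(count)
    return counts


counts = tally_by_status(LOG)
mapped = counts["Mapped to Edge TPU"]
unmapped = sum(counts.values()) - mapped
print(f"mapped: {mapped}, not mapped: {unmapped}")
for status, count in counts.items():
    print(f"{count:3d}  {status}")
```

In this log only a handful of operations are left on the CPU, so the 1500 ms is more likely tied to where they sit in the graph (the "More than one subgraph is not supported" rows) than to how many there are; that reading is a guess, not something the log states.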

hhk7734 commented 4 years ago

In my tests:

yolov4-tiny-relu with head (224x224): 21 ms ~
yolov4-tiny-relu with head (608x608): 132 ms ~

I think this model can speed up by finding a few optimization methods, but is there any way to dramatically speed it up as long as we don't change the model itself?

edgeTPU benchmarks: https://coral.ai/docs/edgetpu/benchmarks/

paradigmn commented 3 years ago

Hi, please correct me if I am wrong, but I consider the current implementation of the Edge TPU export to be broken. When using the most recent TensorFlow release (2.3.1), a tflite model is exported, but the edgetpu-compiler won't work with it. I noticed this problem is caused by TF switching from the TOCO tflite converter to a new one. The new converter can't handle the tf.exp() op in the graph.

I was able to get it working by disabling the new converter. This is certainly not a permanent solution, however it works with all tf versions including 2.5-nightly. When using the most recent compiler with the new -a flag, 97 of 128 operations run on the edge tpu.

If desired, I can make a pull request, with the necessary changes. I can add a script for exporting the edge tpu and onnx model too.

hhk7734 commented 3 years ago

I compiled the tflite with -a.

$ edgetpu_compiler -a yolov4-tiny-relu-int8.tflite
Edge TPU Compiler version 15.0.340273435

Model compiled successfully in 1105 ms.

Input model: yolov4-tiny-relu-int8.tflite
Input size: 5.96MiB
Output model: yolov4-tiny-relu-int8_edgetpu.tflite
Output size: 6.28MiB
On-chip memory used for caching model parameters: 6.06MiB
On-chip memory remaining for caching model parameters: 716.25KiB
Off-chip memory used for streaming uncached model parameters: 3.38KiB
Number of Edge TPU subgraphs: 2
Total number of operations: 149
Operation log: yolov4-tiny-relu-int8_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 98
Number of operations that will run on CPU: 51
See the operation log file for individual operation details.

Compared to compiling without -a, 48 more operations run on the Edge TPU, but it is slower than before.

input_size = (512, 384)
without -a, FPS: 11 ~ 12
with -a, FPS: 9 ~ 10

paradigmn commented 3 years ago

My results are quite the opposite. I used tiny_yolov4 with relu activation and the weights provided by your repo. The input tensor has a size of (608, 608, 3). With the -a flag I get three subgraphs with 97 of 128 operations running on the TPU. Without the flag I have one subgraph with 42 of 128 operations mapped. This gives me the following inference times for 5 runs:

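|                 | Run 1   | Run 2   | Run 3   | Run 4   | Run 5   |
|-----------------|---------|---------|---------|---------|---------|
| with -a flag    | 0.0722s | 0.0650s | 0.0642s | 0.0629s | 0.0687s |
| without -a flag | 0.2012s | 0.1675s | 0.1743s | 0.1665s | 0.1702s |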
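
For comparison with the FPS figures quoted earlier in the thread, the five runs can be averaged and converted to frames per second. A small sketch using only the numbers from the table above:

```python
from statistics import mean

# Inference times in seconds, copied from the table above.
with_a = [0.0722, 0.0650, 0.0642, 0.0629, 0.0687]
without_a = [0.2012, 0.1675, 0.1743, 0.1665, 0.1702]

mean_with = mean(with_a)
mean_without = mean(without_a)
speedup = mean_without / mean_with

print(f"with -a:    {mean_with * 1000:.1f} ms per frame, {1 / mean_with:.1f} FPS")
print(f"without -a: {mean_without * 1000:.1f} ms per frame, {1 / mean_without:.1f} FPS")
print(f"speedup from -a: {speedup:.2f}x")
```

On these numbers the -a build is roughly 2.6 times faster on average, which is the opposite direction from the 11 ~ 12 vs 9 ~ 10 FPS result reported just before. The two measurements use different input sizes and setups, so they are not directly comparable.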

hhk7734 commented 3 years ago

Please make a PR. I'm curious about your code. 😮

paradigmn commented 3 years ago

Alright, I'm on it. It will take a bit though, because I want to write a sanity test beforehand. So far I have only tested the inference time with random input data. I want to see if the model with quantized inputs and outputs is able to make useful predictions on real data (kite.jpg).

farhantandia commented 3 years ago

> The yolov4-tiny full int8 model is shown below. The red area is the part that cannot be converted.

How do I convert from tiny TensorFlow to tflite int8? I followed the tutorial from your repo's issue to convert to int8, but compiling for the Edge TPU failed with "Model not quantized". Thanks.

hhk7734 commented 3 years ago

@farhantandia Did you follow this? https://wiki.loliot.net/docs/lang/python/libraries/yolov4/python-yolov4-edge-tpu

farhantandia commented 3 years ago

@hhk7734 Oh I see, so it requires downloading the val2017 images first, right?

hhk7734 commented 3 years ago

yep, for post training. What is your target? mobile?

farhantandia commented 3 years ago

I am trying to implement it on a Raspberry Pi 4 with Coral. When I try to convert, this error occurs:

dataset = YOLODataset(
  File "/home/farhan/.local/lib/python3.8/site-packages/yolov4/tf/dataset/keras_sequence.py", line 52, in __init__
    self.dataset = parse_dataset(
  File "/home/farhan/.local/lib/python3.8/site-packages/yolov4/common/parser.py", line 219, in parse_dataset
    raise RuntimeError(
RuntimeError: parse_dataset: 'center_x', 'center_y', 'width', and 'height' are between 0.0 and 1.0.

What is the problem?

I downloaded the val2017 dataset from the COCO website and val2017.txt from the repository.

hhk7734 commented 3 years ago

'center_x', 'center_y', 'width', and 'height' should be between 0.0 and 1.0. I just tested it and there doesn't seem to be any problem.
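
When this error shows up, a quick way to locate the offending annotation is to check each box against the 0.0 to 1.0 range yourself. This is an illustrative sketch only, not the library's `parse_dataset` logic; the function name `out_of_range_fields` is hypothetical:

```python
def out_of_range_fields(center_x, center_y, width, height):
    """Return the names of the box fields that fall outside [0.0, 1.0]."""
    fields = {
        "center_x": center_x,
        "center_y": center_y,
        "width": width,
        "height": height,
    }
    return [name for name, value in fields.items() if not 0.0 <= value <= 1.0]


# A normalized box passes; a box given in pixel units is flagged.
print(out_of_range_fields(0.5, 0.5, 0.2, 0.3))
print(out_of_range_fields(320, 0.5, 0.2, 1.2))
```

An empty list means the box is fine; any names returned point at values that are probably in pixels rather than normalized to the image size.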

farhantandia commented 3 years ago

Hey, it was actually just a directory issue. It works now, thank you :D

farhantandia commented 3 years ago

How did you get the tiny-relu version? Did you just change the activation "leaky" to "relu"? I have an issue running video detection using this code:

import cv2

from yolov4.tflite import YOLOv4

yolo = YOLOv4()

yolo.config.parse_names("dataset/coco.names")
yolo.config.parse_cfg("config/yolov4-tiny-relu-tpu.cfg")

yolo.summary()

yolo.load_tflite("yolov4-tiny-relu-int8_edgetpu.tflite")

yolo.inference(
    "road.mp4",
    is_image=False,
    cv_apiPreference=cv2.CAP_V4L2,
    cv_frame_size=(640, 480),
    cv_fourcc="YUYV",
)

It just pops up a blank cv2 window, but for an image it works fine.

hhk7734 commented 3 years ago

yolo.inference("road.mp4", is_image=False)

Yes, I just changed leaky to relu. But it is not well trained yet. I plan to do transfer learning after backbone training.

hhk7734 commented 3 years ago

Model

EdgeTPU Ops: https://coral.ai/docs/edgetpu/models-intro/#supported-operations

yolov4-tiny

image -> conv2d -> ... -> conv2d -> yolo_0
                   ... -> conv2d -> yolo_1

yolo layer

input
x, y, w, h, o, c0, c1, ...

output
(scale * logistic(x) - 0.5 * (scale - 1) + cx) / grid_width,
(scale * logistic(y) - 0.5 * (scale - 1) + cy) / grid_height,
prior * exp(w) / net_width
prior * exp(h) / net_height
logistic(o)
logistic(c0)
logistic(c1)
...

prior == anchor == biases
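
The yolo layer formulas above can be written out for a single grid cell and a single anchor. This is a plain-Python sketch of those formulas only, not the repository's implementation; the function name, argument names, and the example values (grid 13x13, anchor 81x82, net 416x416, scale 1.05) are chosen for illustration:

```python
import math


def logistic(v):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_cell(raw, cx, cy, grid_width, grid_height,
                prior_w, prior_h, net_width, net_height, scale=1.0):
    """Apply the yolo layer formulas to one raw vector [x, y, w, h, o, c0, ...].

    cx, cy are the integer cell offsets; prior_w, prior_h are the anchor
    size in pixels. Returns normalized [bx, by, bw, bh, objectness, classes...].
    """
    x, y, w, h, o, *classes = raw
    bx = (scale * logistic(x) - 0.5 * (scale - 1) + cx) / grid_width
    by = (scale * logistic(y) - 0.5 * (scale - 1) + cy) / grid_height
    bw = prior_w * math.exp(w) / net_width
    bh = prior_h * math.exp(h) / net_height
    return [bx, by, bw, bh, logistic(o)] + [logistic(c) for c in classes]


# All-zero raw output: the box sits at the cell centre with the anchor's size.
out = decode_cell([0.0] * 7, cx=3, cy=4, grid_width=13, grid_height=13,
                  prior_w=81, prior_h=82, net_width=416, net_height=416,
                  scale=1.05)
print(out)
```

With all-zero inputs every logistic term is 0.5, so the centre lands at (3.5 / 13, 4.5 / 13) and the width and height equal the anchor divided by the network size, which is an easy sanity check when porting the head to numpy.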

EdgeTPU

In the current situation, not all layers are mapped to the TPU, because of SPLIT_V, EXP, ... Even if you could map all of them, stacking too many layers loses too much information at 8-bit precision.
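
To get a feel for the 8-bit precision loss, here is a toy affine quantizer of the kind TFLite uses for int8 tensors (real = scale * (q - zero_point)). The ranges are assumptions picked for illustration: [0, 1] for a logistic output and [0, e^4] for an exp output; the real ranges depend on the calibration data.

```python
import math


def quant_params(lo, hi):
    """Scale and zero point that map the real range [lo, hi] onto int8."""
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Real value -> int8 code, clamped to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale, zero_point):
    """int8 code -> real value."""
    return (q - zero_point) * scale


def round_trip(x, lo, hi):
    """Quantize then dequantize x using the range [lo, hi]."""
    scale, zero_point = quant_params(lo, hi)
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


# Logistic output: the range is only [0, 1], so the step size is small.
p = logistic_of_one = 1.0 / (1.0 + math.exp(-1.0))
p_err = abs(round_trip(p, 0.0, 1.0) - p)

# exp output: one tensor has to cover small and large values at once.
small = math.exp(-2.0)
small_rel_err = abs(round_trip(small, 0.0, math.exp(4.0)) - small) / small

print(f"logistic: absolute error {p_err:.4f}")
print(f"exp:      relative error {small_rel_err:.0%} for exp(-2)")
```

The logistic value survives almost unchanged, while a small exp value in a wide-range tensor is off by a large fraction of itself. That is one way to read the choice of keeping only logistic on the TPU and leaving the rest of the yolo layer to the CPU.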

We have to choose whether to change the model so that it can use TPU more or to give up some and run it on the CPU. This can be a question of whether you choose speed or precision.

When using TPU, I removed all operations from yolo except logistic.

Converted model

[image: model]

Identity - x0
Identity_1 - logistic(x0)
Identity_2 - x1
Identity_3 - logistic(x1)

FPS test

Model only

input shape (1, 416, 416, 3)

In [9]: def model(x):
   ...:     yolo._interpreter.set_tensor(yolo._input_details["index"], x)
   ...:     yolo._interpreter.invoke()
   ...:     # [yolo0, yolo1, ...]
   ...:     # yolo == Dim(1, height, width, channels)
   ...:     # yolo_tpu == x, logistic(x)
   ...:
   ...:     return [
   ...:         yolo._interpreter.get_tensor(output_detail["index"])
   ...:         for output_detail in yolo._output_details
   ...:     ]
   ...:

In [10]: 100/timeit.timeit(lambda: model(x), number=100)
Out[10]: 31.288735650288498
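
The same measurement pattern works for any callable. A self-contained sketch with a stand-in workload in place of the interpreter call; `measure_fps`, `ms_per_frame`, and `dummy_model` are names invented here:

```python
import timeit


def measure_fps(fn, number=100):
    """Call fn `number` times and return the average calls per second."""
    return number / timeit.timeit(fn, number=number)


def ms_per_frame(fps):
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


def dummy_model():
    """Stand-in workload; replace with the real inference call."""
    return sum(i * i for i in range(1000))


fps = measure_fps(dummy_model, number=100)
print(f"dummy workload: {fps:.0f} calls per second")

# The model-only figure reported above, expressed as latency.
print(f"{ms_per_frame(31.288735650288498):.2f} ms per frame")
```

Read as latency, the 31.29 FPS above is about 32 ms per invocation.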

model + scale_x_y + copy x[..., wh] to logistic(x)[..., wh]

https://github.com/hhk7734/tensorflow-yolov4/blob/b67ca4557d3073d59bd8bb8c5876243ea29308f9/py_src/yolov4/tflite/__init__.py#L97-L129

input shape (1, 416, 416, 3)

In [14]: 100/timeit.timeit(lambda: yolo._predict(x), number=100)                                                         
Out[14]: 30.969583151128262

resize -> ... -> diounms

https://github.com/hhk7734/tensorflow-yolov4/blob/b67ca4557d3073d59bd8bb8c5876243ea29308f9/py_src/yolov4/common/base_class.py#L189-L191

yolo.predict(x, prob_thresh) does: resize image -> _predict -> diounms -> fit the predicted bboxes to the original image shape.

probability thresh 25%
image shape (640, 480, 3)
input shape (1, 416, 416, 3)

24 ~ 29 FPS depending on the number of objects found.
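
The dependence on the number of objects comes mostly from the NMS step, whose cost grows with the number of candidate boxes. For reference, here is a compact, class-agnostic sketch of DIoU-NMS (IoU minus the normalized squared distance between box centres). It is not the repository's `diounms` code; the box layout (center_x, center_y, width, height, score) and the 0.3 threshold are assumptions for illustration:

```python
def corners(box):
    """Convert (center_x, center_y, width, height, ...) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box[:4]
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def diou(a, b):
    """Distance-IoU of two boxes: IoU - centre_distance^2 / enclosing_diagonal^2."""
    ax1, ay1, ax2, ay2 = corners(a)
    bx1, by1, bx2, by2 = corners(b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    iou = inter / union if union > 0 else 0.0
    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    diag2 = enc_w ** 2 + enc_h ** 2
    dist2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return iou - (dist2 / diag2 if diag2 > 0 else 0.0)


def diou_nms(boxes, thresh=0.3):
    """Keep boxes in descending score order, dropping those that overlap a kept box."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(diou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept


boxes = [
    (0.50, 0.50, 0.20, 0.20, 0.9),  # best-scoring box
    (0.51, 0.50, 0.20, 0.20, 0.8),  # near-duplicate of the first
    (0.20, 0.20, 0.10, 0.10, 0.7),  # separate object
]
kept = diou_nms(boxes)
print(kept)
```

Here the near-duplicate is suppressed and the other two boxes survive. The pairwise comparison is quadratic in the number of candidates in the worst case, which fits the observed FPS spread.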

Plan

Train yolov4-tiny-relu and yolov4-tiny-relu-new_coords on darknet to get AP50 35%~ (coco val2017)

farhantandia commented 3 years ago

@hhk7734 Which TPU are you using?

hhk7734 commented 3 years ago

@farhantandia Coral dev board
