Open hhk7734 opened 4 years ago
Several tests are in progress on the TPU. I will comment 1~2 days later.
I should have opened a new issue =) Perfect, I am interested in the results)
The yolov4-tiny full int8 model is shown below. The red area is the part that cannot be converted.
There was a test script, but I forgot to mention it. https://github.com/hhk7734/tensorflow-yolov4/blob/master/test/make_edgetpu_tflite.ipynb
❯ edgetpu_compiler yolov4-tiny-relu.tflite
Edge TPU Compiler version 14.1.317412892
Model compiled successfully in 764 ms.
Input model: yolov4-tiny-relu.tflite
Input size: 5.96MiB
Output model: yolov4-tiny-relu_edgetpu.tflite
Output size: 6.19MiB
On-chip memory used for caching model parameters: 5.92MiB
On-chip memory remaining for caching model parameters: 208.25KiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 150
Operation log: yolov4-tiny-relu_edgetpu.log
Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 51
Number of operations that will run on CPU: 99
See the operation log file for individual operation details.
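For a quick read of the summary above, the share of operations placed on the Edge TPU can be computed from the two count lines. This is a minimal sketch that assumes the compiler output has been pasted into a string; `mapped_fraction` is a hypothetical helper, not part of `edgetpu_compiler`.

```python
import re

# Lines copied from the compiler summary above.
SUMMARY = """\
Total number of operations: 150
Number of operations that will run on Edge TPU: 51
Number of operations that will run on CPU: 99
"""


def mapped_fraction(summary: str) -> float:
    """Return the share of operations the compiler placed on the Edge TPU."""
    tpu = int(re.search(r"run on Edge TPU: (\d+)", summary).group(1))
    cpu = int(re.search(r"run on CPU: (\d+)", summary).group(1))
    return tpu / (tpu + cpu)


fraction = mapped_fraction(SUMMARY)
print(f"{fraction:.0%} of operations run on the Edge TPU")
```

Here only about a third of the operations (51 of 150) end up on the Edge TPU.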
TF 2 is not yet stable. Depending on the version, it may or may not be converted. A high version doesn't mean it works.
Thanks for the detailed response =) Any possible reasons why the yolo heads don't get mapped to the TPU? I don't see any unsupported op here.
Also, a good thing to keep in mind is the TPU slow-down when there is Off-chip memory used (some experiments). Do you think this can happen once the heads (everything in red circle) get mapped to tpu?
In my test, Add, Sub, and Mul are each supported, but if there are more than three(? or four) consecutive operations, the parts are not converted.
Model: "YOLOv4Tiny"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
CSPDarknet53Tiny (CSPDarknet ((None, 38, 38, 256), (No 3633632
_________________________________________________________________
PANetTiny (PANetTiny) ((None, 38, 38, 255), (No 2429182
_________________________________________________________________
YOLOv3HeadTiny (YOLOv3HeadTi ((None, 38, 38, 255), (No 0
=================================================================
Total params: 6,062,814
Trainable params: 6,056,606
Non-trainable params: 6,208
_________________________________________________________________
yolov4-tiny has 6M params. The head contains constant matrices and (I think) is therefore not included in the summary. I'll look at it :). This is really useful information. Thanks.
I've been doing lots of tests also with yolov4-tiny on Google Coral and the best solution is to split the model in two:
- Darknet and Features
- Features to Boxes
The first part is easily convertible to tflite->edgetpu, but the second part ALWAYS has problems (unmapped to TPU, unsupported ops, etc).
In the end I convert the last part (Features to Boxes) to pure vectorized numpy instead of Tensorflow or Tflite, and run that on CPU. I get an inference time of ~25ms, getting 40fps with this method.
I can share code if you want to take a look
WOW :astonished:
I tried that too, but the Darknet and Features part was 50ms~.
What's your TF and edgetpu_compiler version????
Your results are really impressive. I want it!!!
Hi, First of all, thank you for your great efforts. Can you share 'yolov4-tiny-relu.weights'? And, when creating this weight, what is the difference from the 'yolov4-tiny.weights'? How can I make a yolov4-tiny-relu.weights from yolov4-tiny.weights?
@tgx-lim
Since model is being trained, the mAP score is still below expectations.
from yolov4.tf import YOLOv4
yolo = YOLOv4(tiny=True)
yolo.classes = "dataset/coco.names"
yolo.make_model(activation1="relu")
yolo.load_weights("yolov4-tiny-relu.weights", weights_type="yolo")
yolo.inference("image.png")
yolov4-tiny-relu uses relu instead of ~mish~ leaky-relu. If you want to get a weights file, you should train the model.
@hhk7734
Did you also think about using relu6 instead of relu?
Although I'm not sure of the benefits of using one over the other. Because it looks like it is better for later quantization but on the other hand the EfficientNets EdgeTPU uses normal ReLU.
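For reference, the three activations being compared differ only in how they clip or scale their input. A minimal NumPy sketch follows; the leaky slope of 0.1 is an assumption (the value commonly used by darknet's `leaky` activation), and the function names are illustrative only.

```python
import numpy as np


def relu(x):
    """max(x, 0): unbounded above."""
    return np.maximum(x, 0.0)


def relu6(x):
    """min(max(x, 0), 6): output range is fixed to [0, 6]."""
    return np.minimum(np.maximum(x, 0.0), 6.0)


def leaky_relu(x, alpha=0.1):
    """x for x > 0, alpha * x otherwise (alpha is an assumed slope)."""
    return np.where(x > 0, x, alpha * x)


x = np.array([-2.0, 0.5, 8.0])
print(relu(x))
print(relu6(x))
print(leaky_relu(x))
```

The fixed [0, 6] range of relu6 is the usual argument for it being friendlier to 8-bit quantization, since the quantizer does not have to estimate an upper bound. Whether that helps accuracy for this model would still need to be measured.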
I will share my code during the weekend!
TF is tf_nightly-2.2.0.dev20200422-cp36-cp36m-manylinux2010_x86_64.
Edgetpu_compiler is the latest version.
I use google colab to convert the model
@ankandrew I'm not sure what's better. I'll try relu6 after finishing relu test. :)
I have experienced significant drops in mAP if relu-6 is used. However, this result was from TinyYOLOv3.
Hi @agjunyent, amazing work! How many supported operations did you manage to get? can I have a look at your code?
Hi @agjunyent, I'm really impressed by the performances that you get. I would be very interested to have a look at your code.
@agjunyent as everyone said here, likewise, I'm waiting for your code eagerly.
I'll organize and share the code during this week!!!
Hey, I know no one likes bumps that much but is there any progress? I'm really looking forward to your code ! @agjunyent
Hey! Sorry for the delay. Been a busy week...
So I'll try to explain how I do it to get around 25ms inference time.
First of all, I train the model in google colab (just because my PC cannot do it) using the code from the .zip file I attach here train_inference_yolo.zip
Inside here you will see some files, but the most important are:
Try to download the files and play with them. I've tested them a bit, but not thoroughly, so expect some bugs. If any question just ask!
@agjunyent Thank you for sharing! I'm trying to use train.py with pre-trained relu weights linked in this repo without success. Do you know the steps to do this?
Let me try today to use the weights linked on this repo. I modified the code quite a bit, both training and inference, to have 100% of operations mapped to the TPU, and use numpy vectorization on the ones that could not be mapped, so plain Tensorflow is not used at all.
I'll get back when I have something
Thanks @agjunyent for the files! I have a question. Is there a way to use only the convert.py, with yolov4-tiny weights so that I convert them to .tflite, or I need to do the entire process in order to retrain it? I tried using the convert.py but I could not convert the weights, so I suppose I need to run the training first right? I suppose you're using coco to retrain it?
Thanks for your help once again
TF 2 is not yet stable. Depending on the version, it may or may not be converted. A high version doesn't mean it works.
Which version are you using? It doesn't convert with tf-gpu 2.2:
"RuntimeError: Unsupported output type INT8 for output tensor 'Identity' of type FLOAT32."
@agjunyent
Thanks for sharing your code here.
I tried compiling the model for edgetpu and ended up with the following log file. The compiled model takes a massive 1500ms on edgetpu. I was wondering if you have any pointers, or if you can share some more info about the environment you used to obtain your 25ms performance. I tried multiple tf versions and was only able to successfully compile the model using tf-nightly-2.5.0.
Would appreciate any help here.
Edge TPU Compiler version 14.1.317412892
Input: converted_model.tflite
Output: converted_model_edgetpu.tflite
Operator Count Status
CONV_2D 19 Mapped to Edge TPU
CONV_2D 2 More than one subgraph is not supported
QUANTIZE 9 Mapped to Edge TPU
QUANTIZE 1 Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_NEAREST_NEIGHBOR 1 Operation version not supported
CONCATENATION 6 Mapped to Edge TPU
CONCATENATION 1 More than one subgraph is not supported
PAD 2 Mapped to Edge TPU
MAX_POOL_2D 3 Mapped to Edge TPU
SPLIT 3 Mapped to Edge TPU
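To answer "how many operations were mapped" from an operation log like the one above, the counts can be summed per status. A minimal sketch, assuming each log row is `OPERATOR COUNT STATUS` separated by whitespace, as it appears here; `count_by_status` is a hypothetical helper.

```python
from collections import Counter

# Rows copied from the operation log above.
OP_LOG = """\
CONV_2D 19 Mapped to Edge TPU
CONV_2D 2 More than one subgraph is not supported
QUANTIZE 9 Mapped to Edge TPU
QUANTIZE 1 Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_NEAREST_NEIGHBOR 1 Operation version not supported
CONCATENATION 6 Mapped to Edge TPU
CONCATENATION 1 More than one subgraph is not supported
PAD 2 Mapped to Edge TPU
MAX_POOL_2D 3 Mapped to Edge TPU
SPLIT 3 Mapped to Edge TPU
"""


def count_by_status(log: str) -> Counter:
    """Sum the operator counts for each status string."""
    totals = Counter()
    for line in log.splitlines():
        _operator, count, status = line.split(maxsplit=2)
        totals[status] += int(count)
    return totals


totals = count_by_status(OP_LOG)
mapped = totals["Mapped to Edge TPU"]
print(f"{mapped} of {sum(totals.values())} operations mapped")
for status, count in totals.items():
    print(f"{count:3d}  {status}")
```

For this log that gives 42 of 47 operations mapped, with the remaining five split across three different reasons.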
On my test:
yolov4-tiny-relu with head (224x224): 21ms ~
yolov4-tiny-relu with head (608x608): 132ms ~
I think this model can speed up by finding a few optimization methods, but is there any way to dramatically speed it up as long as we don't change the model itself?
edgeTPU benchmarks: https://coral.ai/docs/edgetpu/benchmarks/
Hi, please correct me if I am wrong, but I consider the current implementation of the edge tpu export to be broken. When using the most recent tensorflow release (2.3.1), a tflite model is exported, but the edgetpu-compiler won't work with it. I noticed this problem is caused by tf switching from their toco tflite converter to a new one. The new converter can't handle the tf.exp() op in the graph.
I was able to get it working by disabling the new converter. This is certainly not a permanent solution, however it works with all tf versions including 2.5-nightly. When using the most recent compiler with the new -a flag, 97 of 128 operations run on the edge tpu.
If desired, I can make a pull request, with the necessary changes. I can add a script for exporting the edge tpu and onnx model too.
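The workaround described above comes down to one converter attribute. Below is a minimal sketch of a full-integer post-training conversion with the old converter selected. It assumes a Keras model object, uses random calibration data in place of real images, and assumes a 608x608 input; `representative_dataset` and `convert_full_int8` are hypothetical names, not code from this repo. Whether `experimental_new_converter = False` is still honored depends on the TensorFlow version, so treat it as specific to the releases discussed in this thread.

```python
import numpy as np


def representative_dataset(num_samples=100, input_shape=(1, 608, 608, 3)):
    """Yield calibration batches. Random data here; use real images in practice."""
    rng = np.random.default_rng(0)
    for _ in range(num_samples):
        yield [rng.random(input_shape, dtype=np.float32)]


def convert_full_int8(keras_model, dataset_fn=representative_dataset):
    """Full-integer quantization with the old (TOCO) converter selected."""
    # Imported lazily so the calibration helper above has no TensorFlow dependency.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = dataset_fn
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    # The workaround from the comment above: fall back to the old converter.
    converter.experimental_new_converter = False
    return converter.convert()
```

The returned bytes would then be written to a `.tflite` file and passed to `edgetpu_compiler`, optionally with `-a`.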
I compiled the tflite with -a.
$ edgetpu_compiler -a yolov4-tiny-relu-int8.tflite
Edge TPU Compiler version 15.0.340273435
Model compiled successfully in 1105 ms.
Input model: yolov4-tiny-relu-int8.tflite
Input size: 5.96MiB
Output model: yolov4-tiny-relu-int8_edgetpu.tflite
Output size: 6.28MiB
On-chip memory used for caching model parameters: 6.06MiB
On-chip memory remaining for caching model parameters: 716.25KiB
Off-chip memory used for streaming uncached model parameters: 3.38KiB
Number of Edge TPU subgraphs: 2
Total number of operations: 149
Operation log: yolov4-tiny-relu-int8_edgetpu.log
Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 98
Number of operations that will run on CPU: 51
See the operation log file for individual operation details.
Compared to without -a, it runs 48 more on the Edge TPU. But slower than before.
input_size = (512, 384)
without -a, FPS: 11 ~ 12
with -a, FPS: 9 ~ 10
My results are quite the opposite. I used the tiny_yolov4 with relu activation and weights provided by your repo. The input tensor has a size of (608, 608, 3). With the -a flag I get three subgraphs with 97 of 128 operations running on the TPU. Without the flag I have one subgraph with 42/128 operations mapped. This gives me the following inference times for 5 runs:
| | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| with -a flag | 0.0722s | 0.0650s | 0.0642s | 0.0629s | 0.0687s |
| without -a flag | 0.2012s | 0.1675s | 0.1743s | 0.1665s | 0.1702s |
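Averaging the five runs makes the gap easier to compare with the FPS figures quoted elsewhere in the thread. A small sketch using only the numbers from the table:

```python
from statistics import mean

# Inference times in seconds, copied from the table above.
with_a = [0.0722, 0.0650, 0.0642, 0.0629, 0.0687]
without_a = [0.2012, 0.1675, 0.1743, 0.1665, 0.1702]

mean_with = mean(with_a)
mean_without = mean(without_a)
fps_with = 1.0 / mean_with
fps_without = 1.0 / mean_without
speedup = mean_without / mean_with

print(f"with -a:    {mean_with * 1000:.1f} ms  ({fps_with:.1f} FPS)")
print(f"without -a: {mean_without * 1000:.1f} ms  ({fps_without:.1f} FPS)")
print(f"speedup:    {speedup:.2f}x")
```

That is roughly 67 ms against 176 ms per inference, or about 15 FPS against under 6 FPS, for this 608x608 input.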
Please make PR. I'm curious about your code. :open_mouth:
Alright, I'm on it. It will take a bit though, because I want to write a sanity test beforehand. At this point I have only tested the inference time with random input data. I want to see if the model with quantized inputs and outputs is able to make useful predictions on real data (kite.jpg).
How do I convert the tiny TensorFlow model to tflite int8? I followed the tutorial from your repo's issue to convert to int8, but compiling for the edge tpu failed with "Model not quantized". Thanks.
@farhantandia Did you follow this? https://wiki.loliot.net/docs/lang/python/libraries/yolov4/python-yolov4-edge-tpu
@hhk7734 Oh, I see. So it requires downloading the val2017 images first, right?
yep, for post training. What is your target? mobile?
I am trying to implement it on a raspi 4 with coral. When I try to convert, this error occurs:
dataset = YOLODataset(
File "/home/farhan/.local/lib/python3.8/site-packages/yolov4/tf/dataset/keras_sequence.py", line 52, in __init__
self.dataset = parse_dataset(
File "/home/farhan/.local/lib/python3.8/site-packages/yolov4/common/parser.py", line 219, in parse_dataset
raise RuntimeError(
RuntimeError: parse_dataset: 'center_x', 'center_y', 'width', and 'height' are between 0.0 and 1.0.
What is the problem?
I downloaded the val2017 dataset from the coco website and val2017.txt from the repository.
'center_x', 'center_y', 'width', and 'height' should be between 0.0 and 1.0. I just tested it and it doesn't seem to be any problem.
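The check in the traceback expects every box as normalized center coordinates. As an illustration only (this is not the `yolov4` package's parser), the sketch below converts a COCO-style pixel box `[x_min, y_min, width, height]` to that form and tests the 0.0 to 1.0 condition; `coco_to_yolo` and `is_normalized` are hypothetical helpers.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """Convert a COCO [x_min, y_min, width, height] pixel box to
    normalized (center_x, center_y, width, height)."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / image_width,
        (y_min + h / 2) / image_height,
        w / image_width,
        h / image_height,
    )


def is_normalized(box):
    """True when all four values lie between 0.0 and 1.0."""
    return all(0.0 <= value <= 1.0 for value in box)


box = coco_to_yolo([100, 50, 200, 100], image_width=640, image_height=480)
print(box, is_normalized(box))

# A box left in pixel units fails the same check.
print(is_normalized((320, 240, 200, 100)))
```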
Hey, it was actually just a directory issue. It works now, thank you :D
How did you get the tiny-relu version? Did you just change the activation "leaky" to "relu"? I have an issue running video detection with this code:
import cv2
from yolov4.tflite import YOLOv4
yolo = YOLOv4()
yolo.config.parse_names("dataset/coco.names")
yolo.config.parse_cfg("config/yolov4-tiny-relu-tpu.cfg")
yolo.summary()
yolo.load_tflite("yolov4-tiny-relu-int8_edgetpu.tflite")
yolo.inference(
"road.mp4",
is_image=False,
cv_apiPreference=cv2.CAP_V4L2,
cv_frame_size=(640, 480),
cv_fourcc="YUYV",
)
It just pops up a blank cv2 window, but for an image it works fine.
yolo.inference("road.mp4", is_image=False)
Yes, I just changed leaky to relu. But it is not well trained yet. I plan to do transfer learning after backbone training.
EdgeTPU Ops: https://coral.ai/docs/edgetpu/models-intro/#supported-operations
yolov4-tiny
image -> conv2d -> ... -> conv2d -> yolo_0
... -> conv2d -> yolo_1
yolo layer
input
x, y, w, h, o, c0, c1, ...
output
(scale * logistic(x) - 0.5 * (scale - 1) + cx) / grid_width,
(scale * logistic(y) - 0.5 * (scale - 1) + cy) / grid_height,
prior * exp(w) / net_width,
prior * exp(h) / net_height,
logistic(o),
logistic(c0),
logistic(c1),
...
prior == anchor == biases
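The yolo layer formulas above translate directly into vectorized NumPy, which is one way to run this step on the CPU when it cannot be mapped to the TPU. This is a sketch under assumptions, not the code used in this repo: the raw output is assumed to be already reshaped to `(grid_h, grid_w, num_priors, 5 + num_classes)`, `scale=1.05` is assumed as the `scale_x_y` value, and `decode_yolo_layer` is a hypothetical name.

```python
import numpy as np


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_yolo_layer(raw, priors, net_size, scale=1.05):
    """Apply the yolo layer formulas to one raw feature map.

    raw:      (grid_h, grid_w, num_priors, 5 + num_classes), channels x, y, w, h, o, c0, ...
    priors:   (num_priors, 2) anchor (width, height) in network-input pixels
    net_size: (net_width, net_height)
    """
    grid_h, grid_w = raw.shape[:2]
    net_w, net_h = net_size

    # cx, cy are the integer cell offsets; the trailing axis broadcasts over priors.
    cy, cx = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    cx = cx[..., None]
    cy = cy[..., None]

    out = np.empty_like(raw, dtype=np.float32)
    out[..., 0] = (scale * logistic(raw[..., 0]) - 0.5 * (scale - 1) + cx) / grid_w
    out[..., 1] = (scale * logistic(raw[..., 1]) - 0.5 * (scale - 1) + cy) / grid_h
    out[..., 2] = priors[:, 0] * np.exp(raw[..., 2]) / net_w
    out[..., 3] = priors[:, 1] * np.exp(raw[..., 3]) / net_h
    out[..., 4:] = logistic(raw[..., 4:])  # objectness and class scores
    return out


# Tiny example: 2x3 grid, one prior, one class, all raw values zero.
raw = np.zeros((2, 3, 1, 6), dtype=np.float32)
decoded = decode_yolo_layer(raw, priors=np.array([[10.0, 20.0]]), net_size=(96, 64))
print(decoded[1, 2, 0])
```

With all-zero input, each box center lands in the middle of its cell, which is a quick sanity check for the grid offsets.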
In the current situation, not all layers are mapped to the TPU, because of SPLIT_V, EXP, ...
Even if you can map all of them, too many layers lose too much information at 8-bit precision.
We have to choose whether to change the model so that it can use TPU more or to give up some and run it on the CPU. This can be a question of whether you choose speed or precision.
When using TPU, I removed all operations from yolo except logistic.
Identity - x0, Identity_1 - logistic(x0)
Identity_2 - x1, Identity_3 - logistic(x1)
In [9]: def model(x):
   ...:     yolo._interpreter.set_tensor(yolo._input_details["index"], x)
   ...:     yolo._interpreter.invoke()
   ...:     # [yolo0, yolo1, ...]
   ...:     # yolo == Dim(1, height, width, channels)
   ...:     # yolo_tpu == x, logistic(x)
   ...:
   ...:     return [
   ...:         yolo._interpreter.get_tensor(output_detail["index"])
   ...:         for output_detail in yolo._output_details
   ...:     ]
   ...:
In [10]: 100/timeit.timeit(lambda: model(x), number=100)
Out[10]: 31.288735650288498
In [14]: 100/timeit.timeit(lambda: yolo._predict(x), number=100)
Out[14]: 30.969583151128262
yolo.predict(x, prob_thresh) does:
resize image -> _predict -> diounms -> fit pred bbox to original image shape.
24 ~ 29 FPS depending on the number of objects found.
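The FPS depends on the number of detections because the NMS step loops over the candidate boxes. As a rough illustration of that step, here is a class-agnostic DIoU-NMS sketch in NumPy. It is not the library's `diounms` implementation: the `(cx, cy, w, h)` box layout, the 0.3 threshold, and the function names are assumptions, and boxes are assumed to have positive width and height.

```python
import numpy as np


def diou(box, boxes):
    """DIoU between one box and an array of boxes, all as (cx, cy, w, h)."""
    x1, y1 = box[0] - box[2] / 2, box[1] - box[3] / 2
    x2, y2 = box[0] + box[2] / 2, box[1] + box[3] / 2
    bx1, by1 = boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2
    bx2, by2 = boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2

    inter_w = np.clip(np.minimum(x2, bx2) - np.maximum(x1, bx1), 0, None)
    inter_h = np.clip(np.minimum(y2, by2) - np.maximum(y1, by1), 0, None)
    inter = inter_w * inter_h
    union = box[2] * box[3] + boxes[:, 2] * boxes[:, 3] - inter
    iou = inter / union

    # Penalty: squared center distance over squared diagonal of the enclosing box.
    center_dist = (box[0] - boxes[:, 0]) ** 2 + (box[1] - boxes[:, 1]) ** 2
    enclose_w = np.maximum(x2, bx2) - np.minimum(x1, bx1)
    enclose_h = np.maximum(y2, by2) - np.minimum(y1, by1)
    return iou - center_dist / (enclose_w ** 2 + enclose_h ** 2)


def diou_nms(boxes, scores, threshold=0.3):
    """Keep the highest-scoring box, drop boxes whose DIoU with it reaches the threshold, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[diou(boxes[best], boxes[rest]) < threshold]
    return keep


boxes = np.array([
    [0.50, 0.50, 0.20, 0.20],
    [0.52, 0.50, 0.20, 0.20],  # heavy overlap with the first box
    [0.20, 0.20, 0.10, 0.10],  # separate object
])
scores = np.array([0.9, 0.8, 0.7])
kept = diou_nms(boxes, scores)
print(kept)
```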
yolov4-tiny-relu and yolov4-tiny-relu-new_coords on darknet to get AP50 35%~ (coco val2017)
@hhk7734 What TPU are you using?
@farhantandia Coral dev board
@hhk7734 Very interesting to see v4 tiny on Edge TPU. I have two questions
Thanks
Originally posted by @ankandrew in https://github.com/hhk7734/tensorflow-yolov4/issues/4#issuecomment-670947207