hhk7734 / tensorflow-yolov4

YOLOv4 Implemented in Tensorflow 2.
MIT License

EdgeTPU Compiler: Internal compiler error. Aborting. #49

Open keesschollaart81 opened 3 years ago

keesschollaart81 commented 3 years ago

First of all, thanks a lot for the great work/package/documentation!

I trained a YoloV4 Tiny model using this script:

from tensorflow.keras import callbacks, optimizers
from yolov4.tf import SaveWeightsCallback, YOLOv4
import time

yolo = YOLOv4(tiny=True)
yolo.classes = "my-project/my-classes.names"
yolo.input_size = 608
yolo.batch_size = 32

yolo.make_model(activation1="relu")
yolo.load_weights(
    "my-project/yolov4-tiny.conv.29",
    weights_type="yolo"
)

train_data_set = yolo.load_dataset(
    "my-project/train.txt",
    image_path_prefix="",
    label_smoothing=0.05
)
val_data_set = yolo.load_dataset(
    "my-project/val.txt",
    image_path_prefix="",
    training=False
)

epochs = 400
lr = 1e-4

optimizer = optimizers.Adam(learning_rate=lr)
yolo.compile(optimizer=optimizer, loss_iou_type="ciou")

def lr_scheduler(epoch):
    if epoch < int(epochs * 0.5):
        return lr
    if epoch < int(epochs * 0.8):
        return lr * 0.5
    if epoch < int(epochs * 0.9):
        return lr * 0.1
    return lr * 0.01

_callbacks = [
    callbacks.LearningRateScheduler(lr_scheduler),
    callbacks.TerminateOnNaN(),
    callbacks.TensorBoard(
        log_dir="my-project/logs",
    ),
    SaveWeightsCallback(
        yolo=yolo, dir_path="my-project/logs/trained",
        weights_type="yolo", epoch_per_save=10
    ),
]

yolo.fit(
    train_data_set,
    epochs=epochs,
    callbacks=_callbacks,
    validation_data=val_data_set,
    validation_steps=50,
    validation_freq=5,
    steps_per_epoch=100,
)
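
As a quick check of the step schedule in the script above, the sketch below restates `lr_scheduler` with the same `epochs = 400` and `lr = 1e-4` and lists the epochs at which the returned rate changes. The helper `schedule_boundaries` is a made-up name for this illustration; it is not part of yolov4 or Keras.

```python
epochs = 400
lr = 1e-4


def lr_scheduler(epoch):
    # Same step schedule as in the training script above.
    if epoch < int(epochs * 0.5):
        return lr
    if epoch < int(epochs * 0.8):
        return lr * 0.5
    if epoch < int(epochs * 0.9):
        return lr * 0.1
    return lr * 0.01


def schedule_boundaries(total_epochs):
    """Return the epochs at which the scheduled learning rate changes."""
    return [
        epoch
        for epoch in range(1, total_epochs)
        if lr_scheduler(epoch) != lr_scheduler(epoch - 1)
    ]


if __name__ == "__main__":
    print(schedule_boundaries(epochs))  # [200, 320, 360]
```

With these settings the rate stays at its initial value until epoch 200, then drops at epochs 200, 320 and 360.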

Using the produced weights file, I verified that this model works, looking very promising!

Then I converted to Quantized TFLite using this script:

from yolov4.tf import YOLOv4

yolo = YOLOv4(tiny=True, tpu=True)

yolo.classes = "my-project/coco.names"
yolo.input_size = (608, 608) # width, height
yolo.make_model(activation1="relu")

yolo.load_weights("my-project/yolov4-tiny-40.weights", weights_type="yolo")

dataset = yolo.load_dataset(
    "my-project/train.txt",
    training=False,
    image_path_prefix="JPEGImages/"
)

yolo.save_as_tflite(
    "my-project/yolov4-tiny-relu-int8.tflite",
    quantization="full_int8",
    data_set=dataset,
    num_calibration_steps=400
)

The produced TF Lite model seems fine: yolov4-tiny-relu-int8.tflite.zip

But the EdgeTPU Compiler failed with:

Edge TPU Compiler version 15.0.340273435

Internal compiler error. Aborting!

I tried lower input resolutions (at TF Lite conversion time), with no success. I also tried the conversion with both the 2.0.1 release on PyPI and the current master branch. Any clue as to what I might be missing?

hhk7734 commented 3 years ago

Try master branch

python3 -m pip install -U --no-cache-dir yolov4

Ref: #45

keesschollaart81 commented 3 years ago

Ok.. I was building a Colab with a repro, then... It suddenly worked. After looking surprised at my screen for a few minutes and some more experiments to verify my finding, it seems that I must have missed something. I'm pretty sure I already tested the latest bits from master in combination with TF2.4. It must have something to do with the training as only weights from my new environment can be converted. I'll experiment a bit more tomorrow/next week but for now we can consider this an issue between my chair and my screen ;-)

Again, great work, happy to be able to use this model on my edgetpu!

keesschollaart81 commented 3 years ago

I still run into this issue. Now that I have a working setup to compare with, I'm able to narrow down my original issue.

Given the train.py setup in the first post, I'm able to convert the first weights file (yolov4-tiny-10.weights) to TF Lite and then to Edge TPU. However, when I run the same conversion for a later weights file (like yolov4-tiny-70.weights) from the same training run, I can convert it to TF Lite but not to Edge TPU. This is weird, right? Looking at the generated TF Lite models, I see a slight difference in the net that might give a clue.

[image]

If you want to take a look yourself: weights and tflites

Result of the EdgeTPU Compiler for the first weights file:

Edge TPU Compiler version 15.0.340273435

Model compiled successfully in 1495 ms.

Input model: /content/drive/MyDrive/yolov4-test/aml-new-run-10.tflite
Input size: 5.76MiB
Output model: aml-new-run-10_edgetpu.tflite
Output size: 6.13MiB
On-chip memory used for caching model parameters: 5.69MiB
On-chip memory remaining for caching model parameters: 448.75KiB
Off-chip memory used for streaming uncached model parameters: 5.12KiB
Number of Edge TPU subgraphs: 2
Total number of operations: 132
Operation log: aml-new-run-10_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 102
Number of operations that will run on CPU: 30

Operator                       Count      Status

DEQUANTIZE                     6          Operation is working on an unsupported data type
CONV_2D                        21         Mapped to Edge TPU
QUANTIZE                       27         Mapped to Edge TPU
QUANTIZE                       6          Operation is otherwise supported, but not mapped due to some unspecified limitation
MAX_POOL_2D                    3          Mapped to Edge TPU
ADD                            6          Mapped to Edge TPU
EXP                            6          Operation is working on an unsupported data type
PAD                            2          Mapped to Edge TPU
SPLIT_V                        12         Operation not supported
SUB                            6          Mapped to Edge TPU
MUL                            18         Mapped to Edge TPU
LOGISTIC                       2          Mapped to Edge TPU
SPLIT                          7          Mapped to Edge TPU
CONCATENATION                  9          Mapped to Edge TPU
RESIZE_BILINEAR                1          Mapped to Edge TPU
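
To cross-check the operation table against the summary lines, here is a minimal sketch that parses the rows and adds up the counts per target. It assumes every row has the form "name, count, status" and that only the exact status "Mapped to Edge TPU" means the op runs on the accelerator. The function name `tally` is chosen for this example only.

```python
import re

# Rows copied from the compiler output above.
OPERATION_LOG = """\
Operator                       Count      Status

DEQUANTIZE                     6          Operation is working on an unsupported data type
CONV_2D                        21         Mapped to Edge TPU
QUANTIZE                       27         Mapped to Edge TPU
QUANTIZE                       6          Operation is otherwise supported, but not mapped due to some unspecified limitation
MAX_POOL_2D                    3          Mapped to Edge TPU
ADD                            6          Mapped to Edge TPU
EXP                            6          Operation is working on an unsupported data type
PAD                            2          Mapped to Edge TPU
SPLIT_V                        12         Operation not supported
SUB                            6          Mapped to Edge TPU
MUL                            18         Mapped to Edge TPU
LOGISTIC                       2          Mapped to Edge TPU
SPLIT                          7          Mapped to Edge TPU
CONCATENATION                  9          Mapped to Edge TPU
RESIZE_BILINEAR                1          Mapped to Edge TPU
"""

# An upper-case operator name, a count, then the free-text status.
ROW = re.compile(r"^(?P<op>[A-Z0-9_]+)\s+(?P<count>\d+)\s+(?P<status>.+)$")


def tally(log_text):
    """Sum operation counts, split by whether a row is mapped to the Edge TPU."""
    totals = {"edgetpu": 0, "cpu": 0}
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header or blank line
        target = "edgetpu" if match["status"] == "Mapped to Edge TPU" else "cpu"
        totals[target] += int(match["count"])
    return totals


if __name__ == "__main__":
    print(tally(OPERATION_LOG))  # {'edgetpu': 102, 'cpu': 30}
```

The totals agree with the compiler's own summary: 102 operations on the Edge TPU and 30 on the CPU, 132 in total. The CPU share is made up of the DEQUANTIZE, EXP and SPLIT_V rows plus the six unmapped QUANTIZE ops.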

hhk7734 commented 3 years ago

tensorflow v2.4.0, yolov4 v2.0.2, edgetpu_compiler 15.0.340273435

Hmm... I tested my model, aml-new-run-10, and aml-new-run-70. Only aml-new-run-70 failed to compile. I don't know what is wrong.

hhk7734 commented 3 years ago

Plus

When I convert the model without loading the weights file and then compile it, the same error occurs.

keesschollaart81 commented 3 years ago

My guess would be that it has to do with the conversion to quantized TF Lite: the converter adds a quantization step in a place that the Edge TPU compiler struggles with. When the weights shift during training, the converter might choose to place these steps elsewhere? I don't know, I'm definitely not an expert! :-)

What would you say, should I raise this as an issue for the TF Lite converter or the Edge TPU Compiler, or is this something that needs to be fixed here?
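
One rough way to test the guess above is to compare how many tensors of each data type the two converted models contain. An extra float32 tensor in one of them would point to an additional quantize/dequantize boundary. The sketch below assumes TensorFlow is installed and uses `tf.lite.Interpreter.get_tensor_details()`. The helper names and the file paths in the usage comment are assumptions made for this illustration.

```python
from collections import Counter


def load_tensor_details(tflite_path):
    """Read tensor metadata from a .tflite file (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helpers below work without it

    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    return interpreter.get_tensor_details()


def dtype_histogram(tensor_details):
    """Count tensors per dtype name."""
    return Counter(
        getattr(detail["dtype"], "__name__", str(detail["dtype"]))
        for detail in tensor_details
    )


def histogram_diff(details_a, details_b):
    """Return {dtype: (count_a, count_b)} for dtypes whose counts differ."""
    hist_a, hist_b = dtype_histogram(details_a), dtype_histogram(details_b)
    return {
        name: (hist_a[name], hist_b[name])
        for name in sorted(set(hist_a) | set(hist_b))
        if hist_a[name] != hist_b[name]
    }


# Usage (paths are assumptions; point them at the two converted models):
# diff = histogram_diff(
#     load_tensor_details("aml-new-run-10.tflite"),
#     load_tensor_details("aml-new-run-70.tflite"),
# )
# print(diff)
```

This is a coarse comparison: an empty result does not prove the two graphs are identical, and a non-empty one only narrows down where to look in a graph viewer.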

hhk7734 commented 3 years ago

TensorFlow 2 changes fast, and there are some incompatibilities even between minor versions. I don't know whether it's a converter problem or a model setup problem, so I don't know what to say.

If you use TensorFlow==v2.2.1 and yolov4<=v2.0.1, perhaps, this error doesn't occur. But accuracy is poor.

keesschollaart81 commented 3 years ago

> If you use TensorFlow==v2.2.1 and yolov4<=v2.0.1, perhaps, this error doesn't occur. But accuracy is poor.

I'll give that a try in ~24hr from now!

I also saw that @paradigmn disabled the experimental_new_converter because of a tf.exp() issue (https://github.com/hhk7734/tensorflow-yolov4/issues/20#issuecomment-742047875). Is that tracked somewhere? I have some hope that the new converter does a better job. I'll also give that a try with tf-nightly, who knows.

hhk7734 commented 3 years ago

I am trying to change the head of the model (which includes tf.exp), but I am not sure whether the head change will be finished before the converter supports it.