Paliking opened this issue 3 years ago
Hi,
can anyone confirm whether you have the same experience with the CenterNet MobileNetV2 512x512?
thank you
I am experiencing a similar problem, although I am using the object detection models with TensorFlow.js: I converted the ssd_mobilenetv2 and centernet_mobilenetv2_fpn models
to tfjs_graph_models (both with uint8 quantization)
and observed the following inference times when running in my PC browser (which uses the GPU):
SSD MobileNetV2: ~120ms
CenterNet MobileNetV2 FPN: ~180ms
On mobile phones it is a lot worse, with roughly a 1.5x~2x further slowdown.
PS: CenterNet MobileNetV2 FPN 512x512 expects an image of shape 320x320, not 512x512.
Maybe there is a mismatch between config and model?
I had a similar experience running the CenterNet MobileNetV2
tflite model on a mobile GPU. My benchmarking tool logged these warnings:
...
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Following operations are not supported by GPU delegate:
CAST: Operation is not supported.
FLOOR_DIV: Operation is not supported.
GATHER_ND: Operation is not supported.
GREATER: Operation is not supported.
LESS: Operation is not supported.
...
95 operations will run on the GPU, and the remaining 41 operations will run on the CPU.
original thread: https://github.com/tensorflow/models/issues/9414#issuecomment-789258818
@srjoglekar246 mentions that these unsupported ops are likely to be small post-processing ops. However, the fact that we all seem to be observing much slower inference times than expected suggests the model may not be taking full advantage of the GPU.
Has anyone else come across these warnings?
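For anyone who wants to reproduce the fallback outside the native benchmark tool, here is a rough Python sketch of attaching the GPU delegate. The delegate library name and the model file name below are assumptions; they depend on your platform and build, and on Android the delegate is normally used through the Java/C++ APIs instead:

```python
import numpy as np
import tensorflow as tf

# Assumption: the shared-library name/path of the GPU delegate depends on how it was built.
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

# When the delegate is applied, unsupported ops are reported in the TFLite log
# and automatically fall back to the CPU.
interpreter = tf.lite.Interpreter(
    model_path="centernet_mobilenetv2_fpn.tflite",  # hypothetical file name
    experimental_delegates=[gpu_delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
```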
@rdutta1999 mentions running on PC/browser, which is not what the TFLite GPU delegate is intended for.
@alexdwu13 What device are you using? Also, could you mention the inference times you observe on CPU vs. GPU?
Hi @srjoglekar246,
I'm observing the same behaviour on different devices, further aggravated by the lack of speedup after post-training quantization (in some circumstances I even observe a slowdown).
I'm mainly focused on the following two models: SSD MobileNet V2 FPNLite 320x320 and CenterNet MobileNetV2 FPN 512x512 (see the configurations referenced below).
I followed these two tutorials for exporting my trained models:
Here are the steps to reproduce:
SSD MobileNet:
python3 ${TF_OD_DIR}/models/research/object_detection/model_main_tf2.py \
--pipeline_config_path workspace/training/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.config \
--model_dir workspace/training/pcn_ssd_mobilenet_v2_fpnlite_320x320/run2 \
--alsologtostderr
CenterNet MobileNet:
python3 ${TF_OD_DIR}/models/research/object_detection/model_main_tf2.py \
--pipeline_config_path workspace/training/centernet_mobilenetv2fpn_512x512_coco17_od.config \
--model_dir workspace/training/pcn_centernet_mobilenetv2fpn_512x512/run0q \
--alsologtostderr
See the attached configurations for details.
SSD MobileNet:
python3 ${TF_OD_DIR}/models/research/object_detection/export_tflite_graph_tf2.py \
--pipeline_config_path workspace/training/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.config \
--trained_checkpoint_dir workspace/training/pcn_ssd_mobilenet_v2_fpnlite_320x320/run0q \
--output_directory workspace/exported-models/pcn_ssd_mobilenet_v2_fpnlite_320x320/run0q/tflite
CenterNet MobileNet:
python3 ${TF_OD_DIR}/models/research/object_detection/export_tflite_graph_tf2.py \
--pipeline_config_path workspace/training/centernet_mobilenetv2fpn_512x512_coco17_od.config \
--trained_checkpoint_dir workspace/training/pcn_centernet_mobilenetv2fpn_512x512/run0q \
--output_directory workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite \
--centernet_include_keypoints=false \
--max_detections=10 \
--config_override=" \
model{ \
center_net { \
image_resizer { \
fixed_shape_resizer { \
height: 320 \
width: 320 \
} \
} \
} \
}"
Once I have the model in the TFLite-friendly SavedModel format, I convert it to the .tflite format in three different ways.
The first is a plain float conversion with the tflite_convert CLI; let's refer to the models exported this way as exp1.
tflite_convert --saved_model_dir ../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model --output_file pcn_centernet_run0q.tflite
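(Equivalently, the same float conversion can be done from Python; a minimal sketch of roughly what the CLI invocation above does:)

```python
import tensorflow as tf

# exp1: plain float conversion, no optimizations (same SavedModel path as above).
converter = tf.lite.TFLiteConverter.from_saved_model(
    "../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model")
tflite_model = converter.convert()
with open("pcn_centernet_run0q.tflite", "wb") as f:
    f.write(tflite_model)
```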
The second uses dynamic-range quantization via the Python converter API; let's refer to the models exported this way as exp2.
#!/usr/bin/env python3
# coding: utf-8
import argparse
# Define and parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument('--modeldir', help='Folder the SavedModel is located in',
required=True)
parser.add_argument('--output', help='Path to output file',
required=True)
args = parser.parse_args()
SAVED_MODEL_DIR = args.modeldir
OUTPUT_MODEL = args.output
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
# Save the model.
with open(OUTPUT_MODEL, 'wb') as f:
    f.write(tflite_quant_model)
I run the script with:
./tflite_convert_quantize.py --modeldir ../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model --output pcn_centernet_run0q_dyn_range.tflite
The third uses full-integer quantization with a representative dataset (keeping float fallback); let's refer to the models exported this way as exp3.
#!/usr/bin/env python3
# coding: utf-8
import random
import numpy as np
from glob import glob
def preprocess(image, height, width):
    if image.dtype != tf.float32:
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    # Resize the image to the specified height and width.
    image = tf.expand_dims(image, 0)
    # tf.image.resize_bilinear is TF1-only; use the compat.v1 alias under TF2.
    image = tf.compat.v1.image.resize_bilinear(image, [height, width],
                                               align_corners=False)
    image = tf.squeeze(image, [0])
    # Scale pixel values from [0, 1] to [-1, 1].
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)
    return image


def representative_dataset():
    files = glob('../../workspace/images/test/*.jpg')
    random.shuffle(files)
    files = files[:128]
    for file in files:
        image = tf.io.read_file(file)
        image = tf.compat.v1.image.decode_jpeg(image)
        image = preprocess(image, 320, 320)
        yield [image]
import argparse
# Define and parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument('--modeldir', help='Folder the SavedModel is located in',
required=True)
parser.add_argument('--output', help='Path to output file',
required=True)
args = parser.parse_args()
SAVED_MODEL_DIR = args.modeldir
OUTPUT_MODEL = args.output
import tensorflow as tf
# Refer to: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_mobile_tf2.md#step-2-convert-to-tflite
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8, tf.lite.OpsSet.TFLITE_BUILTINS]
converter.representative_dataset = representative_dataset
# converter.inference_input_type = tf.uint8  # or tf.int8
# converter.inference_output_type = tf.uint8  # or tf.int8
tflite_quant_model = converter.convert()
# Save the model.
with open(OUTPUT_MODEL, 'wb') as f:
    f.write(tflite_quant_model)
I run the script with:
./tflite_convert_quantize_full_int.py --modeldir ../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model --output pcn_centernet_run0q_full_quant.tflite
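To sanity-check what each conversion actually produced, the tensor types of the resulting .tflite files can be inspected (a rough sketch; the file names follow the commands above):

```python
import tensorflow as tf

for path in ["pcn_centernet_run0q.tflite",              # exp1: float
             "pcn_centernet_run0q_dyn_range.tflite",    # exp2: dynamic-range
             "pcn_centernet_run0q_full_quant.tflite"]:  # exp3: full integer (float fallback)
    interpreter = tf.lite.Interpreter(model_path=path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dtypes = sorted({t["dtype"].__name__ for t in interpreter.get_tensor_details()})
    print(path, "| input:", inp["shape"], inp["dtype"].__name__, "| tensor dtypes:", dtypes)
```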
I then perform the benchmarks using the benchmarking tool.
The results are the following:
| Model | Export | Avg. inference time |
|---|---|---|
| CenterNet | exp1 | 209 ms |
| CenterNet | exp2 | 289 ms |
| CenterNet | exp3 | 211 ms |
| SSD FPNLite | exp1 | 182 ms |
| SSD FPNLite | exp2 | 320 ms |
| SSD FPNLite | exp3 | 115 ms |
The quantized models are slower than the floating-point model. This is not what we expect to happen: we should see a 2x-3x speedup.
Furthermore, according to your documentation, the CenterNet model should be 3x faster than the SSD-based one.
I'm running the models on a custom Arm64 platform, but I can replicate the behaviour on my workstation (Ubuntu 19.10, x86_64) and on my Raspberry Pi 4. Everything runs on the CPU.
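For anyone trying to reproduce these numbers without the native benchmark binary, a rough Python timing loop along these lines should give comparable relative results (just a sketch, not the official tool; num_threads is set to 4 to match the 4 cores mentioned above):

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="pcn_centernet_run0q_dyn_range.tflite",
                                  num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))

interpreter.invoke()  # warm-up run
runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
print("avg inference: %.1f ms" % ((time.perf_counter() - start) / runs * 1000))
```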
@mattdibi Can you try increasing the number of threads being used by the TFLite Interpreter? Also note that our kernels are optimized for the Arm NEON instruction set, which may or may not be available on all the devices you mentioned. In the benchmarks you showed, SSD does indeed perform faster with full-int quantization than on floating point.
Note that the numbers we mention on our devsite etc. depend on the model you are trying to run. For CenterNet, the issue is that there are a lot of operations in the model which quantization doesn't really support, so we end up spending time quantizing and dequantizing tensors in the graph whenever non-quantized ops are encountered.
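One way to see where those quantize/dequantize hops happen is to dump the op-level structure of the converted model (a sketch; tf.lite.experimental.Analyzer is only available in recent TF releases, and the file name is just an example from the commands above):

```python
import tensorflow as tf

# Prints every op of the .tflite graph; runs of QUANTIZE/DEQUANTIZE around ops
# like CAST, GATHER_ND or TOPK_V2 are where the extra conversion time goes.
tf.lite.experimental.Analyzer.analyze(
    model_path="pcn_centernet_run0q_full_quant.tflite")
```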
@teijeong FYI for quantization issues.
@srjoglekar246 thank you for your quick reply.
Can you try increasing the number of threads being used by the TFLite Interpreter?
I only have 4 cores available on my target machine and on the Raspberry Pi.
Also note that our kernels are optimized for the Arm NEON instruction set, which may or may not be available on all the devices you mentioned. In the benchmarks you showed, SSD does indeed perform faster with full-int quantization than on floating point.
I double-checked and the NEON instruction set should be available (I verified it with getauxval(AT_HWCAP) & HWCAP_NEON, using <sys/auxv.h> and <asm/hwcap.h>).
Also, I realized I made a mistake during the CenterNet quantization process: it actually goes a little faster (~180 ms) when quantized with float fallback, but it's still slower than SSD.
Note that the numbers we mention on our devsite etc. depend on the model you are trying to run. For CenterNet, the issue is that there are a lot of operations in the model which quantization doesn't really support, so we end up spending time quantizing and dequantizing tensors in the graph whenever non-quantized ops are encountered.
I understand, but then how can the documentation claim that CenterNet is 3x faster than SSD? What are the conditions needed to reproduce the performance reported in the documentation?
I am experiencing a similar problem on Android.
SSD MobileNet v2 320x320 (Float16 quantized)
CenterNet MobileNetV2 FPN 512x512 (Float32)
Is there a mistake in the published benchmark scores for CenterNet MobileNetV2 FPN on the TF2 Model Zoo page?
I'm stuck with the same problem. Inference on my Huawei Mate P20Pro with CenterNet MobileNetV2 FPN 512x512 (F32) takes 212 ms.
Hello @alexdwu13 @srjoglekar246, I'm still facing the same problem with the GPU delegate for centernet_mobilenetv2fpn_512x512_coco17_kpts. It has been a couple of months that we have been facing this problem of unsupported ops. Are there any fixes or solutions for the output below?
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Following operations are not supported by GPU delegate:
ADD: OP is supported, but tensor type isn't matched!
ARG_MIN: Operation is not supported.
CAST: Operation is not supported.
FLOOR_DIV: Operation is not supported.
GATHER_ND: Operation is not supported.
GREATER: Operation is not supported.
GREATER_EQUAL: Operation is not supported.
LESS: Operation is not supported.
MUL: OP is supported, but tensor type isn't matched!
NOT_EQUAL: Operation is not supported.
PACK: OP is supported, but tensor type isn't matched!
RESHAPE: OP is supported, but tensor type isn't matched!
SELECT: Operation is not supported.
STRIDED_SLICE: STRIDED_SLICE supports for 3 or 4 dimensional tensors only.
STRIDED_SLICE: Slice does not support shrink_axis_mask parameter.
SUB: OP is supported, but tensor type isn't matched!
SUM: OP is supported, but tensor type isn't matched!
TILE: Operation is not supported.
TOPK_V2: Operation is not supported.
TRANSPOSE: OP is supported, but tensor type isn't matched!
UNPACK: Operation is not supported.
111 operations will run on the GPU, and the remaining 166 operations will run on the CPU.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
Hi,
I am able to run SSD MobileNetV2 and CenterNet MobileNetV2 (box prediction) on my Android device. When I compare the inference speed of the models, I get the following results:
Inference with CenterNet MobileNetV2 512x512 is approx. 5-6 times slower than the quantized SSD MobileNetV2 320x320 model.
Inference with CenterNet MobileNetV2 512x512 is approx. 3 times slower than the non-quantized SSD MobileNetV2 320x320 model.
Inference with CenterNet MobileNetV2 512x512 is approx. the same speed as the quantized SSD MobileNetV2 640x640 model.
I run the models on CPU only, using 4 threads, based on the official Android tflite tutorial (it is 6 months old; hopefully that is not the reason). I used the CenterNet MobileNetV2 512x512 tflite model directly from here: http://download.tensorflow.org/models/object_detection/tf2/20210210/centernet_mobilenetv2fpn_512x512_coco17_kpts.tar.gz (which, by the way, has an input size of 320x320 instead of 512x512; see the quick check below), and I also used a CenterNet MobileNetV2 512x512 tflite model created by following this tutorial: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/centernet_on_device.ipynb. Both CenterNet MobileNetV2 512x512 tflite models have the same inference speed.
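(The 320x320 input size mentioned above can be confirmed from Python; a quick sketch, where the local file name is just a placeholder for the .tflite file extracted from the archive:)

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder file name
print(interpreter.get_input_details()[0]["shape"])  # shows the 320x320 input mentioned above
```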
My understanding is that inference with CenterNet MobileNetV2 512x512 should be approx. 3 times faster than SSD MobileNetV2 320x320 (based on the official documentation/benchmark below).
What is your experience with the inference speed of CenterNet MobileNetV2 512x512? Is it fast for you? Am I missing something? Why is CenterNet MobileNetV2 512x512 so slow for me?
thank you