tensorflow / models

Models and examples built with TensorFlow

CenterNet MobileNetV2 - inference is too slow #10006

Open Paliking opened 3 years ago

Paliking commented 3 years ago

Hi,

I am able to run SSD MobileNetV2 and CenterNet MobileNetV2 (box prediction) on my Android device. When I compare the inference speed of the models on my Android device, I get the following results:

- Inference of CenterNet MobileNetV2 512x512 is approx. 5-6 times slower than the SSD MobileNetV2 320x320 quantized model.
- Inference of CenterNet MobileNetV2 512x512 is approx. 3 times slower than the SSD MobileNetV2 320x320 non-quantized model.
- Inference of CenterNet MobileNetV2 512x512 is approx. the same speed as the SSD MobileNetV2 640x640 quantized model.

I run the models on CPU only, using 4 threads, following the official Android TFLite tutorial (about 6 months old; hopefully that is not the reason). I used the CenterNet MobileNetV2 512x512 tflite model directly from here: http://download.tensorflow.org/models/object_detection/tf2/20210210/centernet_mobilenetv2fpn_512x512_coco17_kpts.tar.gz (which, by the way, has an input size of 320x320 instead of 512x512), and I also used a CenterNet MobileNetV2 512x512 tflite model created based on this tutorial: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/centernet_on_device.ipynb. Both CenterNet MobileNetV2 512x512 tflite models have the same inference speed.

My understanding is that inference of CenterNet MobileNetV2 512x512 should be approx. 3 times faster than SSD MobileNetV2 320x320, based on the official documentation/benchmark below:

| Model name | Speed (ms) | COCO mAP | Outputs |
|---|---|---|---|
| CenterNet MobileNetV2 FPN 512x512 | 6 | 23.4 | Boxes |
| SSD MobileNet v2 320x320 | 19 | 20.2 | Boxes |
| SSD MobileNet V1 FPN 640x640 | 48 | 29.1 | Boxes |
| SSD MobileNet V2 FPNLite 320x320 | 22 | 22.2 | Boxes |
| SSD MobileNet V2 FPNLite 640x640 | 39 | 28.2 | Boxes |
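
As a rough cross-check outside the app, the same .tflite files can also be timed on a desktop CPU with the Python `tf.lite.Interpreter`; a minimal sketch (file names are placeholders, and desktop timings will of course differ from Android):

```python
import time
import numpy as np
import tensorflow as tf

def avg_inference_ms(model_path, num_threads=4, runs=50):
    # Load the model and feed it a random input of the right shape/dtype.
    interpreter = tf.lite.Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.random_sample(inp['shape']).astype(inp['dtype'])
    interpreter.set_tensor(inp['index'], dummy)
    interpreter.invoke()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) * 1000 / runs

# Placeholder file names; replace with your own converted models.
for path in ['ssd_mobilenet_v2_320.tflite', 'centernet_mobilenetv2_fpn.tflite']:
    print(path, f'{avg_inference_ms(path):.1f} ms')
```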

What is your experience with inference speed of CenterNet MobileNetV2 512x512? Is it fast for you? Am I missing something? Why is CenterNet MobileNetV2 512x512 so slow for me?

thank you

Paliking commented 3 years ago

Hi,

can someone confirm if you have the same experience with the CenterNet MobileNetV2 512x512?

thank you

rdutta1999 commented 3 years ago

I am experiencing a similar problem, although I am using the object detection models converted to tfjs_graph_models (both with uint8 quantization). For ssd_mobilenetv2 and centernet_mobilenetv2_fpn I noticed the following inference times when running in my PC browser (which uses a GPU):

- SSD MobileNetV2: ~120 ms
- CenterNet MobileNetV2 FPN: ~180 ms

On mobile phones it is a lot worse, with roughly a 1.5x-2x additional performance degradation.

PS: CenterNet MobileNetV2 FPN 512x512 expects an image of shape 320x320, not 512x512.
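
This can be checked directly from the .tflite file, e.g. with the Python interpreter (the file name is a placeholder):

```python
import tensorflow as tf

# The file name is a placeholder for the downloaded CenterNet tflite model.
interpreter = tf.lite.Interpreter(model_path='centernet_mobilenetv2_fpn_od.tflite')
print(interpreter.get_input_details()[0]['shape'])  # prints something like [  1 320 320   3]
```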

b04505009 commented 3 years ago

Maybe there is a mismatch between config and model?

alexdwu13 commented 3 years ago

I had a similar experience running the CenterNet MobileNetV2 tflite model on a mobile GPU. My benchmarking tool logged these warnings:

```
...
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Following operations are not supported by GPU delegate:
CAST: Operation is not supported.
FLOOR_DIV: Operation is not supported.
GATHER_ND: Operation is not supported.
GREATER: Operation is not supported.
LESS: Operation is not supported.
...
95 operations will run on the GPU, and the remaining 41 operations will run on the CPU.
```

original thread: https://github.com/tensorflow/models/issues/9414#issuecomment-789258818

@srjoglekar246 mentions that these unsupported ops are likely to be small post-processing ops. However, the fact that we all seem to be observing much slower than expected inference times suggests the model may not be fully taking advantage of the GPU.
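
One way to see up front which ops end up in a converted model, and whether they are GPU-delegate compatible, is TFLite's model analyzer (available in recent TensorFlow releases; the model path is a placeholder):

```python
import tensorflow as tf

# Prints the ops in the flatbuffer and flags GPU-delegate compatibility issues.
# Requires a recent TensorFlow release; the path is a placeholder.
tf.lite.experimental.Analyzer.analyze(
    model_path='centernet_mobilenetv2_fpn_kpts.tflite',
    gpu_compatibility=True)
```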

Has anyone else come across these warnings?

srjoglekar246 commented 3 years ago

@rdutta1999 mentions running on PC/browser, which is not what the TFLite GPU delegate is intended for.

@alexdwu13 What device are you using? Also, could you mention the inference times you observe on CPU vs GPU?

mattdibi commented 3 years ago

Hi @srjoglekar246,

I'm observing the same behaviour on different devices, further aggravated by the lack of speedup after post-training quantization (in some circumstances I observe a slowdown).

I'm mainly focused on the following two models:

I followed these two tutorials for exporting my trained models:

Here are the steps to reproduce:

Training

SSD Mobilenet:

```sh
python3 ${TF_OD_DIR}/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path workspace/training/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.config \
    --model_dir workspace/training/pcn_ssd_mobilenet_v2_fpnlite_320x320/run2 \
    --alsologtostderr
```

Centernet MobileNet:

```sh
python3 ${TF_OD_DIR}/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path workspace/training/centernet_mobilenetv2fpn_512x512_coco17_od.config \
    --model_dir workspace/training/pcn_centernet_mobilenetv2fpn_512x512/run0q \
    --alsologtostderr
```

See attached configurations for details

Model export

SSD Mobilenet:

```sh
python3 ${TF_OD_DIR}/models/research/object_detection/export_tflite_graph_tf2.py \
    --pipeline_config_path workspace/training/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.config \
    --trained_checkpoint_dir workspace/training/pcn_ssd_mobilenet_v2_fpnlite_320x320/run0q \
    --output_directory workspace/exported-models/pcn_ssd_mobilenet_v2_fpnlite_320x320/run0q/tflite
```

Centernet MobileNet:

```sh
python3 ${TF_OD_DIR}/models/research/object_detection/export_tflite_graph_tf2.py \
    --pipeline_config_path workspace/training/centernet_mobilenetv2fpn_512x512_coco17_od.config \
    --trained_checkpoint_dir workspace/training/pcn_centernet_mobilenetv2fpn_512x512/run0q \
    --output_directory workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite \
    --centernet_include_keypoints=false \
    --max_detections=10 \
    --config_override=" \
    model{ \
      center_net { \
         image_resizer { \
             fixed_shape_resizer { \
                 height: 320 \
                 width: 320 \
             } \
         } \
      } \
    }"
```

TFLite conversion

Once I have the model in the TFLite-friendly SavedModel format, I perform the conversion to .tflite in three different ways.

1) tflite_convert script (exp1)

Let's refer to the models exported this way as exp1.

```sh
tflite_convert \
    --saved_model_dir ../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model \
    --output_file pcn_centernet_run0q.tflite
```

2) Dynamic range quantization (exp2)

Let's refer to the models exported this way as exp2.

```python
#!/usr/bin/env python3
# coding: utf-8

import argparse

import tensorflow as tf

# Define and parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument('--modeldir', help='Folder the SavedModel is located in',
                    required=True)
parser.add_argument('--output', help='Path to output file',
                    required=True)

args = parser.parse_args()

SAVED_MODEL_DIR = args.modeldir
OUTPUT_MODEL    = args.output

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# Save the model.
with open(OUTPUT_MODEL, 'wb') as f:
    f.write(tflite_quant_model)
```

I run the script with:

```sh
./tflite_convert_quantize.py \
    --modeldir ../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model \
    --output pcn_centernet_run0q_dyn_range.tflite
```

3) Full integer quantization (exp3)

Let's refer to the models exported this way as exp3.

```python
#!/usr/bin/env python3
# coding: utf-8

import argparse
import random
from glob import glob

import numpy as np
import tensorflow as tf


def preprocess(image, height, width):
    if image.dtype != tf.float32:
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)

    # Resize the image to the specified height and width.
    image = tf.expand_dims(image, 0)
    image = tf.compat.v1.image.resize_bilinear(image, [height, width],
                                               align_corners=False)
    image = tf.squeeze(image, [0])

    # Normalize to [-1, 1].
    image = tf.subtract(image, 0.5)
    image = tf.multiply(image, 2.0)
    return image


def representative_dataset():
    files = glob('../../workspace/images/test/*.jpg')
    random.shuffle(files)
    files = files[:128]
    for file in files:
        image = tf.io.read_file(file)
        image = tf.compat.v1.image.decode_jpeg(image)
        image = preprocess(image, 320, 320)

        yield [image]


# Define and parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument('--modeldir', help='Folder the SavedModel is located in',
                    required=True)
parser.add_argument('--output', help='Path to output file',
                    required=True)

args = parser.parse_args()

SAVED_MODEL_DIR = args.modeldir
OUTPUT_MODEL    = args.output

# Refer to: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_mobile_tf2.md#step-2-convert-to-tflite
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8, tf.lite.OpsSet.TFLITE_BUILTINS]
converter.representative_dataset = representative_dataset
# converter.inference_input_type  = tf.uint8  # or tf.int8
# converter.inference_output_type = tf.uint8  # or tf.int8
tflite_quant_model = converter.convert()

# Save the model.
with open(OUTPUT_MODEL, 'wb') as f:
    f.write(tflite_quant_model)
```

I run the script with:

```sh
./tflite_convert_quantize_full_int.py \
    --modeldir ../../workspace/exported-models/pcn_centernet_mobilenetv2fpn/run0q/tflite/saved_model \
    --output pcn_centernet_run0q_full_quant.tflite
```
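
As a sanity check on how much of the graph actually ended up in int8, here is a minimal sketch that counts tensor dtypes in the converted model (the file name is a placeholder):

```python
from collections import Counter

import tensorflow as tf

# Count tensor dtypes in the converted model; the file name is a placeholder.
interpreter = tf.lite.Interpreter(model_path='pcn_centernet_run0q_full_quant.tflite')
print(Counter(str(t['dtype']) for t in interpreter.get_tensor_details()))
```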

Benchmark

I then perform the benchmarks using the benchmarking tool.

The results are the following:

| Model | Export | Avg Inference |
|---|---|---|
| Centernet | exp1 | 209 ms |
| Centernet | exp2 | 289 ms |
| Centernet | exp3 | 211 ms |
| SSD FPNLite | exp1 | 182 ms |
| SSD FPNLite | exp2 | 320 ms |
| SSD FPNLite | exp3 | 115 ms |

The quantized model is slower than the floating-point model, which is not what we expect to happen: we should see a 2x-3x speedup.

Furthermore, according to your documentation, the CenterNet model should be 3x faster than the SSD-based one.

Details

Library versions

Environment details

I'm running the models on a custom architecture (Arm 64) but can replicate the behaviour on my workstation (Ubuntu 19.10 x86_64) and on my Raspberry Pi 4. Everything is run on the CPU.

Benchmark details

Centernet exp1

```sh
/ # ./linux_aarch64/bin/benchmark_model --graph=/usr/graphs/pcn_centernet_run0q.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/usr/graphs/pcn_centernet_run0q.tflite]
Loaded model /usr/graphs/pcn_centernet_run0q.tflite
The input model file size (MB): 9.35788
Initialized session in 19.975ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=229950 curr=210605 min=209598 max=229950 avg=216718 std=9365
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=208881 curr=210369 min=208633 max=213102 avg=209931 std=849
Inference timings in us: Init: 19975, First inference: 229950, Warmup (avg): 216718, Inference (avg): 209931
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.08203 overall=36.8633
```

Centernet exp2

```sh
/ # ./linux_aarch64/bin/benchmark_model --graph=/usr/graphs/pcn_centernet_run0q_dyn_range.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/usr/graphs/pcn_centernet_run0q_dyn_range.tflite]
Loaded model /usr/graphs/pcn_centernet_run0q_dyn_range.tflite
The input model file size (MB): 2.83549
Initialized session in 4.773ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=2 first=304200 curr=283164 min=283164 max=304200 avg=293682 std=10518
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=284196 curr=283994 min=282502 max=285762 avg=283974 std=859
Inference timings in us: Init: 4773, First inference: 304200, Warmup (avg): 293682, Inference (avg): 283974
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=5.81641 overall=38.6484
```

Centernet exp3

```sh
/ # ./linux_aarch64/bin/benchmark_model --graph=/usr/graphs/pcn_centernet_run0q_full_quant.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/usr/graphs/pcn_centernet_run0q_full_quant.tflite]
Loaded model /usr/graphs/pcn_centernet_run0q_full_quant.tflite
The input model file size (MB): 9.35767
Initialized session in 2.725ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=228120 curr=210386 min=210386 max=228120 avg=216727 std=8073
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=212029 curr=210534 min=210383 max=213942 avg=211509 std=760
Inference timings in us: Init: 2725, First inference: 228120, Warmup (avg): 216727, Inference (avg): 211509
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=4.02344 overall=38.2773
```

SSD FPNLite exp1

```sh
/ # ./linux_aarch64/bin/benchmark_model --graph=/usr/src/graphs/pcn_ssd_mobilenet_v2.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/usr/src/graphs/pcn_ssd_mobilenet_v2.tflite]
Loaded model /usr/src/graphs/pcn_ssd_mobilenet_v2.tflite
The input model file size (MB): 18.597
Initialized session in 2.456ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=144213 curr=133841 min=133261 max=144213 avg=136884 std=4374
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=133661 curr=135445 min=133188 max=136577 avg=134058 std=830
Inference timings in us: Init: 2456, First inference: 144213, Warmup (avg): 136884, Inference (avg): 134058
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.91016 overall=35.4414
```

SSD FPNLite exp2

```sh
/ # ./linux_aarch64/bin/benchmark_model --graph=/usr/src/graphs/pcn_ssd_mobilenet_v2_fpnlite_320x320_run1_dyn_quant.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/usr/src/graphs/pcn_ssd_mobilenet_v2_fpnlite_320x320_run1_dyn_quant.tflite]
Loaded model /usr/src/graphs/pcn_ssd_mobilenet_v2_fpnlite_320x320_run1_dyn_quant.tflite
The input model file size (MB): 3.61147
Initialized session in 7.829ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=2 first=332830 curr=319233 min=319233 max=332830 avg=326032 std=6798
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=318424 curr=320376 min=317853 max=333894 avg=320568 std=2449
Inference timings in us: Init: 7829, First inference: 332830, Warmup (avg): 326032, Inference (avg): 320568
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=5.70703 overall=35.1914
```

SSD FPNLite exp3

```sh
/ # ./linux_aarch64/bin/benchmark_model --graph=/usr/src/graphs/pcn_ssd_mobilenet_v2_fpnlite_320x320_run1_exp_quant.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [/usr/src/graphs/pcn_ssd_mobilenet_v2_fpnlite_320x320_run1_exp_quant.tflite]
Loaded model /usr/src/graphs/pcn_ssd_mobilenet_v2_fpnlite_320x320_run1_exp_quant.tflite
The input model file size (MB): 3.76347
Initialized session in 7.27ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=5 first=115710 curr=110998 min=110998 max=115710 avg=112369 std=1710
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=112161 curr=110968 min=110852 max=112818 avg=111410 std=567
Inference timings in us: Init: 7270, First inference: 115710, Warmup (avg): 112369, Inference (avg): 111410
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=6.71484 overall=12.375
```

Training configurations

srjoglekar246 commented 3 years ago

@mattdibi Can you try increasing the number of threads being used by the TFLite Interpreter? Also note that our kernels are optimized for the Arm NEON instruction set, which may or may not be available on all the devices you mentioned. In the benchmarks you showed, SSD does indeed perform faster with full-int quantization than on floating point.
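For reference, a minimal sketch of setting the thread count with the Python interpreter (the model path is a placeholder; the benchmark tool's --num_threads and the Java interpreter expose the equivalent setting):

```python
import tensorflow as tf

# num_threads controls the CPU thread pool used by TFLite kernels.
# The model path is a placeholder.
interpreter = tf.lite.Interpreter(model_path='pcn_centernet_run0q.tflite',
                                  num_threads=8)
interpreter.allocate_tensors()
```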

Note that the numbers we mention on our devsite etc depend on the model you are trying to run. For the CenterNet, the issue is that there are a lot of operations in the model which quantization doesn't really support. So we end up actually spending time quantizing & dequantizing tensors in the graph whenever non-quantized ops are encountered.

srjoglekar246 commented 3 years ago

@teijeong FYI for quantization issues.

mattdibi commented 3 years ago

@srjoglekar246 thank you for your quick reply.

Can you try increasing the number of threads being used by the TFLite Interpreter?

I only have 4 cores available on my target machine and on the Raspberry Pi.

Also note that our kernels are optimized for the Arm NEON instruction set, which may or may not be available on all the devices you mentioned. In the benchmarks you showed, SSD does indeed perform faster with full-int quantization than on floating point.

I double-checked and the instruction set should be available (I used #include <sys/auxv.h> and #include <asm/hwcap.h> with getauxval(AT_HWCAP) & HWCAP_NEON).
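
A rough equivalent check from Python, assuming a Linux target (32-bit Arm lists `neon` in /proc/cpuinfo, AArch64 lists `asimd`):

```python
# Rough check, assuming Linux: 32-bit Arm reports "neon", AArch64 reports "asimd".
with open('/proc/cpuinfo') as f:
    flags = f.read()
print('NEON/ASIMD available:', 'neon' in flags or 'asimd' in flags)
```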

Also, I realized I made a mistake during the CenterNet quantization process: it actually goes a little faster (~180ms) when quantized using float fallback, but it's still slower than SSD.
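
For reference, integer quantization with float fallback typically corresponds to a conversion like the sketch below (reusing SAVED_MODEL_DIR and representative_dataset from the exp3 script above); ops without integer kernels then simply stay in float:

```python
import tensorflow as tf

# Float-fallback quantization: optimizations + representative dataset, but
# supported_ops is NOT restricted to int8-only, so unsupported ops remain float.
# SAVED_MODEL_DIR and representative_dataset are reused from the exp3 script above.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```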

Note that the numbers we mention on our devsite etc depend on the model you are trying to run. For the CenterNet, the issue is that there are a lot of operations in the model which quantization doesn't really support. So we end up actually spending time quantizing & dequantizing tensors in the graph whenever non-quantized ops are encountered.

I understand, but then how can the documentation say that CenterNet is 3x faster than SSD? What are the conditions to reproduce the performance reported in the documentation?

firststep-dev commented 1 year ago

I am experiencing a similar problem on Android.

SSD MobileNet v2 320x320 (Float16 quantized)

CenterNet MobileNetV2 FPN 512x512 (Float 32 quantized)

Is there a mistake in the published benchmark scores of CenterNet MobileNetV2 FPN on the TF2 Model Zoo page?

Usefff commented 1 year ago

I'm stuck with the same problem. Inference on my Huawei Mate P20Pro with CenterNet MobileNetV2 FPN 512x512 (F32) is 212ms.

Shubham654 commented 1 year ago

Hello @alexdwu13 @srjoglekar246, we are still facing the same problem with the GPU delegate for centernet_mobilenetv2fpn_512x512_coco17_kpts. It has been a couple of months that we have been facing this problem of unsupported ops. Are there any fixes or solutions for the problem below?

```
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Following operations are not supported by GPU delegate:
ADD: OP is supported, but tensor type isn't matched!
ARG_MIN: Operation is not supported.
CAST: Operation is not supported.
FLOOR_DIV: Operation is not supported.
GATHER_ND: Operation is not supported.
GREATER: Operation is not supported.
GREATER_EQUAL: Operation is not supported.
LESS: Operation is not supported.
MUL: OP is supported, but tensor type isn't matched!
NOT_EQUAL: Operation is not supported.
PACK: OP is supported, but tensor type isn't matched!
RESHAPE: OP is supported, but tensor type isn't matched!
SELECT: Operation is not supported.
STRIDED_SLICE: STRIDED_SLICE supports for 3 or 4 dimensional tensors only.
STRIDED_SLICE: Slice does not support shrink_axis_mask parameter.
SUB: OP is supported, but tensor type isn't matched!
SUM: OP is supported, but tensor type isn't matched!
TILE: Operation is not supported.
TOPK_V2: Operation is not supported.
TRANSPOSE: OP is supported, but tensor type isn't matched!
UNPACK: Operation is not supported.
111 operations will run on the GPU, and the remaining 166 operations will run on the CPU.
INFO: Initialized OpenCL-based API.
INFO: Created 1 GPU delegate kernels.
```