support for quantized tflite models

google / XNNPACK

High-efficiency floating-point neural network inference operators for mobile, server, and Web

Other

1.84k stars 357 forks source link

support for quantized tflite models #999

Closed honglh closed 4 years ago

honglh commented 4 years ago

Hi, currently it seems only f32 delegates to xnnpack is supported, though the qs8~/qu8~ operators available in xnnpack. Is it on roadmap to add runtime support so quantized tflite models can also delegate to xnnpack qs8/qu8 operators?

Thanks!

Maratyszcza commented 4 years ago

Yes, quantized operators are work-in-progress

ephemer commented 3 years ago

Just wondering if this is still in the pipeline? From what I understand this was working fine on Android last I checked but not yet in tfjs.

Sorry to revive an old thread but XNNPACK doesn't seem to benefit from a website with the available features and/or release announcements like tensor flow does. (AFAIK?)

Maratyszcza commented 3 years ago

You can try quantized inference in the latest version of the XNNPACK delegate for TensorFlow Lite. The following limitations currently apply:

Only ADD, CONV_2D, DEPTHWISE_CONV_2D, and FULLY_CONNECTED operators are supported.
Only signed per-tensor quantization is supported. You typically get signed per-tensor quantization by doing quantization-aware training with Model Optimization Toolkit.
Quantized inference in XNNPACK is disabled by default, build with --define xnn_enable_qs8=true to enable it

AIWintermuteAI commented 3 years ago

Hi, @Maratyszcza ! Thank you for putting great amount of effort working on XNNPACK!

I have tried quantized inference in the latest version of the XNNPACK delegate for TensorFlow Lite, but it didn't bring any speed improvement as compared to vanilla Tensorflow Lite INT8 inference. What I did was: 1) compile tensorflow lite with XNNPACK and --define xnn_enable_qs8=true enabled I used https://github.com/tensorflow/tensorflow/commit/800e426150326e399f8b11215d56d58a4ad0b3ee#diff-27433f473c0183510f3ba5e9837691a75580f0f0437af8dc082ecdb21074d7cb as it's last relevant commit for XNNPACK and compiled aarch64 binary with the following bazel build flags:

# Build python interpreter_wrapper.
cd "${BUILD_DIR}"
case "${TENSORFLOW_TARGET}" in
  armhf)
    BAZEL_FLAGS="--config=elinux_armhf
      --copt=-march=armv7-a --copt=-mfpu=neon-vfpv4
      --copt=-O3 --copt=-fno-tree-pre --copt=-fpermissive
      --define tensorflow_mkldnn_contraction_kernel=0
      --define=raspberry_pi_with_neon=true
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true
      --define=enable_int8_weights_unpacking=true"
    ;;
  aarch64)
    BAZEL_FLAGS="--config=elinux_aarch64
      --define tensorflow_mkldnn_contraction_kernel=0
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true
      --define=enable_int8_weights_unpacking=true
      --copt=-O3"
    ;;
  native)
    BAZEL_FLAGS="--copt=-O3 --copt=-march=native
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true
      --define=enable_int8_weights_unpacking=true"
    ;;
  *)
    BAZEL_FLAGS="--copt=-O3
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true
      --define=enable_int8_weights_unpacking=true"      
    ;;
esac

Here is compiled binary, you can examine it - it is primarily to be used with Raspberry Pi OS 64-bit image. https://drive.google.com/file/d/1UR8hg3ez8LbWF-Mz3vL9yMYN-VmXBYzG/view?usp=sharing

2) applied QAT to trained model (MobileNet v1 alpha 1.0 backend + YOLOv3 detection layer, single branch, so no UpSampling). Here is resulting and original models https://drive.google.com/file/d/1ClGbO7N3sqkfthOJPfXm4SHoQ8s8mXFq/view?usp=sharing https://drive.google.com/file/d/15MZ82rDC0W5rMTnT7ZWSE0F4GJthR14a/view?usp=sharing 3) Converted QAT model to .tflite and executed it on Raspberry Pi 4 https://drive.google.com/file/d/1mPIdiAOMosIs-4ELebdIBLkZWAC9Jn54/view?usp=sharing The results for inference on video file with latest XNNPACK are:

pi@raspberrypi:~/raspberry_pi/detector_v3 $ source ~/tflite-new/bin/activate
(tflite-new) pi@raspberrypi:~/raspberry_pi/detector_v3 $ python3 detector_file.py --model yolo_best_recall_qat.tflite --labels labels.txt --file ../samples/test_s.mp4 
4.5.2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Processing frames:   0%|                                                                                                                    | 0/141 [00:00<?, ?it/s]/home/pi/raspberry_pi/detector_v3/box.py:45: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return np.array(temp_list)
Processing frames: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:16<00:00,  8.68it/s]
Finished processing frames
Average time(ms):  86.0
FPS:  11.627906976744185

The result for vanilla tflite runtime 2.5.0 (downloaded with pip):

(tflite-new) pi@raspberrypi:~/raspberry_pi/detector_v3 $ source ~/tflite-vanilla/bin/activate
(tflite-vanilla) pi@raspberrypi:~/raspberry_pi/detector_v3 $ python3 detector_file.py --model yolo_best_recall_qat.tflite --labels labels.txt --file ../samples/test_s.mp4 
4.5.2
Processing frames: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:16<00:00,  8.67it/s]
Finished processing frames
Average time(ms):  85.0
FPS:  11.764705882352942

If necessary I can provide scripts and environment specs used for inference too. Is there something wrong with model?

Maratyszcza commented 3 years ago

The CONV_2D and DEPTHWISE_CONV_2D operators in your TFLite models use per-channel quantization, while XNNPACK supports only per-tensor. You should train the model using per-tensor quantization, or wait for ~1 week until we fully ship per-channel quantization support in XNNPACK and its TFLite delegate.

AIWintermuteAI commented 3 years ago

Thanks for the information! Yes, it does seem that QAT in tfmot switched from per-tensor to per-channel a while ago, according to this comment https://github.com/tensorflow/tensorflow/issues/34299#issuecomment-606735890

There is some information here on how to enable per-tensor quantization here https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide#setup_defaultdensequantizeconfig So I'll try doing that for now and wait for per-channel quantization support in XNNPACK and its TFLite delegate.

AIWintermuteAI commented 3 years ago

Hello, @Maratyszcza ! Seeing this commit https://github.com/tensorflow/tensorflow/commit/f9477365d39a359275c603a5d8ad37eeb31b1450#diff-27433f473c0183510f3ba5e9837691a75580f0f0437af8dc082ecdb21074d7cb added per-channel quantization support I tried to compile this specific commit, but wasn't successful (it does say some internal CLI has failed), so I picked a more recent commit https://github.com/tensorflow/tensorflow/commit/86fa4da8cbf8d3a01292d630113ad7d3bc4b50c8 which was passing all the checks.

You can find both FLOAT32 and INT8 models, original Keras model and Python 3.7 aarch64 wheel in the following Google drive folder I shared: https://drive.google.com/drive/folders/14LKO_hbc4VeTri8k25zhRJleRc7cLF_T?usp=sharing

Here are my testing results, run on Raspberry Pi 4 with FLOAT32 model and INT8 model

tflite-new is tensorflow lite compiled from https://github.com/tensorflow/tensorflow/commit/86fa4da8cbf8d3a01292d630113ad7d3bc4b50c8 (2.7.0) and tflite-vanilla is tensorflow lite runtime installed with pip (2.5.0 from https://google-coral.github.io/py-repo/) Both run with num_threads = 4.

-	tflite-new	tflite-vanilla
FLOAT32	132 ms.	125 ms.
INT8	77 ms.	72 ms.

So the new version is slightly slower for both FP32/INT8... I'm a bit confused about the results, are there some other requirements I'm missing?

Maratyszcza commented 3 years ago

@AIWintermuteAI As you're running with 4 threads, I suggest to first ensure that your RPi isn't thermally throttled, as this commonly happens unless you use special cooling. Run vcgencmd get_throttled to check if RPi was throttled since boot.

AIWintermuteAI commented 3 years ago

Thanks for the suggestion! I installed the cooling tower on R Pi 4 and run the tests again. Now new tflite and vanilla tflite performance seem to be almost the same, but still not faster.... Below are testing output and script used in testing:

pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ source ~/tflite-vanilla/bin/activate
(tflite-vanilla) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_int8.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:13<00:00, 10.16it/s]
Finished processing frames
Average time(ms):  75.0
FPS:  13.333333333333334

(tflite-vanilla) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_float32.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:22<00:00,  6.18it/s]
Finished processing frames
Average time(ms):  131.0
FPS:  7.633587786259542

(tflite-vanilla) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ deactivate
pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ source ~/tflite-new/bin/activate

(tflite-new) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_int8.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Processing frames:   0%|                                                                                                                                           | 0/141 [00:00<?, ?it/s]/home/pi/aXeleRate/example_scripts/tensorflow_lite/detector/cv_utils.py:349: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return np.array(temp_list)
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:15<00:00,  9.33it/s]
Finished processing frames
Average time(ms):  77.0
FPS:  12.987012987012987

(tflite-new) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_float32.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Processing frames:   0%|                                                                                                                                           | 0/141 [00:00<?, ?it/s]/home/pi/aXeleRate/example_scripts/tensorflow_lite/detector/cv_utils.py:349: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return np.array(temp_list)
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:21<00:00,  6.43it/s]
Finished processing frames
Average time(ms):  131.0
FPS:  7.633587786259542
(tflite-new) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ vcgencmd get_throttled
throttled=0x0

Code:

import time
import argparse
import os
import cv2
import numpy as np
from tqdm import tqdm

from cv_utils import init_video_file_capture, decode_yolov3, draw_bounding_boxes, preprocess
from tflite_runtime.interpreter import Interpreter

def load_labels(path):
    with open(path, 'r') as f:
        return {i: line.strip() for i, line in enumerate(f.read().replace('"','').split(','))}

class NetworkExecutor(object):

    def __init__(self, model_file):

        self.interpreter = Interpreter(model_file, num_threads=4)
        self.interpreter.allocate_tensors()
        _, self.input_height, self.input_width, _ = self.interpreter.get_input_details()[0]['shape']
        self.tensor_index = self.interpreter.get_input_details()[0]['index']

    def get_output_tensors(self):

      output_details = self.interpreter.get_output_details()
      tensor_indices = []
      tensor_list = []

      for output in output_details:
            tensor = np.squeeze(self.interpreter.get_tensor(output['index']))
            tensor_list.append(tensor)

      return tensor_list

    def run(self, image):
        if image.shape[1:2] != (self.input_height, self.input_width):
            img = cv2.resize(image, (self.input_width, self.input_height))
        img = preprocess(img)
        self.interpreter.set_tensor(self.tensor_index, img)
        self.interpreter.invoke()
        return self.get_output_tensors()

def main(args, detector):
    video, video_writer, frame_count = init_video_file_capture(args.file, 'detector_demo')

    if not os.path.exists(args.labels[0]):
        labels = args.labels
    else:   
        labels = load_labels(args.labels[0])

    frame_num = len(frame_count)
    times = []

    for _ in tqdm(frame_count, desc='Processing frames'):
        frame_present, frame = video.read()
        if not frame_present:
            continue

        start_time = time.time()
        results = detection_network.run(frame)
        elapsed_ms = (time.time() - start_time) * 1000

        detections = decode_yolov3(netout = results, threshold = args.threshold)

        draw_bounding_boxes(frame, detections, labels)

        times.append(elapsed_ms)
        video_writer.write(frame)

    print('Finished processing frames')
    video.release(), video_writer.release()

    print("Average time(ms): ", sum(times)//frame_num) 
    print("FPS: ", 1000.0 / (sum(times)//frame_num)) # FPS = 1 / time to process loop

if __name__ == "__main__" :

    print("OpenCV version: {}".format(cv2. __version__))

    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--model', help='File path of .tflite file.', required=True)
    parser.add_argument('--labels', nargs="+", help='File path of labels file.', required=True)
    parser.add_argument('--threshold', help='Confidence threshold.', default=0.7)
    parser.add_argument('--file', help='File path of video file', default=None)
    args = parser.parse_args()

    detection_network = NetworkExecutor(args.model)

    main(args, detection_network)

Maratyszcza commented 3 years ago

@AIWintermuteAI please check the following:

You're building XNNPACK and XNNPACK delegate for TensorFlow Lite with --define xnn_enable_qs8=true Bazel option.
You're running TensorFlow Lite with the XNNPACK delegate.
The model uses signed quantization schema, and all activations and weights are INT8 type.
You're running an ARM64 binary.

AIWintermuteAI commented 3 years ago

Yes to all 4, so it does seem either I'm missing something very obvious or there is something wrong with the model... To give a more detailed description: 1) Here are my build options

# Build python interpreter_wrapper.
cd "${BUILD_DIR}"
case "${TENSORFLOW_TARGET}" in
  armhf)
    BAZEL_FLAGS="--config=elinux_armhf
      --copt=-march=armv7-a --copt=-mfpu=neon-vfpv4
      --copt=-O3 --copt=-fno-tree-pre --copt=-fpermissive
      --define tensorflow_mkldnn_contraction_kernel=0
      --define=raspberry_pi_with_neon=true
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true"
    ;;
  aarch64)
    BAZEL_FLAGS="--config=elinux_aarch64
      --define tensorflow_mkldnn_contraction_kernel=0
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true
      --copt=-O3"
    ;;
  native)
    BAZEL_FLAGS="--copt=-O3 --copt=-march=native
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true"
    ;;
  *)
    BAZEL_FLAGS="--copt=-O3
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true"      
    ;;
esac

Then compile with

sudo CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3.7 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.7" \
  tensorflow/tools/ci_build/ci_build.sh PI-PYTHON37 \
  tensorflow/lite/tools/pip_package/build_pip_package_with_bazel.sh aarch64

You can find the resulting wheel here if you'd like to check: https://drive.google.com/drive/folders/14LKO_hbc4VeTri8k25zhRJleRc7cLF_T?usp=sharing

2) There is INFO: Created TensorFlow Lite XNNPACK delegate for CPU. line, so I'd assume yes it is enabled. 3) I double checked on that to make sure signed activations are used. You can have a quick look at the model in the google drive folder. Basically this is what I use for conversion:

            converter = tf.lite.TFLiteConverter.from_keras_model(model)
            converter.optimizations = [tf.lite.Optimize.DEFAULT]            
            converter.representative_dataset = self.edgetpu_dataset_gen

which should give an int8 quantized model with float inputs and outputs, correct? I'm using tensorflow 2.5 installed from pip, perhaps that could be the source of the problem? 4) Absolutely certain it is ARM64.

Perhaps there is a certain commit and/or Tensorflow version combination you would recommend. that you know for sure works?

P.S. Something interesting, but perhaps unrelated - after I switched to tf 2.5 and set inputs/outputs to be quantized as well with

            converter = tf.lite.TFLiteConverter.from_keras_model(model)
            converter.optimizations = [tf.lite.Optimize.DEFAULT]
            converter.representative_dataset = self.edgetpu_dataset_gen
            converter.target_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
            converter.inference_input_type = tf.int8
            converter.inference_output_type = tf.int8

and ran the resulting model INFO: Created TensorFlow Lite XNNPACK delegate for CPU. has disappeared.

Maratyszcza commented 3 years ago

@AIWintermuteAI I tried the model, and get the following error: "Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors.". It seems that the delegate isn't being used due to dynamic-sized tensors in the model.

AIWintermuteAI commented 3 years ago

@Maratyszcza Okay, that gets us somewhere. After reading your comment, I added the following parameter during tflite model conversion,

model.input.set_shape(1 + model.input.shape[1:])

which should produce the model with static shapes. It didn't make the difference for inference time with my benchmark script however.

How can I make sure that delegate is being used? If I perform inference with Tensorflow Lite it only says INFO: Created TensorFlow Lite XNNPACK delegate for CPU. and nothing else.

Additionally I have tried benchmark tool from https://www.tensorflow.org/lite/performance/measurement#native_benchmark_binary

It says latest nightly version, but it seems the XNNPACK delegate there doesn't support INT8 inference. I get the following output:

pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ ./linux_aarch64_benchmark_model --graph=yolo_best_recall_int8_static.tflite --use_xnnpack=true --enable_op_profiling=true --num_threads=3
STARTING!
Log parameter values verbosely: [0]
Num threads: [3]
Graph: [yolo_best_recall_int8_static.tflite]
Enable op profiling: [1]
#threads used for CPU inference: [3]
Use xnnpack: [1]
Loaded model yolo_best_recall_int8_static.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
XNNPACK delegate created.
Though XNNPACK delegate is explicitly applied, the model graph will not be executed by the delegate.

and then it just defaults to vanilla tensorflow lite. You can find the model with static shapes in "third post" folder in shared Google Drive folder for our discussion: https://drive.google.com/drive/folders/1-jr0vWANCgWN1dHJ0UBfmxcUeygKu1Q2?usp=sharing

Maratyszcza commented 3 years ago

@AIWintermuteAI I tried the static model, and it does run on XNNPACK. Make sure you build with --define xnn_enable_qs8=true. Here's what I see on Pixel 2: TFLite without XNNPACK: 129 ms TFLite with XNNPACK: 118 ms

Maratyszcza commented 3 years ago

The speedup could be better if XNNPACK supported quantized PAD operator (which also can be fused into convolutions). Currently PAD falls back to TFLite implementation which is responsible for 7% of runtime on Pixel 2.

AIWintermuteAI commented 3 years ago

@Maratyszcza hi there, thanks for checking! I still get the same time when running both single-thread and 4-thread inference on Raspberry Pi 4 with that model and XNNPACK. But perhaps the speed up is existent, but not noticeable? - upon closer look at XNNPACK benchmarks, Raspberry Pi 4 shows comparably small inference acceleration, compared to other platforms. source: https://blog.tensorflow.org/2020/07/accelerating-tensorflow-lite-xnnpack-integration.html

There are no exact numbers, but from the look at the graph it seems it's 30-40% improvement for MobileNetv2 for Raspberry Pi 4 and close to 100% to for Pixel 2. Then, considering you had just 9% speed-up in the test Pixel 2 for my model, on Raspberry Pi it might be closer to 3-4 % and probably being offset by other factors, so I cannot see it in the benchmark.

What do you think is the main reason for Raspberry Pi 4 inference speed-up being the lowest among comparable platforms?

(tflite-vanilla) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_int8_static.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:30<00:00,  4.68it/s]
Finished processing frames
Average time(ms):  191.0
FPS:  5.2356020942408374
(tflite-vanilla) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ deactivatepi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ source ~/tflite-new/bin/activate(tflite-new) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_int8_static.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Processing frames:   0%|                                                                                                                                           | 0/141 [00:00<?, ?it/s]/home/pi/aXeleRate/example_scripts/tensorflow_lite/detector/cv_utils.py:349: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return np.array(temp_list)
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:30<00:00,  4.64it/s]
Finished processing frames
Average time(ms):  191.0
FPS:  5.2356020942408374
(tflite-new) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_int8_static.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Processing frames:   0%|                                                                                                                                           | 0/141 [00:00<?, ?it/s]/home/pi/aXeleRate/example_scripts/tensorflow_lite/detector/cv_utils.py:349: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return np.array(temp_list)
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:14<00:00,  9.80it/s]
Finished processing frames
Average time(ms):  76.0
FPS:  13.157894736842104
(tflite-new) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ deactivatepi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ source ~/tflite-vanilla/bin/activate(tflite-vanilla) pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ python3 detector_file.py --model yolo_best_recall_int8_static.tflite --labels labels.txt --file ~/raspberry_pi/samples/test_s.mp4 
OpenCV version: 4.5.2
Processing frames: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 141/141 [00:13<00:00, 10.24it/s]
Finished processing frames
Average time(ms):  74.0
FPS:  13.513513513513514

Maratyszcza commented 3 years ago

I got an RPi 4 with ARM64 Raspbian, and here's what I'm seeing with //tensorflow/lite/tools/benchmark:benchmark_model benchmark on your last model:

Single-threaded:

Without XNNPACK: 196946 us
With XNNPACK: 173516 us (13% faster)

Using 4 threads:

Without XNNPACK: 69114.1 us
With XNNPACK: 53111.9 us (30% faster)

Maratyszcza commented 3 years ago

Also, XNNPACK delegate got support for quantized PAD operator in tensorflow/tensorflow@75b113b4fbd4944c7a75bca7cadab9c14bc1a958, which should've helped with performance.

AIWintermuteAI commented 3 years ago

@Maratyszcza that's great news! I compiled this commit https://github.com/tensorflow/tensorflow/commit/75b113b4fbd4944c7a75bca7cadab9c14bc1a958 and still not getting inference speed increase... Perhaps I'm building the tflite pip package wrong?

I'm downloading tensorflow repository, checking out the commit, changing the bazel build parameters in build_pip_package_with_bazel.sh to

# Build python interpreter_wrapper.
cd "${BUILD_DIR}"
case "${TENSORFLOW_TARGET}" in
  armhf)
    BAZEL_FLAGS="--config=elinux_armhf
      --copt=-march=armv7-a --copt=-mfpu=neon-vfpv4
      --copt=-O3 --copt=-fno-tree-pre --copt=-fpermissive
      --define tensorflow_mkldnn_contraction_kernel=0
      --define=raspberry_pi_with_neon=true
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true"
    ;;
  aarch64)
    BAZEL_FLAGS="--config=elinux_aarch64
      --define tensorflow_mkldnn_contraction_kernel=0
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true
      --copt=-O3"
    ;;
  native)
    BAZEL_FLAGS="--copt=-O3 --copt=-march=native
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true"
    ;;
  *)
    BAZEL_FLAGS="--copt=-O3
      --define=tflite_pip_with_flex=true
      --define=tflite_with_xnnpack=true
      --define=xnn_enable_qs8=true"      
    ;;
esac

and then build with

sudo CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3.7 -e CROSSTOOL_PYTHON_INCLUDE_PATH=/usr/include/python3.7"   tensorflow/tools/ci_build/ci_build.sh PI-PYTHON37   tensorflow/lite/tools/pip_package/build_pip_package_with_bazel.sh aarch64

I can attach the resulting wheel if necessary. Are there parameters I can enable on BUILD that would allow me to see if model is executed with XNNPACK? As of now it only says INFO: Created TensorFlow Lite XNNPACK delegate for CPU. which means that TFLite was built with XNNPACK, but as I've learned doesn't mean that model graph is executed with XNNPACK.

Maratyszcza commented 3 years ago

@AIWintermuteAI You're probably misssing -c opt in Bazel flags

AIWintermuteAI commented 3 years ago

@Maratyszcza I'm using the bazel compilation script from here https://github.com/tensorflow/tensorflow/blob/2a1900bf09acea61a79278dd633f71ec5a268559/tensorflow/lite/tools/pip_package/build_pip_package_with_bazel.sh#L97 It is marked as Alternative build with Bazel (experimental) in README https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/pip_package#alternative-build-with-bazel-experimental

-c opt flag does seem to be present... Since you're actively developing package you're probably not using the same approach for building the pip package as me, so perhaps there is a bug/error somewhere along that way. How do you build the package? Alternatively, if you have the time, you can try building with Bazel according to this instruction yourself, to see if you can reproduce the problem (pip package built does seem to use XNNPACK by default, but it doesn't include QS8 optimization).

Maratyszcza commented 3 years ago

@AIWintermuteAI There was a bug in TensorFlow Lite that bypassed implicitly applying the delegate unless the model has floating-point tensors, and it looks like you were hit by it. Fixed in tensorflow/tensorflow@2509a6e82e6eb95888de697539845089923c23d5

AIWintermuteAI commented 3 years ago

Hi @Maratyszcza ! Good news! I was able to reproduce the results with benchmark_model tool built from fdfd1e09894e082e13314dffc9d36990524ac3f1

Here are test results: with XNNPACK 55744.2 us

pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ ./benchmark_model --graph=yolo_best_recall_int8_static.tflite --use_xnnpack=true --num_threads=4STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [yolo_best_recall_int8_static.tflite]
#threads used for CPU inference: [4]
Use xnnpack: [1]
Loaded model yolo_best_recall_int8_static.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
XNNPACK delegate created.
Explicitly applied XNNPACK delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
The input model file size (MB): 3.67827
Initialized session in 49.438ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=10 first=65683 curr=53045 min=52869 max=65683 avg=54366.2 std=3778

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=54830 curr=53318 min=52522 max=113361 avg=55744.2 std=9900

Inference timings in us: Init: 49438, First inference: 65683, Warmup (avg): 54366.2, Inference (avg): 55744.2
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=9.79688 overall=18.6016

and without XNNPACK 73528.5 us

pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ ./benchmark_model --graph=yolo_best_recall_int8.tflite --num_threads=4
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [yolo_best_recall_int8.tflite]
#threads used for CPU inference: [4]
Loaded model yolo_best_recall_int8.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
ERROR: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#99 is a dynamic-sized tensor).
ERROR: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The input model file size (MB): 3.68005
Initialized session in 9.273ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=7 first=95914 curr=72995 min=72995 max=95914 avg=78144.7 std=8175

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=74229 curr=70705 min=70705 max=78724 avg=73528.5 std=1034

Inference timings in us: Init: 9273, First inference: 95914, Warmup (avg): 78144.7, Inference (avg): 73528.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=3.28125 overall=13.2383

Two things worth taking note: 1) XNNPACK seems to be forcibly applied to inference for INT8 models. If I run inference on the same model (yolo_best_recall_int8_static.tflite) without --use_xnnpack=true according to inference results XNNPACK is still applied to graph, check the output here

pi@raspberrypi:~/aXeleRate/example_scripts/tensorflow_lite/detector $ ./benchmark_model --graph=yolo_best_recall_int8_static.tflite  --enable_op_profiling=true --num_threads=4STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [yolo_best_recall_int8_static.tflite]
Enable op profiling: [1]
#threads used for CPU inference: [4]
Loaded model yolo_best_recall_int8_static.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 3.67827
Initialized session in 18.777ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=9 first=64791 curr=56295 min=54873 max=65831 avg=57758.2 std=4065

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=58154 curr=53809 min=53077 max=60727 avg=55043.2 std=1562

Inference timings in us: Init: 18777, First inference: 64791, Warmup (avg): 57758.2, Inference (avg): 55043.2
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=8.97656 overall=18.7227
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
         ModifyGraphWithDelegate                    0.000          15.572          15.572        99.502%         99.502%          5052.000              1       ModifyGraphWithDelegate/0
                 AllocateTensors                   15.535           0.073           0.039         0.498%        100.000%             0.000              2       AllocateTensors/0

============================== Top by Computation Time ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
         ModifyGraphWithDelegate                    0.000          15.572          15.572        99.502%         99.502%          5052.000              1       ModifyGraphWithDelegate/0
                 AllocateTensors                   15.535           0.073           0.039         0.498%        100.000%             0.000              2       AllocateTensors/0

Number of nodes executed: 2
============================== Summary by node type ==============================
                     [Node type]          [count]         [avg ms]          [avg %]         [cdf %]       [mem KB]      [times called]
         ModifyGraphWithDelegate                1           15.572          99.502%         99.502%       5052.000              1
                 AllocateTensors                1            0.078           0.498%        100.000%          0.000              2

Timings (microseconds): count=1 curr=15650
Memory (bytes): count=0
2 nodes observed

Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
                        QUANTIZE                    0.018           0.519           0.463         0.842%          0.842%             0.000              1       [input_1_int8]:0
           TfLiteXNNPackDelegate                    0.482          57.562          54.507        99.104%         99.946%             0.000              1       [yolo/detection_layer_1/BiasAdd;yolo/detection_layer_1/Conv2D;yolo/detection_layer_1/BiasAdd/ReadVariableOp/resource1]:36
                         RESHAPE                   54.991           0.008           0.009         0.016%         99.962%             0.000              1       [Identity_int8]:34
                      DEQUANTIZE                   55.000           0.017           0.021         0.038%        100.000%             0.000              1       [Identity]:35

============================== Top by Computation Time ==============================
                     [node type]                  [start]         [first]        [avg ms]            [%]          [cdf%]          [mem KB]      [times called]  [Name]
           TfLiteXNNPackDelegate                    0.482          57.562          54.507        99.104%         99.104%             0.000              1       [yolo/detection_layer_1/BiasAdd;yolo/detection_layer_1/Conv2D;yolo/detection_layer_1/BiasAdd/ReadVariableOp/resource1]:36
                        QUANTIZE                    0.018           0.519           0.463         0.842%         99.946%             0.000              1       [input_1_int8]:0
                      DEQUANTIZE                   55.000           0.017           0.021         0.038%         99.984%             0.000              1       [Identity]:35
                         RESHAPE                   54.991           0.008           0.009         0.016%        100.000%             0.000              1       [Identity_int8]:34

Number of nodes executed: 4
============================== Summary by node type ==============================
                     [Node type]          [count]         [avg ms]          [avg %]         [cdf %]       [mem KB]      [times called]
           TfLiteXNNPackDelegate                1           54.507          99.107%         99.107%          0.000              1
                        QUANTIZE                1            0.463           0.842%         99.949%          0.000              1
                      DEQUANTIZE                1            0.020           0.036%         99.985%          0.000              1
                         RESHAPE                1            0.008           0.015%        100.000%          0.000              1

Timings (microseconds): count=50 first=58106 curr=53767 min=53034 max=60684 avg=55000.1 std=1562
Memory (bytes): co

2) Inference with TFlite interpreter still doesn't show any improvements... Not sure why is that, if I'm not able to rectify the issue myself, I'll create a separate issue.

Keep on doing great job with XNNPACK!

bernalourodri commented 2 years ago

I can't run a pose estimation mediapipe script that I made on a jetson nano. It seems to me that I can't activate xnnpack. Does anyone have any idea how I can resolve this? Thanks in advance