tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

TFLite full integer quantization bug #775

Open CvHadesSun opened 3 years ago

CvHadesSun commented 3 years ago

Hi, I am trying to quantize my own tf-keras model for the NNAPI delegate. First, the original tf-keras model is 5.6 MB and the quantized int8 tflite model is about 1.8 MB, which is not a 4x reduction. More importantly, when I run the quantized int8 tflite model with the benchmark tool (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark), inference with use_nnapi=true is slower than with use_nnapi=false (202883 us vs 45882.2 us). The pre-quantized models from TensorFlow Hub run fine on my mobile device (about a 3x time reduction), but when I quantize the original SavedModel-format models from TensorFlow Hub myself, following the tutorial (https://www.tensorflow.org/lite/performance/post_training_quantization), the result is the same as with my own quantized int8 tflite model. For example, mobilenet_v2_130_224: size 21.6 MB / 6.3 MB, time (nnapi true/false) 104323 us / 42137.5 us. The other models behave the same as mine. I also quantized the MNIST example and got the same result (with nnapi=true, inference time is longer than with nnapi=false). I also wanted to quantize the hosted example models (both the quantized and the original ones), but they are not in SavedModel format, so I could not load them. So, is there a problem in my quantization process?
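
For reference, here is a minimal sketch of the full-integer post-training quantization flow from that tutorial, assuming a Keras model named model and a calibration generator representative_data_gen (both placeholder names, not the actual model from this issue):

import tensorflow as tf

# Full-integer post-training quantization, following the linked tutorial.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Require int8 kernels for all ops; conversion fails if an op cannot be quantized.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()

with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_quant_model)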


Thx.

teijeong commented 3 years ago

Hi @CvHadesSun ,

Seems mostly working as intended; let me know if I got something wrong.

  1. Model size: only the weights are expected to shrink by about 4x. Biases are quantized to int32, so their size stays the same, and the model file also carries graph information, so the result has some overhead on top of the quantized weights.
  2. NNAPI does support int8, but it depends on whether your NPU supports int8 inference. Can you confirm that it can run int8 inference via NNAPI? Another possible cause is communication or initialization overhead for the NPU, so please make sure your numbers come from multiple inferences, excluding the first one (see the timing sketch below).
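
A minimal sketch of that kind of measurement with the Python tf.lite.Interpreter, assuming the converted model was saved as model_int8.tflite (a placeholder name); the first, warm-up invocation is excluded from the average. The same idea applies to the C++ benchmark tool, which separates warm-up runs from measured runs:

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model_int8.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details['shape'], dtype=input_details['dtype'])

# Warm-up run, excluded from the timing.
interpreter.set_tensor(input_details['index'], dummy)
interpreter.invoke()

times = []
for _ in range(50):
    interpreter.set_tensor(input_details['index'], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    times.append(time.perf_counter() - start)
print('mean inference time: %.1f us' % (1e6 * sum(times) / len(times)))
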
CvHadesSun commented 3 years ago

Thx for your reply @teijeong : 1. First, I understand the model size issue now, thank you.

  2. I tested several quantized int8 models on my NPU with the NNAPI delegate, and they are all slower than without the NNAPI delegate. But I found that the official models at https://www.tensorflow.org/lite/guide/hosted_models are uint8-quantized tflite models. So I want to quantize my model to uint8 to test the inference time on the NPU with the NNAPI delegate, but I could not find documentation on how to do this; there is only an int8 tutorial. Could you give me some advice? Thx!
teijeong commented 3 years ago

Can you try setting converter.inference_type = tf.uint8 ? (and converter.inference_input_type and converter.inference_output_type if needed)

CvHadesSun commented 3 years ago

Thx for the reply. 1. Now I am sure my NPU only supports uint8 with the NNAPI delegate. I tried setting converter.inference_type = tf.uint8 and setting inference_input_type and inference_output_type to tf.uint8, but whenever I quantize a SavedModel-format keras model, the model weights are always int8, not uint8 (a sketch for checking the resulting tensor types follows the log below). The quantization code and log are:

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] # or converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
converter.inference_type = tf.uint8
converter.inference_input_type = tf.uint8  
converter.inference_output_type = tf.uint8  
tflite_quant_model = converter.convert()

log:

2021-08-05 16:16:53.576560: I tensorflow/cc/saved_model/reader.cc:100] Reading SavedModel from: ../../weights/mnist
2021-08-05 16:16:53.587798: I tensorflow/cc/saved_model/reader.cc:71] Reading meta graph with tags { serve }
2021-08-05 16:16:53.587821: I tensorflow/cc/saved_model/reader.cc:144] Reading SavedModel debug info (if present) from: ../../weights/mnist
2021-08-05 16:16:53.635796: I tensorflow/cc/saved_model/loader.cc:210] Restoring SavedModel bundle.
2021-08-05 16:16:53.995971: I tensorflow/cc/saved_model/loader.cc:194] Running initialization op on SavedModel bundle at path: ../../weights/mnist
2021-08-05 16:16:54.038379: I tensorflow/cc/saved_model/loader.cc:283] SavedModel load for tags { serve }; Status: success: OK. Took 462029 microseconds.
2021-08-05 16:16:54.197794: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
fully_quantize: 0, inference_type: 6, input_inference_type: 3, output_inference_type: 3
WARNING:absl:For model inputs containing unsupported operations which cannot be quantized, the `inference_input_type` attribute will default to the original type.
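
As a side note, one way to check what the converter actually produced is to inspect the tensor types with the Python interpreter; a sketch, assuming the converted model was written to model_quant.tflite (a placeholder name):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model_quant.tflite')
interpreter.allocate_tensors()

print('input dtype: ', interpreter.get_input_details()[0]['dtype'])
print('output dtype:', interpreter.get_output_details()[0]['dtype'])

# List every tensor with its dtype; with the TF2 converter the weights show up as int8.
for t in interpreter.get_tensor_details():
    print(t['name'], t['dtype'])
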
  2. Using the script below, I quantized the example frozen graph (.pb) from https://www.tensorflow.org/lite/guide/hosted_models, and I finally got a uint8 quantized tflite model, which runs fine on my NPU:
    
import tensorflow as tf

# Convert the model.
converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='/path/to/mobilenet_v1_1.0_224/frozen_graph.pb',
    input_arrays=['input'],
    input_shapes={'input': [1, 224, 224, 3]},
    output_arrays=['MobilenetV1/Predictions/Softmax'],
)
converter.quantized_input_stats = {'input': (0., 1.)}  # mean, std_dev (input range is [-1, 1])
converter.inference_type = tf.int8  # this is the recommended type.
converter.inference_input_type = tf.uint8  # optional
converter.inference_output_type = tf.uint8  # optional
tflite_model = converter.convert()

# Save the model.
with open('quantized_model.tflite', 'wb') as f:
    f.write(tflite_model)



3. Are there any other methods to solve my problem? :)
yangcheng commented 3 years ago

@teijeong I found a note on https://www.tensorflow.org/lite/performance/quantization_spec stating that uint8 is from the old tooling. Do you happen to know in which version the change happened, and where we can find the old documentation? Thanks

Note: In the past our quantization tooling used per-tensor, asymmetric, uint8 quantization. New tooling, reference kernels, and optimized kernels for 8-bit quantization will use this spec.

MeghnaNatraj commented 3 years ago

In the TF2 converter:

Is your model trained in TF1? If yes, you can convert and uint8-quantize your TF1 SavedModel in TF2 as follows:

import tensorflow as tf

converter = tf.compat.v1.lite.TFLiteConverter.from_saved_model(saved_model_dir,
    input_arrays=['input'],
    input_shapes={'input' : [1, 224, 224,3]},
    output_arrays=['MobilenetV1/Predictions/Softmax']
)
converter.quantized_input_stats = {'input' : (0., 1.)}  # mean, std_dev (input range is [-1, 1])
converter.inference_type = tf.int8 # this is the recommended type.
# converter.inference_input_type=tf.uint8 # optional
# converter.inference_output_type=tf.uint8 # optional
tflite_model = converter.convert()

# Save the model.
with open('quantized_model.tflite', 'wb') as f:
  f.write(tflite_model)
CvHadesSun commented 3 years ago

@MeghnaNatraj , I tried your advice, rebuilt the model with tf1.13.1, and got the uint8 quantized tflite model, thank you. The tf1.x quantization API needs converter.default_ranges_stats; if I don't use the quantization-aware training method, is there another way to find the default_ranges_stats (min and max values)? Thx.

MeghnaNatraj commented 3 years ago

You don't need the default_ranges_stats flag for quantization. It's an optional field that we discourage users from using if possible.

Is there anything missing in your model? Are you looking to modify it further, and in what way?
