tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Fix the integration of TFLite models with multiple signatures and quantization #843

Open leandro-gracia-gil opened 2 years ago

leandro-gracia-gil commented 2 years ago

System information

Describe the feature and the current behavior/state.

TensorFlow 2.7 is finally adding support for multiple signatures in TFLite models, as described here. However, this new support lacks proper integration with model quantization in general, and with int8 quantization in particular.
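For reference, here is a minimal sketch of the new multi-signature flow; the module, signature keys, and shapes are illustrative, not taken from any real model:

```python
import numpy as np
import tensorflow as tf

# Two tf.functions exported as separate signatures; names and shapes are
# made up for this sketch.
class TwoSignatureModule(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def encode(self, x):
        return {"encoded": x * 2.0}

    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def decode(self, y):
        return {"decoded": y / 2.0}

module = TwoSignatureModule()
tf.saved_model.save(
    module, "/tmp/two_sig_model",
    signatures={"encode": module.encode, "decode": module.decode})

# TF 2.7+: both signatures are kept in the converted flatbuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/two_sig_model")
tflite_model = converter.convert()

# Each signature can then be invoked by key at inference time.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
encode = interpreter.get_signature_runner("encode")
print(encode(x=np.ones((1, 8), dtype=np.float32)))
```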

Currently, full integer quantization is achieved by restricting TFLiteConverter to a specific set of supported ops, as described here. This also allows choosing the dtype of inputs and outputs through the inference_input_type and inference_output_type attributes, which can either be float (with automatic quantization/dequantization at the model boundaries) or a quantized type, so that the model accepts and produces quantized values directly.
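For context, this is roughly what that setup looks like today for a single-signature model (the path, shapes, and sample counts are placeholders):

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration data used to pick quantization ranges; shape and sample
    # count are placeholders for this sketch.
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the converter to the int8 builtin op set for full integer quantization.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Either keep float inputs/outputs (automatic quantize/dequantize at the
# boundaries) or, as below, make the model take and produce int8 tensors directly.
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()
```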

However, when multiple signatures are provided, this quantization still applies to only one of them. In particular, I found that when using TFLiteConverter.from_concrete_functions, the quantized signature is not even the first one in the provided list, but the first one alphabetically by function name. This is certainly not a good design.

Furthermore, the function selected for quantization is limited to taking a single input, which is the one inference_input_type applies to. I haven't checked whether the same happens with outputs, but I wouldn't be surprised if it does.
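To illustrate, here is a rough sketch of the setup described above (function names and shapes are made up for the example); with the int8 settings below, only one of the two functions ends up quantized, and it is chosen by name rather than by position in the list:

```python
import numpy as np
import tensorflow as tf

class TwoSignatureModule(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def zeta(self, x):   # listed first, but alphabetically last
        return x * 2.0

    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def alpha(self, x):  # listed second, but alphabetically first
        return x + 1.0

module = TwoSignatureModule()
concrete_fns = [module.zeta.get_concrete_function(),
                module.alpha.get_concrete_function()]

def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

# TF 2.7+: from_concrete_functions takes a list of functions plus the
# object that owns them.
converter = tf.lite.TFLiteConverter.from_concrete_functions(concrete_fns, module)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
# Per the report above, only one signature ends up fully quantized, selected
# alphabetically by function name ("alpha" here), not by list order.
```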

Will this change the current API? How?

The API needs to be redesigned, or new APIs provided, to be coherent with support for multiple signatures. In particular:

Who will benefit from this feature?

Anyone using TFLite model conversion with multiple signatures, or even with just one signature when quantizing a function that takes more than a single quantized argument.

Any Other info.

Partially tangential to this, I've noticed that the quantization parameters used by quantization-aware training do not seem to quite match the ones produced when converting to TFLite models. This is a potential source of issues, as models are trained for a quantization different from the one applied later in practice. Maybe TFLite quantization should enforce keeping the quantization parameters of any layers annotated with fake quantization? If so, representative input data should not be needed in this case.
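For example, a minimal sketch of the QAT-then-convert flow in question (the model, data, and training settings are placeholders); in this flow the fake-quant ranges learned during training would ideally be carried over as-is, which is why no representative dataset should be needed:

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder Keras model and data, just to show the flow.
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4),
])

# Quantization-aware training: fake-quant nodes learn per-layer ranges.
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam", loss="mse")
qat_model.fit(np.random.rand(32, 8), np.random.rand(32, 4), epochs=1, verbose=0)

# Converting a QAT model: Optimize.DEFAULT is expected to reuse the trained
# fake-quant ranges (no representative dataset); the concern above is that
# the converter's parameters do not always match those ranges exactly.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()
```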

mohantym commented 2 years ago

Hi @jvishnuvardhan! Could you please look at this feature request?