Are you willing to contribute it (Yes/No): No, this has a broader impact on the TFLite conversion & quantization API design.
Describe the feature and the current behavior/state.
TensorFlow 2.7 is finally adding support for multiple signatures in TFLite models as described here. However, this new support lacks proper integration with model quantization in general, and with int8 quantization in particular.
Currently, full integer quantization is achieved by selecting a specific set of supported ops in TFLiteConverter, as described here. This also allows selecting the dtype of inputs and outputs through the inference_input_type and inference_output_type attributes, which can be float for automatic quantization/dequantization, or an integer type so the model accepts/produces quantized values directly.
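For reference, the current single-signature flow looks roughly like this (a minimal sketch of the documented post-training int8 setup; the saved-model path and the representative dataset are placeholders):

```python
import tensorflow as tf

def representative_dataset():
    # Placeholder: yield a few samples shaped like the model's single input.
    for _ in range(100):
        yield [tf.random.uniform([1, 224, 224, 3], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to the int8 builtin ops for full integer quantization.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# These two attributes apply only to the single quantized entry point.
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```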
However, when providing multiple signatures, this quantization still applies to only one of them. In particular, I found that when using TFLiteConverter.from_concrete_functions, the quantized one is not even the first one provided in the list of signatures, but the first one alphabetically by its function name. This is certainly not a good design.
Furthermore, the function selected for quantization is limited to taking a single input, which is the one inference_input_type applies to. I haven't checked whether the same happens with outputs, but I wouldn't be surprised if it does.
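To make the multi-signature case concrete, here is a minimal sketch of passing two concrete functions to the converter (the Model class and function names are illustrative). Combined with the int8 settings from the previous sketch, only one of the resulting signatures ends up quantized, selected by function name rather than by list order:

```python
import tensorflow as tf

class Model(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def encode(self, x):
        return x * 0.5

    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def decode(self, code):
        return code * 2.0

model = Model()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [model.decode.get_concrete_function(),
     model.encode.get_concrete_function()],
    model,
)
# If the int8 settings from the previous sketch are also applied, only one
# signature gets quantized, and inference_input_type only affects that
# signature's single input.
tflite_model = converter.convert()
```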
Will this change the current api? How?
The API needs to be redesigned, or new APIs provided, to be coherent with support for multiple signatures. In particular:
Whether to apply quantization, which kind of quantization, and its settings should be decided at the signature level instead of the model level. Function names should be irrelevant to this.
Quantization should not be limited to single-input functions. Rather, it should be possible to quantize or dequantize any desired set of inputs or outputs in any signature. Since indicating this per argument (and especially per output) might become quite cumbersome, the simplest approach might be to provide a way to easily quantize and dequantize values at will within the signatures themselves (see the sketch after this list). This might require providing representative samples for all quantized signatures (unless quantization-aware training is used, in which case the model should already have quantization ranges to use).
Bonus for extra points: allow a signature to produce quantized outputs that will be used as inputs in another signature. To put it another way, allow using the same set of quantization parameters for corresponding quantized/dequantized values across signatures. For example, imagine an autoencoder where the encoder quantizes only its outputs for storage (not its computation or weights), and the decoder takes already quantized inputs and runs with int8 quantization.
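A purely hypothetical sketch of what such a per-signature API could look like; none of the attributes used below (signature_quantization, shared_quantization_params) exist in TFLiteConverter today, and the signature names and datasets are placeholders. It only illustrates the idea that quantization choices attach to signatures and that quantization parameters could be shared across them:

```python
import tensorflow as tf

def encoder_samples():
    # Placeholder representative data for the encoder signature.
    for _ in range(100):
        yield [tf.random.uniform([1, 64], dtype=tf.float32)]

def decoder_samples():
    # Placeholder representative data for the decoder signature.
    for _ in range(100):
        yield [tf.random.uniform([1, 16], dtype=tf.float32)]

# Hypothetical API: none of the attributes below exist today; they only
# sketch the proposal of configuring quantization per signature.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/autoencoder")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Encoder: float computation and weights, but its output is quantized for storage.
converter.signature_quantization["encode"] = {
    "quantize_outputs": ["code"],
    "representative_dataset": encoder_samples,
}

# Decoder: full int8, taking already-quantized inputs.
converter.signature_quantization["decode"] = {
    "supported_ops": [tf.lite.OpsSet.TFLITE_BUILTINS_INT8],
    "inference_input_type": tf.int8,
    "representative_dataset": decoder_samples,
}

# Reuse the encoder output's quantization parameters for the decoder input,
# so the stored "code" can be fed to the decoder without requantization.
converter.shared_quantization_params = [("encode:code", "decode:code")]

tflite_model = converter.convert()
```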
Who will benefit with this feature?
Anyone using TFLite model conversion with multiple signatures, or even just one signature when quantizing a function that takes more than a single quantized argument.
Any Other info.
Partially tangential to this, I've noticed that the quantization parameters used by quantization-aware training do not seem to quite match the ones produced during TFLite conversion. This is a potential source of issues, as models end up trained for a quantization different from the one applied later in practice. Maybe TFLite quantization should enforce keeping the quantization parameters of any layers annotated with fake quantization? If so, representative input data should not be needed in that case.
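For context, a minimal sketch of the QAT-then-convert flow being discussed, using tensorflow_model_optimization (the model architecture is a placeholder and fine-tuning is omitted). The concern above is that the fake-quantization ranges learned in this flow are not guaranteed to be carried over verbatim by the converter:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder float model.
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

# Insert fake-quantization nodes so quantization ranges are learned during training.
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam", loss="mse")
# qat_model.fit(...)  # fine-tuning omitted here

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Ideally the ranges learned above would be kept exactly as-is here,
# making representative input data unnecessary.
tflite_model = converter.convert()
```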