tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

tfl.concatenation op quantization parameters violate the same scale constraint #1053

Closed: akrapukhin closed this issue 10 months ago

akrapukhin commented 1 year ago

Describe the bug
When there is a concat layer in a model, the TFLite converter's convert() may fail with an error like this:

error: 'tfl.concatenation' op quantization parameters violate the same scale constraint: !quant.uniform<i8:f32, 0.046824861040302354:-1> vs. !quant.uniform<i8:f32, 0.0039213619980157594:-128>

However, if I add a superfluous dimension to each input tensor going to concat, and then remove this dimension after concat (see RESHAPE_TRICK below), the model is converted.
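In isolation, the workaround looks like the sketch below (the shapes here are illustrative only; the full reproduction under RESHAPE_TRICK is further down):

import tensorflow as tf

# Two example branches with matching shapes (illustrative only).
a = tf.keras.layers.Input(shape=(8, 8, 3))
b = tf.keras.layers.Input(shape=(8, 8, 3))

# Add a dummy leading axis to each tensor feeding the concat...
a_r = tf.keras.layers.Reshape((1, 8, 8, 3))(a)
b_r = tf.keras.layers.Reshape((1, 8, 8, 3))(b)

# ...concatenate as before...
cat = tf.keras.layers.concatenate([a_r, b_r], axis=-1)

# ...then squeeze the dummy axis back out.
out = tf.keras.layers.Reshape((8, 8, 6))(cat)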

System information

TensorFlow version: 2.12.0

TensorFlow Model Optimization version (binary): 0.7.3

Python version: 3.9.16

Describe the expected behavior
A model with a concat layer should be quantized without strange reshape tricks.

Describe the current behavior
An error occurs when calling convert().

Code to reproduce the issue
If RESHAPE_TRICK is False, the code produces the error above; if it is set to True, the model converts successfully.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# QuantizeConfig that leaves weights and activations alone and only quantizes layer outputs.
class MyQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
  def get_weights_and_quantizers(self, layer):
    return []

  def get_activations_and_quantizers(self, layer):
    return []

  def set_quantize_weights(self, layer, quantize_weights):
    pass

  def set_quantize_activations(self, layer, quantize_activations):
    pass

  def get_output_quantizers(self, layer):
    return [tfmot.quantization.keras.quantizers.MovingAverageQuantizer(num_bits=8, per_axis=False, symmetric=False, narrow_range=False)]

  def get_config(self):
    return {}

# Settings
RESHAPE_TRICK = False # set to True and error will go away
B = 4
H = 8
W = 8
C = 3

# Model
input_1 = tf.keras.layers.Input(shape=(H, W, C))
input_2 = tf.keras.layers.Input(shape=(1, H*W))

input_1_reshaped = tf.keras.layers.Reshape((H*W, C))(input_1)
branch_1 = tfmot.quantization.keras.quantize_annotate_layer(tf.keras.layers.Dot(axes=(2,1)), MyQuantizeConfig())([input_2, input_1_reshaped])

branch_1 = tfmot.quantization.keras.quantize_annotate_layer(tf.keras.layers.Multiply(), MyQuantizeConfig())([input_1, branch_1])

branch_2 = input_1

if RESHAPE_TRICK:
    # Workaround: add a dummy leading axis to each tensor feeding the concat.
    branch_1 = tf.keras.layers.Reshape((1 + branch_1.shape[1:]))(branch_1)
    branch_2 = tf.keras.layers.Reshape((1 + branch_2.shape[1:]))(branch_2)

output = tf.keras.layers.concatenate([branch_1, branch_2], axis=-1)

if RESHAPE_TRICK:
    # Workaround: remove the dummy axis again after the concat.
    output = tf.keras.layers.Reshape((output.shape[2:]))(output)

model = tf.keras.Model(inputs=[input_1, input_2], outputs=output)

# Training
loss_func = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()

with tfmot.quantization.keras.quantize_scope({'MyQuantizeConfig': MyQuantizeConfig}):
    annotated_model = tfmot.quantization.keras.quantize_annotate_model(model)
    quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)

    for i in range(10):
        with tf.GradientTape() as tape:
            output = quant_aware_model([tf.random.uniform(shape=B + input_1.shape[1:]), tf.random.uniform(shape=B + input_2.shape[1:])], training=True)
            loss_value = loss_func(tf.random.uniform(shape=B + output.shape[1:]), output)

        grads = tape.gradient(loss_value, quant_aware_model.trainable_variables)
        optimizer.apply_gradients(zip(grads, quant_aware_model.trainable_variables))

    # uncomment to make the branch_2 range equal to branch_1 range (doesn't help)
    #quant_aware_model.non_trainable_variables[0].assign(quant_aware_model.non_trainable_variables[11].value())
    #quant_aware_model.non_trainable_variables[1].assign(quant_aware_model.non_trainable_variables[12].value())

    converter = tf.lite.TFLiteConverter.from_keras_model(quant_aware_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    quantized_tflite_model = converter.convert()
    print("converted")

Additional context
I also tried to make both inputs to the concat layer have the same min/max range to force the scales to be identical (see the commented-out code just before the converter), but it didn't work because the scales still end up slightly different for some reason:

error: 'tfl.concatenation' op quantization parameters violate the same scale constraint: !quant.uniform<i8:f32, 0.046824861040302354:-1> vs. !quant.uniform<i8:f32, 0.046827164818258847:-1>

Xhark commented 1 year ago

The TFLite converter makes a number of assumptions about QAT models, which reflect the TFLite quantized op specs here: https://www.tensorflow.org/lite/performance/quantization_spec

For the concat op, the spec above has a restriction: inputs and outputs must all have the same scale/zero_point.
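To make that restriction concrete, here is a minimal NumPy sketch (the scales are rounded from the error message above, purely for illustration) of why a quantized concat cannot simply join int8 tensors that carry different quantization parameters:

import numpy as np

# Per the TFLite quantization spec: real_value = scale * (int8_value - zero_point).
def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

q = np.array([10, 20, 30], dtype=np.int8)

# The same stored int8 values decode to different real values under different
# parameters, so a quantized concat can only pass values through unchanged
# when every operand shares one scale/zero_point.
print(dequantize(q, scale=0.0468, zero_point=-1))    # ~[0.51, 0.98, 1.45]
print(dequantize(q, scale=0.0039, zero_point=-128))  # ~[0.54, 0.58, 0.62]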

For the QAT case, we usually remove all quantization from the inputs of the concat op and move it to the output of the concat op. TFLite doesn't support the case where the inputs of a concat op have different quantization params.

I'd recommend moving quantization to the output side of the concat, or explicitly adding identity-like ops on the inputs of the concat, as you did.
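As a rough sketch of the first suggestion, reusing the output-only MyQuantizeConfig from the report above (not verified end to end against the converter), the Concatenate layer itself could be annotated so that the single output quantizer sits after the concat:

# Sketch: annotate the concat layer so quantization is attached to its output,
# rather than to the two tensors feeding it (MyQuantizeConfig is defined above).
output = tfmot.quantization.keras.quantize_annotate_layer(
    tf.keras.layers.Concatenate(axis=-1),
    MyQuantizeConfig())([branch_1, branch_2])

The alternative is the explicit reshape/identity-style workaround already shown under RESHAPE_TRICK in the report.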

tucan9389 commented 10 months ago

@akrapukhin @Xhark cc. @yyoon Hello, I fixed this issue in this commit. Please let me know if you encounter this issue again. Thanks for your report and patience.