tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Failed with Post Training Quantization after Quantization Aware Training #408

Closed kalaluthien closed 4 years ago

kalaluthien commented 4 years ago

Describe the request

I am working with recent neural networks targeting mobile devices, and I found there are obstacles to performing integer quantization after QAT.

I know these APIs are not available now, but if you have plans to address the following issues, please let me know when they will be available :)

  1. AveragePooling2D

    x = layers.Conv2D(32, 5, padding='same', activation='relu')(input)
    x = layers.AveragePooling2D((2, 2), (2, 2), padding='same')(x)  #<- converts successfully, fails to prepare
    x = layers.Conv2D(64, 5, padding='same', activation='relu')(x)

    tensorflow/lite/kernels/pooling.cc:94 input->params.scale != output->params.scale (-1045139600 != 653455232) Node number 2 (AVERAGE_POOL_2D) failed to prepare.

    • Same as the MaxPooling2D problem.
  2. MaxPooling2D

    x = layers.Conv2D(32, 5, padding='same', activation='relu')(input)
    x = layers.MaxPooling2D((2, 2), (2, 2), padding='same')(x)  #<- converts successfully, fails to prepare
    x = layers.Conv2D(64, 5, padding='same', activation='relu')(x)

    tensorflow/lite/kernels/pooling.cc:94 input->params.scale != output->params.scale (-1045139600 != 653454832) Node number 2 (MAX_POOL_2D) failed to prepare.

    • Same as the AveragePooling2D problem.
  3. Residual connection

    input = tf.keras.Input(input_shape)
    shortcut = input
    x = layers.Conv2D(16, 1, padding='same', use_bias=False)(input)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    x = x + shortcut  #<- addition fails to convert because '+' is lowered to a TensorFlowOpLayer, not Add

    Layer tf_op_layer_AddV2:<class 'tensorflow.python.keras.engine.base_layer.TensorFlowOpLayer'> is not supported. You can quantize this layer by passing a tfmot.quantization.keras.QuantizeConfig instance to the quantize_annotate_layer API.

    • This problem causes the failure below.
  4. HardSwish

    x = layers.Conv2D(32, 3, 2, padding='same', use_bias=False)(input)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x + 3) * (1 / 6)  #<- equivalent to `HardSwish`

    Layer tf_op_layer_AddV2_1:<class 'tensorflow.python.keras.engine.base_layer.TensorFlowOpLayer'> is not supported. You can quantize this layer by passing a tfmot.quantization.keras.QuantizeConfig instance to the quantize_annotate_layer API.

    • There are two levels to this problem.
    • I had configured a QuantizeConfig to support TensorFlowOpLayer for the Add and Multiply ops; however, since these ops sit between BN and ReLU6, the Conv2D-BN-ReLU layers could not be fused correctly. -> The quantized MobileNetV3 became slower than the floating-point version on an Android device.
    • The main building block of MobileNetV3, Conv2D-BN-HardSwish, is not a supported pattern.
  5. GlobalAveragePooling-Dense

    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(1024, activation='relu')(x)  #<- converts successfully, fails to prepare

    tensorflow/lite/kernels/kernel_util.cc:129 std::abs(input_product_scale - bias_scale) <= 1e-6 * std::min(input_product_scale, bias_scale) was not true. Node number 4 (FULLY_CONNECTED) failed to prepare.

    • This bug prevents me from benchmarking the official MobileNetV2 network imported from tf.keras.

System information

TensorFlow installed from (source or binary): binary

TensorFlow version: 2.2.0 (release)

TensorFlow Model Optimization version: 0.3.0 (release)

Python version: 3.6.0

Code to reproduce the issue

Gist to reproduce the full test: https://gist.github.com/kalaluthien/b270c71afb6866ae61ef0dc088a762f2

kalaluthien commented 4 years ago

Can you also check MobileNetV2 quantization-aware training followed by post-training integer quantization? I failed to reproduce the results in the official guide and overview using TF 2.2.0 and tfmot 0.3.0. (Were they measured using TF 1.x?)

It would be great if you could share the experiment settings needed to reproduce the speedup & accuracy comparison between float and quantized MobileNetV2! (Or are those experiments not possible with current TF 2.x?)

Notebook: https://gist.github.com/kalaluthien/c44da9bb6d027fbca95a144e07179667#file-mobilenetv2_cifar10-ipynb

Summary:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.experimental_new_converter = True
converter.experimental_new_quantizer = True

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(1)
representative_dataset_gen = get_representative_dataset(dataset, num_calibration_steps=100)
converter.representative_dataset = tf.lite.RepresentativeDataset(representative_dataset_gen)

quantized_tflite_model = converter.convert() 
...
interpreter.invoke()  # RuntimeError: Quantization not yet supported for op: DEQUANTIZE
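For reference, get_representative_dataset above can be sketched like this (an illustrative version only, assuming a single float32 image input per sample):

def get_representative_dataset(dataset, num_calibration_steps=100):
  def representative_dataset_gen():
    for image, _ in dataset.take(num_calibration_steps):
      # the converter expects a list with one array per model input
      yield [tf.cast(image, tf.float32)]
  return representative_dataset_gen
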
nutsiepully commented 4 years ago

Hi @kalaluthien,

Thanks for the well thought out and detailed bug report. Sorry for the delay in getting back - there's generally limited time I take out each week to look at github issues :)

  1. AveragePooling2D
  2. MaxPooling2D

I tried out both of these examples and they converted and ran just fine for me. Perhaps there is a flag you are missing during conversion. Check this file for the conversion code.

Looking at your colab code, converter._experimental_new_quantizer = True is missing. Please try that and let me know how it goes.

  3. TensorFlowOpLayer

This failure is expected. By default, our goal is to support the built-in Keras layers, i.e. the layers under the tf.keras.layers module. TensorFlowOpLayer can be used to wrap any TF op, and it's not feasible to meaningfully cover every TF op.

The recommended approach here is to use built-in Keras layers to achieve this. So you can use tf.keras.layers.Add and tf.keras.layers.Reshape instead of using + and expand_dims. That should solve it.
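For example, a minimal sketch of that rewrite (shapes here are arbitrary):

inputs = tf.keras.Input(shape=(28, 28, 16))
shortcut = inputs
x = tf.keras.layers.Conv2D(16, 1, padding='same', use_bias=False)(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU(6.0)(x)
x = tf.keras.layers.Add()([x, shortcut])  # instead of `x = x + shortcut`
model = tf.keras.Model(inputs, x)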

If you really do want to use something else, it's the user's responsibility to provide an appropriate QuantizeConfig for your use.

  4. HardSwish

This is again the same problem as the TensorFlowOpLayer one. And yes, you are right that the existing pattern only matches Conv+BN+ReLU. The code likely became slow because a bunch of Quant/Dequant ops were added in between. I don't think the converter is likely to match Conv/BN/(the Add+Mul ops making up hard-swish) while folding either.

The proper fix here would be to add support for hard-swish. Can you please file a separate bug requesting HardSwish support? I'll take some time out to add it. We covered MobileNet v1/v2, so this is currently missing.

But we should be able to add support for this. We also need to ensure the converter is handling it properly.

  5. GlobalAveragePooling-Dense

Again, this works for me. That's how we got MobileNetV2 working and produced the results. Perhaps this is the same issue as Average/MaxPooling. Please try the fix out and let me know if that works.

nutsiepully commented 4 years ago

Regarding the MobileNetV2 reproduction, looking at your code it seems you are training on CIFAR. It won't be as straightforward to reproduce the full training.

We trained a Keras MobileNet V2 model with hyperparams from this. We then quantized the model and trained again for a few epochs.
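Roughly, the recipe looks like this (a sketch only; the model source, train_ds/val_ds, and the hyperparameters below are placeholders, not the exact values we used):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.applications.MobileNetV2(weights='imagenet')
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer=tf.keras.optimizers.SGD(1e-4),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
# Fine-tune the quantized model for a few epochs at a low learning rate.
qat_model.fit(train_ds, validation_data=val_ds, epochs=3)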

I think the reason your conversion code is failing is due to

converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

Try removing it, and I think conversion should work. If it doesn't, please let me know. Basically, QAT conversion by default uses float inputs/outputs based on the model signature. There is work in progress in TFLiteConverterV2 to support a different model interface (int8/uint8, etc.).

See this.
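In code, the conversion for a QAT model can be as simple as this (a sketch; `qat_model` is a placeholder for your quantization-aware-trained Keras model):

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# no target_spec.supported_ops override: inputs/outputs stay float32,
# while the internal ops are quantized using the ranges learned during QAT
quantized_tflite_model = converter.convert()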

Hope this helps.

nutsiepully commented 4 years ago

Also, regarding HardSwish if you have the time and are interested, I'm happy to guide you in how to implement support for it :)

kalaluthien commented 4 years ago

Thanks for the great replies!

  1. I'll test them again and report the results here!
  2. I'd be glad to contribute support for the HARD_SWISH op, which is our concern!
kalaluthien commented 4 years ago

Hi, @nutsiepully. I've tested the questions above in the same environment (tf==2.2.0 + tfmot==0.3.0).

  1. I think converter._experimental_new_quantizer = True is already in my code, so there must be another reason. gist link

Code snippet:

for model in models:
  print(f'Convert "{model.name}"')
  converter = tf.lite.TFLiteConverter.from_keras_model(model)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]  # for post-training quantization
  converter.representative_dataset = calibration_gen  # for full-integer quantization
  converter._experimental_new_quantizer = True  # already here!

  with quantize.quantize_scope():  # is this the right place to open quantize_scope?
      tflite_model = converter.convert()
      tflite_models.append((tflite_model, model.name))
...
for model, name in tflite_models:
  try:
    interpreter = tf.lite.Interpreter(model_content=model)
    interpreter.allocate_tensors()

Error messages:

[TFLite] MnistAveragePooling2D error:
tensorflow/lite/kernels/pooling.cc:94 input->params.scale != output->params.scale (-2099980912 != 666249888)Node number 2 (AVERAGE_POOL_2D) failed to prepare.

[TFLite] MnistMaxPooling2D error:
tensorflow/lite/kernels/pooling.cc:94 input->params.scale != output->params.scale (-2099980912 != 666250000)Node number 2 (MAX_POOL_2D) failed to prepare.

[TFLite] MnistDenseAndGAP error:
tensorflow/lite/kernels/kernel_util.cc:129 std::abs(input_product_scale - bias_scale) <= 1e-6 * std::min(input_product_scale, bias_scale) was not true.Node number 4 (FULLY_CONNECTED) failed to prepare.
  2. I fixed the colab code and got past the DEQUANTIZE error. Now it breaks at the preparation step of the fully-connected layer. gist link

Code snippet:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
#converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # <- commented out
#converter.experimental_new_converter = True  # <- commented out because `True` is the default
converter._experimental_new_quantizer = True  # <- added '_' in front of the attribute name

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(1)
representative_dataset_gen = get_representative_dataset(dataset, num_calibration_steps=100)
converter.representative_dataset = tf.lite.RepresentativeDataset(representative_dataset_gen)

quantized_tflite_model = converter.convert() 
...
interpreter = tf.lite.Interpreter(model_content=quantized_tflite_model)
interpreter.allocate_tensors()

Error message:

RuntimeError: tensorflow/lite/kernels/kernel_util.cc:129 std::abs(input_product_scale - bias_scale) <= 1e-6 * std::min(input_product_scale, bias_scale) was not true.
Node number 69 (FULLY_CONNECTED) failed to prepare.

This is the same error as in Q1, which seems to be a bug in the GAP+Dense combination.

  3. Keras Add works! But for the addition I used tf.add() and tf.multiply(), because I need to add/multiply constant values element-wise with a dynamic tensor whose batch dimension is None. Is there another workaround to add constant values without using tf.add() and tf.keras.layers.Lambda()? In short:
    x = tf.keras.Input(...)
    shortcut = x
    x = tf.keras.layers.ReLU(6.0)(x + 3.0) * 1.67  # any equivalent options when using only Keras built-ins?
    x = tf.keras.layers.Multiply()([x, shortcut])

    It would be great if I can contribute support for h-swish, and then we can benchmark MobileNetV3.

Thanks!

nutsiepully commented 4 years ago

Oh, I'm sorry, I made a mistake. I meant use

converter.experimental_new_converter = True

That's what was missing.

nutsiepully commented 4 years ago

Can you try using tf-nightly to run your code? There might be some converter changes that aren't in TF 2.2.

I've successfully converted Dense+BN etc. I think if you use tf-nightly with experimental_new_converter=True, the conversion errors should go away.

nutsiepully commented 4 years ago

As for HardSwish, I just looked into it a bit. There seem to be a few tricky pieces.

For starters, hard_swish has not been added as an activation in Keras yet. The goal of the tfmot library is to provide default behavior for all built-in Keras layers/activations. But since hard_swish is not a built-in activation yet, we can't really add a pattern matching it in the library code. It would need to be handled by the user.

I would recommend adding support for it in your code to begin with. Once hard_swish gets added, we can move this code internally. You should be able to file a bug on keras/tf to check whether they plan to add support for it.

You can create a class HardSwish(Layer) which gets added after Conv + BN. You should be able to use built-in Add and Multiply to do so.
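For example, one way to write such a layer (just a sketch; quantizing it would still need a user-provided QuantizeConfig or a custom transform):

import tensorflow as tf

class HardSwish(tf.keras.layers.Layer):
  """Computes x * relu6(x + 3) / 6."""
  def call(self, inputs):
    return inputs * tf.nn.relu6(inputs + 3.0) * (1.0 / 6.0)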

Next, to understand exactly what support needs to be added, we would need to understand how it executes in TFLite.

I created a simple model.

inp = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(32, 5)(inp)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU(6.0)(x + 3.0) * 1.67
m = tf.keras.Model(inp, x)

m.save('hswish.h5')

Converted it using the following code.

import numpy as np

conv = tf.lite.TFLiteConverter.from_keras_model(m)
conv.optimizations = [tf.lite.Optimize.DEFAULT]  # needed for post-training quantization with the representative dataset

num_calibration_steps = 100

def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    # one sample per step: a list with one batched float32 array per model input
    yield [np.random.rand(1, 28, 28, 1).astype(np.float32)]

conv.representative_dataset = representative_dataset_gen
conv.convert()

# saved as hswish.tflite

[Screenshots of the converted hswish.tflite graph]

As you can see, the converter fuses the Add into the bias, but the Mul comes after.

So you'll need to place the FakeQuant after the Add + ReLU but before the Mul, and likely use a transform similar to this. That should sort the issue out.

kalaluthien commented 4 years ago

Thanks for your advice.

  1. tf-nightly (2.3.0) solves every problem! (except the input/output type after conversion: https://github.com/tensorflow/tensorflow/issues/38285, which is independent of these issues)

  2. You mean I need to place the FakeQuant like Conv2D-BN-[FakeQuant]-Add-ReLU-Mul-[FakeQuant]-OtherLayers with a customized transform using tfmot, and let TFLiteConverter convert { Conv2D-BN: FusedConv2D, Add-ReLU-Mul: HardSwish }?

nutsiepully commented 4 years ago
  1. Glad to know.

  2. It will be Conv2D -> BN -> Add -> ReLU -> [FakeQuant] -> Mul. So the converter will fuse the BN + Add + ReLU into the Conv, and the Mul will remain separate since it's after the ReLU.

nutsiepully commented 4 years ago

Seems like this bug is solved. I'm closing it, please feel free to reopen.

You can start a new issue for the hardswish and we can continue our conversation there. Even if it's done in your code, it can remain an example for others to follow.

And it'll be pretty easy to incorporate into the library once you've implemented it. We can try and get HardSwish moved into Keras.

Thanks @kalaluthien for your patience and proactive use of the library.

yfthu commented 4 years ago

Hi, I have met the same problems as in this issue, but I still could not solve them. My environment is tf-nightly 2.3.0.dev20200522 and Python 3.8.

  1. x = layers.ReLU(6.0)(x + 3) * (1 / 6): when I want to quantize the ops '+' and '*', it fails and shows this: RuntimeError: Layer tf_op_layer_AddV2:<class 'tensorflow.python.keras.engine.base_layer.TensorFlowOpLayer'> is not supported. You can quantize this layer by passing a tfmot.quantization.keras.QuantizeConfig instance to the quantize_annotate_layer API.

If I provide a QuantizeConfig, maybe I could solve it. However, I have read the guide at https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide and I still don't know how to write a QuantizeConfig for tf.add and tf.multiply. Could you tell me how to write one for these ops, or give me some more examples?

Thank you very much!

takakihagen commented 4 years ago

Hello, is there any update? I'm trying to solve this problem too but haven't managed to yet.

Moreover, I'm very confused because hard-swish is defined here as x = tf.keras.layers.ReLU(6.0)(x + 3.0) * 1.67. Actually, it is defined as x = x * tf.keras.layers.ReLU(6.0)(x + 3.0) * 1.67 in this paper. Therefore, I think the converter does not fuse the Add into the bias as @nutsiepully stated. Please see the code and pictures below.

inp = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(32, 5)(inp)
x = tf.keras.layers.BatchNormalization()(x)
x = x * tf.keras.layers.ReLU(6.0)(x + 3.0) * 1.67
m = tf.keras.Model(inp, x)

m.save('hswish.h5')

import numpy as np

conv = tf.lite.TFLiteConverter.from_keras_model(m)
conv.optimizations = [tf.lite.Optimize.DEFAULT]  # needed for post-training quantization with the representative dataset

num_calibration_steps = 100

def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    # one sample per step: a list with one batched float32 array per model input
    yield [np.random.rand(1, 28, 28, 1).astype(np.float32)]

conv.representative_dataset = representative_dataset_gen
conv.convert()

# saved as hswish.tflite

[Images: the original (float) graph and the quantized graph]

Am I missing something?

FatumR commented 3 years ago

@yfthu You can replace tf.add and tf.multiply with tf.keras.layers.Add and tf.keras.layers.Multiply. But you still need a config for tf.keras.layers.Multiply; the config below worked for me. Please note that I did int8 quantization; for some cases you may need to keep int32 there.

class MultQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    # Configure how to quantize weights.
    def get_weights_and_quantizers(self, layer):
      return []

    # Configure how to quantize activations.
    def get_activations_and_quantizers(self, layer):
      return []

    def set_quantize_weights(self, layer, quantize_weights):
      pass

    def set_quantize_activations(self, layer, quantize_activations):
      pass

    # Configure how to quantize outputs (may be equivalent to activations).
    def get_output_quantizers(self, layer):
      return [
          tfmot.quantization.keras.quantizers.MovingAverageQuantizer(
              num_bits=8, symmetric=False, narrow_range=False, per_axis=False)]

    def get_config(self):
      return {}

Basically, tf.keras.layers.Multiply is quite simple: it has no weights or activations, so you don't need to add any quantization for those, but the output may differ depending on the inputs, so that is the only place where quantization should be added. You can probably omit the output quantization too (just return an empty list) if the inputs are already quantized; in that case, I think, the multiplication should happen in the precision of the inputs (usually int8 or int32).

Also, please note that I'm using MovingAverageQuantizer, which adjusts min/max based on a moving average; you may replace it with AllValuesQuantizer and see how it affects precision in your case.
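
For completeness, a minimal sketch of how such a config can be wired into a model (layer shapes and names here are only illustrative):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

quant = tfmot.quantization.keras

inputs = tf.keras.Input(shape=(28, 28, 16))
shortcut = inputs
x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
x = quant.quantize_annotate_layer(
    tf.keras.layers.Multiply(), MultQuantizeConfig())([x, shortcut])
annotated_model = quant.quantize_annotate_model(tf.keras.Model(inputs, x))

# the custom config class must be in scope when the annotated model is transformed
with quant.quantize_scope({'MultQuantizeConfig': MultQuantizeConfig}):
    quant_model = quant.quantize_apply(annotated_model)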