tensorflow / model-optimization

A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
https://www.tensorflow.org/model_optimization
Apache License 2.0

Low accuracy of TF-Lite model for Mobilenet (Quantization aware training) #368

Closed: NobuoTsukamoto closed this issue 4 years ago

NobuoTsukamoto commented 4 years ago

Describe the bug
The accuracy of the TF-Lite model becomes extremely low after quantization aware training of tf.keras.applications.mobilenet (v1/v2).

System information

TensorFlow installed from (source or binary): binary

TensorFlow version: tf-nightly-gpu (2.2.0.dev20200420)

TensorFlow Model Optimization version: 0.3.0

Python version: 3.6.9

Describe the expected behavior
The accuracy of the Keras model (with quantization aware training) and the TF-Lite model should be almost the same (evaluated as in "Image classification with tools").

Describe the current behavior

If the model is defined as follows, the accuracy of the Keras model and the TF-Lite model is almost the same.

  # extract image features by convolution and max pooling layers
  inputs = tf.keras.Input(shape = (IMG_SIZE, IMG_SIZE, 3))
  x = tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu")(inputs)
  x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
  x = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")(x)
  x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
  # classify the class by fully-connected layers
  x = tf.keras.layers.Flatten()(x)
  x = tf.keras.layers.Dense(512, activation="relu")(x)
  x = tf.keras.layers.Dense(info.features['label'].num_classes)(x)
  x = tf.keras.layers.Activation("softmax")(x)
  model_functional = tf.keras.Model(inputs=inputs, outputs=x)

Code to reproduce the issue (Google Colab notebook) https://gist.github.com/NobuoTsukamoto/b42128104531a7612e5c85e246cb2dac
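
For reference, a minimal sketch of the QAT-plus-conversion flow being discussed (my own simplification, not the notebook code; IMG_SIZE, NUM_CLASSES, train and validation are assumed placeholders from the dataset setup):

  import tensorflow as tf
  import tensorflow_model_optimization as tfmot

  # Float baseline model (IMG_SIZE / NUM_CLASSES are assumed placeholders).
  base_model = tf.keras.applications.MobileNetV2(
      input_shape=(IMG_SIZE, IMG_SIZE, 3), weights=None, classes=NUM_CLASSES)

  # Wrap the whole model for quantization aware training.
  q_aware_model = tfmot.quantization.keras.quantize_model(base_model)
  q_aware_model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
  q_aware_model.fit(train, validation_data=validation, epochs=10)

  # Convert the QAT model to a quantized TF-Lite model.
  converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]
  tflite_model = converter.convert()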


alanchiao commented 4 years ago

I skimmed through your colab. Could you try one thing I didn't see?

If you try taking your "Keras model without quantization aware training" (0.99), converting it to TFLite, and then evaluating it in a manner similar to how you got the 0.20% accuracy number, could you see what you get?

NobuoTsukamoto commented 4 years ago

I updated the colab notebook. https://gist.github.com/NobuoTsukamoto/b42128104531a7612e5c85e246cb2dac

If you try taking your "Keras model without quantization aware training" (0.99), converting it to TFLite, and then evaluating it in a manner similar to how you got the 0.20% accuracy number, could you see what you get?

kmkolasinski commented 4 years ago

Can you try training your q_aware model for much longer, e.g.:

q_aware_history = q_aware_model.fit(train.repeat(),
                                    initial_epoch=10,
                                    epochs=200,
                                    steps_per_epoch=500,
                                    validation_data=validation.repeat(),
                                    validation_steps=validation_steps)

There are running exponential moving averages in the quantized layers which may need time to converge.

kmkolasinski commented 4 years ago

You can take a look at this issue: https://github.com/tensorflow/model-optimization/issues/309 TL;DR: I had a similar problem, but when I trained the quantization aware model for longer, the gap between the Keras and TFLite models decreased.

NobuoTsukamoto commented 4 years ago

@kmkolasinski Thanks for your information.

I tried two patterns (training with QAT).

  1. epochs=50 Keras model (QAT): 0.99 , TF-Lite integer quant model (QAT): 0.55
  2. epochs=100 Keras model (QAT): 1.00 , TF-Lite integer quant model (QAT): 1.00

This takes quite a long time and a large number of epochs. Also, it is not possible to confirm the gap between the Keras model and the TF-Lite model from the accuracy and loss metrics.

How can I tell that the gap has disappeared during training? Also, can I estimate how many epochs to set? (According to #309, it didn't seem possible ...)

alanchiao commented 4 years ago

@NobuoTsukamoto, @krzys-ostrowski: this is good feedback.

Just from the analysis, there are some things we could possibly do:

1) Have TensorBoard log the exponential averages so you can see them converge, through a new callback for QAT (a rough sketch is given after this list).

and then, with regard to how long it takes:

2) More intelligently initialize the exponential averages which track the min/max values of weights/activations to reflect values that are fixed for the activations (e.g. 0 as the minimum for ReLU).

3) Have the exponential averages change more quickly at the start of training ("zero_debias") from their initialized values, or modify ema_decay; I'm not sure how well this would work across models.
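
A rough sketch of what such a logging callback could look like (my own illustration; the assumption that the QAT wrappers expose range variables whose names end in "min"/"max" is not a confirmed API):

  import tensorflow as tf

  class QuantizerRangeLogger(tf.keras.callbacks.Callback):
      # Hypothetical callback: write quantizer min/max variables to TensorBoard
      # so their convergence can be watched during QAT.
      def __init__(self, log_dir):
          super().__init__()
          self.writer = tf.summary.create_file_writer(log_dir)

      def on_epoch_end(self, epoch, logs=None):
          with self.writer.as_default():
              for var in self.model.variables:
                  # Assumption: QAT range variables have names like '.../kernel_min:0'.
                  name = var.name.rsplit(":", 1)[0]
                  if name.endswith("min") or name.endswith("max"):
                      tf.summary.scalar(name, tf.reduce_mean(var), step=epoch)
          self.writer.flush()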

kmkolasinski commented 4 years ago

Indeed, having a native callback for EMA monitoring would be a nice feature.

Additionally, since the EMA decay in the moving average quantizer is set to beta=0.999, we need approximately 1000 steps to 'forget' the initial state. Here is a table showing how many steps are needed to 'forget' the initial state of the quantizer min/max values: [table attached as an image]

Probably, setting the default EMA decay to 0.995 would be a better choice for users with simpler problems.
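
Since the table above was attached as an image, here is a rough back-of-the-envelope version of the same relationship (my own sketch, not the original numbers): the initial state's weight decays to 1/e after roughly 1 / (1 - beta) steps.

  import math

  # Steps for the EMA's initial state to decay to a weight of 1/e,
  # n = ln(1/e) / ln(beta), which is approximately 1 / (1 - beta).
  for beta in (0.999, 0.995, 0.99):
      steps = math.log(math.exp(-1)) / math.log(beta)
      print(f"beta={beta}: ~{steps:.0f} steps to largely forget the initial min/max")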

One can also monitor the gap between the Keras model and the TFLite model during training via a custom callback. For example, I use model output statistics as a proxy for measuring the gap. Here is how it looks in my case (source):

INFO:tensorflow:Measured deviation between keras and tflite model:
INFO:tensorflow:
 - export/objectness/output  
    MAE     =  0.000759 
    RMSE    =  0.003540 
    Keras   = N(μ=  0.030039, σ=  0.143009)
    tflite  = N(μ=  0.030201, σ=  0.143635)
 - export/box_shape/output   
    MAE     =  0.001983 
    RMSE    =  0.003444 
    Keras   = N(μ=  0.413126, σ=  0.267102)
    tflite  = N(μ=  0.413075, σ=  0.266314)
 - export/classes/output     
    MAE     =  0.000494 
    RMSE    =  0.011030 
    Keras   = N(μ=  0.000562, σ=  0.012718)
    tflite  = N(μ=  0.000565, σ=  0.013538)

The problem with this approach is that predictions through the TFLite model can be very slow on non-ARM architectures, so this type of test should be run in the background in order not to block the training loop.
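
For illustration, a minimal sketch of measuring this kind of gap on one batch (my own simplification of the idea, not the linked source; it assumes a single-input, single-output model with float TFLite inputs and outputs):

  import numpy as np
  import tensorflow as tf

  def keras_tflite_gap(keras_model, tflite_model_bytes, batch):
      # Run the same batch through Keras and the TFLite interpreter
      # and report MAE / RMSE between the two outputs.
      interpreter = tf.lite.Interpreter(model_content=tflite_model_bytes)
      interpreter.allocate_tensors()
      inp = interpreter.get_input_details()[0]
      out = interpreter.get_output_details()[0]

      keras_out = keras_model(batch, training=False).numpy()

      tflite_out = []
      for example in batch:
          interpreter.set_tensor(inp["index"], example[None].astype(inp["dtype"]))
          interpreter.invoke()
          tflite_out.append(interpreter.get_tensor(out["index"])[0])
      tflite_out = np.stack(tflite_out)

      diff = keras_out - tflite_out
      return {"MAE": float(np.mean(np.abs(diff))),
              "RMSE": float(np.sqrt(np.mean(diff ** 2)))}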

NobuoTsukamoto commented 4 years ago

Have Tensorboard log the exponential averages so you can see them converge, through a new callback for QAT.

It would be nice if the convergence could be seen in the TensorBoard log. Like "pruning_callbacks", if the "new callback for QAT" were a keras.callbacks callback, I think it would be very easy to use.

nutsiepully commented 4 years ago

I think there is likely some confusion here. An exponential moving average is used during QAT to calculate the ranges of dynamic tensors. Since the initial cold-start range is [-6, 6], it can lead to a huge accuracy drop at the beginning of QAT. Say a tensor only has values in [-0.1, 0.1]; then most of the range is wasted, which can lead to huge losses.

As training goes on, this range slowly converges to the actual range (as @kmkolasinski mentioned, roughly 1000 steps), and the QAT accuracy goes up.

However, when converting to TFLite, the same ranges that were used in QAT are carried over, so the TF and TFLite accuracy and output values should be very close. QAT tries to emulate TFLite as closely as possible, and there shouldn't be such divergences.
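
As a rough back-of-the-envelope illustration of that wasted range (my own numbers, not from the library):

  # 8-bit quantization over [-6, 6]: step size = 12 / 255, roughly 0.047,
  # so a tensor living in [-0.1, 0.1] uses only about 5 of the 256 levels
  # until the EMA range tightens around the real values.
  step = 12.0 / 255
  levels_used = int(0.2 / step) + 1
  print(step, levels_used)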

We don't see it in our local tests either. For example, if you run quantize_functional_test, you'll see that the results for TF QAT and TFLite are the same.

There can be some subtle differences. We don't place FakeQuants after Softmax, for instance, since it hinders convergence. There's a possibility that's what is happening, but I can't be sure of it. I'm trying to recreate the issue.

kmkolasinski commented 4 years ago

There is a chance that I'm doing something wrong; however, it seems I'm not the only one with this issue. You could check a much bigger model than the one used in quantize_functional_test. I have encountered this issue with MobileNetV2. When models get bigger, the errors between emulated quantization and the real one accumulate.

nutsiepully commented 4 years ago

We've found the issue. One of the quantized kernel activation ranges had a problem, but it was hidden once the range had converged.

We'll have a fix out soon. tf-nightly should have it.

nutsiepully commented 4 years ago

Thanks a lot for reporting and helping reproduce this issue. It would've been really hard to narrow down without the reproduction code.

sayakpaul commented 4 years ago

@nutsiepully could you mention if there's any specific version of TensorFlow that will have the fix? Or should pip install tf-nightly do it?

kmkolasinski commented 4 years ago

Cool, thanks for the feedback @nutsiepully! I will check it today. Out of curiosity, was it a general issue, or something related to MobileNet models or a specific layer?

@sayakpaul Yes, you can also use pip install tf-nightly --upgrade, but you need to uninstall regular TF first.

nutsiepully commented 4 years ago

@sayakpaul - tf-nightly should do it. The next version release will have it.

@kmkolasinski - I'll point out the commit here once it's in so you can see it. It was a general issue with the DepthConv kernel implementation, which got triggered when ranges hadn't converged.

kmkolasinski commented 4 years ago

Thanks, that makes sense to me. A few weeks ago I switched to a custom ResNet model, which does not have DepthConvs, and I got better results.

sayakpaul commented 4 years ago

Thanks for letting me know. I will check and report back.

sayakpaul commented 4 years ago

@nutsiepully I can definitely see the improvement, and this Colab Gist reproduces it.

Additionally, I worked on this report to make the onboarding process for quantization a bit easier for folks. It incorporates many of your suggestions as well. Happy to address any feedback.

Thank you so much for all your help :)

nutsiepully commented 4 years ago

Thanks a lot @sayakpaul. Really appreciate the feedback and the effort.

Thanks @kmkolasinski and @NobuoTsukamoto for the detailed bug reports and feedback. I'm closing the bug. Please reopen if you face any further issues.

@sayakpaul, the report is awesome! Great work, this explains the value of the tooling really well.

tarushbansal commented 8 months ago

Hi. I'm facing the same issue with MobileNetV3, where I see a large drop in accuracy in the TFLite model compared to the QAT Keras model. I'm using TensorFlow version 2.15.0 and TensorFlow Model Optimization version 0.7.5. I had to refactor MobileNetV3 a little to make it compatible with QAT, by using OnlyOutputQuantizeConfig for the Multiply layers (with a moving average quantizer) and replacing the Add operations in Hard Sigmoid with Rescaling, but I don't think that should be the cause of this issue. Would appreciate any help. Thanks!
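
For context, an output-only QuantizeConfig along these lines might look roughly like this (my own sketch built on the public tfmot QuantizeConfig interface; the class name mirrors the one mentioned above, and the quantizer parameters are assumptions):

  import tensorflow_model_optimization as tfmot

  class OnlyOutputQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
      # Quantize only the layer output; leave weights and activations untouched.
      def get_weights_and_quantizers(self, layer):
          return []

      def get_activations_and_quantizers(self, layer):
          return []

      def set_quantize_weights(self, layer, quantize_weights):
          pass

      def set_quantize_activations(self, layer, quantize_activations):
          pass

      def get_output_quantizers(self, layer):
          return [tfmot.quantization.keras.quantizers.MovingAverageQuantizer(
              num_bits=8, per_axis=False, symmetric=False, narrow_range=False)]

      def get_config(self):
          return {}

To apply it, the layer would be wrapped with tfmot.quantization.keras.quantize_annotate_layer(layer, quantize_config=OnlyOutputQuantizeConfig()) and the model deserialized under quantize_scope, as described in the tfmot comprehensive guide.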

KBOUSTM commented 1 month ago

Hello @tarushbansal, I have the same issue. I'm using the latest version of tfmot (0.8.0); I had to make MobileNetV3 QAT-friendly, and then I got a huge gap between the QAT model accuracy and the TFLite model accuracy. Did you find any solution to this issue? Thanks!