openvinotoolkit / nncf

Neural Network Compression Framework for enhanced OpenVINO™ inference
Apache License 2.0

Very inconsistent quantization results on Tensorflow #2743

Closed apoorvu-sharechat closed 2 months ago

apoorvu-sharechat commented 2 months ago

🐛 Describe the bug

I'm trying to use NNCF to quantize a recommender-system model to INT8. Before using it on our production model, I wanted to get it working on a simple toy example first, but I'm seeing some issues.

I tried to perform exactly the same kind of quantization in two different ways, and they gave very different results.

I've repeated this experiment multiple times, but the results don't change. Can someone please help me debug why this is happening? The production model we want to quantize is structured like approach 2, so we are blocked from using NNCF until we can resolve this.

Thanks!

Environment

pip freeze output:

about-time==4.2.1
absl-py==1.2.0
alive-progress==3.1.5
apache-beam==2.39.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
array-record==0.4.0
astunparse==1.6.3
attrs==20.3.0
autograd==1.6.2
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==4.2.4
certifi==2022.9.14
cffi==1.15.1
charset-normalizer==2.1.1
click==7.1.2
cloudpickle==2.2.0
cma==3.2.2
contourpy==1.1.1
crcmod==1.7
cycler==0.12.1
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.13
dill==0.3.1.1
distro-info==1.0
dm-tree==0.1.8
docker==4.4.4
docopt==0.6.2
docstring-parser==0.15
entrypoints==0.4
etils==1.3.0
fastavro==1.6.1
fasteners==0.18
fastjsonschema==2.16.1
fire==0.4.0
flatbuffers==1.12
fonttools==4.53.0
freezegun==1.2.2
future==1.0.0
gast==0.4.0
google-api-core==1.33.1
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==1.35.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-cloud-aiplatform==1.15.0
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.13.2
google-cloud-bigtable==1.7.2
google-cloud-core==1.7.3
google-cloud-datastore==1.15.5
google-cloud-dlp==3.7.1
google-cloud-language==1.3.2
google-cloud-pubsub==2.13.1
google-cloud-pubsublite==1.4.2
google-cloud-recommendations-ai==0.2.0
google-cloud-resource-manager==1.5.1
google-cloud-spanner==1.19.3
google-cloud-storage==1.44.0
google-cloud-tpu==1.5.0
google-cloud-videointelligence==1.16.3
google-cloud-vision==1.0.2
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.4
grapheme==0.6.0
grpc-google-iam-v1==0.12.4
grpcio==1.49.0
grpcio-gcp==0.2.2
grpcio-status==1.48.1
h5py==3.7.0
hdfs==2.7.0
httplib2==0.19.1
idna==3.4
importlib-metadata==4.12.0
importlib_resources==6.4.0
iniconfig==1.1.1
ipykernel==6.15.3
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.4.2
jsonschema==3.2.0
jstyleson==0.0.2
jupyter-core==4.11.1
jupyter_client==7.3.5
jupyterlab-pygments==0.2.2
jupyterlab-widgets==1.1.1
keras==2.9.0
Keras-Preprocessing==1.1.2
keras-tuner==1.1.3
kfp==1.8.13
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kiwisolver==1.4.5
kt-legacy==1.0.4
kubernetes==12.0.1
libclang==14.0.6
lxml==4.9.1
Markdown==3.4.1
markdown-it-py==3.0.0
MarkupSafe==2.1.1
matplotlib==3.7.5
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==2.0.4
ml-metadata==1.9.0
ml-pipelines-sdk==1.9.0
mmh3==3.0.0
natsort==8.4.0
nbclient==0.6.8
nbconvert==7.0.0
nbformat==5.5.0
nest-asyncio==1.5.5
networkx==3.1
ninja==1.11.1.1
nncf==2.10.0
notebook==6.4.12
numpy==1.21.6
oauth2client==4.1.3
oauthlib==3.2.1
openvino-telemetry==2024.1.0
opt-einsum==3.3.0
orjson==3.8.0
overrides==6.2.0
packaging==20.9
pandas==1.3.5
pandocfilters==1.5.0
parameterized==0.8.1
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
pillow==10.3.0
pluggy==1.0.0
portpicker==1.5.2
prometheus-client==0.14.1
promise==2.3
prompt-toolkit==3.0.31
proto-plus==1.22.1
protobuf==3.20.2
psutil==5.9.2
ptyprocess==0.7.0
py==1.11.0
pyarrow==5.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.9.2
pydot==1.4.2
pyfarmhash==0.3.2
Pygments==2.13.0
pymongo==3.12.3
pymoo==0.6.1.1
pyparsing==2.4.7
pyrsistent==0.18.1
pytest==7.1.3
python-dateutil==2.8.2
pytz==2022.2.1
PyYAML==5.4.1
pyzmq==24.0.0
requests==2.28.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rich==13.7.1
rsa==4.9
scikit-learn==1.3.2
scipy==1.7.3
Send2Trash==1.8.0
six==1.16.0
soupsieve==2.3.2.post1
strip-hints==0.1.10
tabulate==0.9.0
tensorboard==2.9.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.0
tensorflow-data-validation==1.9.0
tensorflow-datasets==4.9.2
tensorflow-estimator==2.9.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.27.0
tensorflow-metadata==1.9.0
tensorflow-model-analysis==0.40.0
tensorflow-recommenders==0.6.0
tensorflow-serving-api==2.9.0
tensorflow-transform==1.9.0
termcolor==2.0.1
terminado==0.15.0
tfx==1.9.0
tfx-bsl==1.9.0
threadpoolctl==3.5.0
tinycss2==1.1.1
toml==0.10.2
tomli==2.0.1
tornado==6.2
tqdm==4.66.4
traitlets==5.4.0
typer==0.6.1
typing_extensions==4.3.0
uritemplate==3.0.1
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
Werkzeug==2.2.2
widgetsnbextension==3.6.1
wrapt==1.14.1
zipp==3.8.1

OS info:

Linux _ 5.10.0-27-cloud-amd64 #1 SMP Debian 5.10.205-2 (2023-12-31) x86_64 GNU/Linux

Hardware info:

          description: CPU
          product: Intel(R) Xeon(R) CPU @ 2.30GHz
          vendor: Intel Corp.
          physical id: 1001

Minimal Reproducible Example

Approach 1, which gives good results:

# Case 1: Simple model with one Dense layer, built with the Functional API

import numpy as np
import tensorflow as tf

import nncf

# A model that takes a tensor with 2 values, passes it through a Dense layer and returns the result
inputs = tf.keras.Input(shape=(2,))
outputs = tf.keras.layers.Dense(2)(inputs)
model = tf.keras.Model(inputs, outputs)
inputs = tf.constant([[1, 2]])
print('input', inputs)
print('output', model(inputs))

# create nncf dataset for calibration
calibration_dataset = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4], [5, 6]])).batch(1)
print('calibration_dataset', calibration_dataset)

tf_quantized_model = nncf.quantize(model, nncf.Dataset(calibration_dataset))

# test quantized model
print('unquantized output', model(inputs))
print('quantized output', tf_quantized_model(inputs))

Output:

unquantized output tf.Tensor([[1.6074799  0.71321416]], shape=(1, 2), dtype=float32)
quantized output tf.Tensor([[1.6030035 0.7159472]], shape=(1, 2), dtype=float32)

The scores before and after quantization match very well.

Approach 2, which gives very bad results:

# Case 2: Simple model with 1 dense layer, using subclassing API

tf.keras.utils.get_custom_objects().clear()
@tf.keras.utils.register_keras_serializable()
class BasicModel(tf.keras.Model):
    def __init__(self, name, **kwargs):
        super(BasicModel, self).__init__(name=name, **kwargs)
        self.name_ = name
        self.dense = tf.keras.layers.Dense(2)

    def call(self, inputs):
        return self.dense(inputs)

    def get_config(self):
        config = super().get_config()
        config.update({
            'name': self.name_,
        })
        return config

def make_model():
    input_layer = tf.keras.Input(shape=(2,))
    model = BasicModel(name='basic_model')
    output = model(input_layer)
    return tf.keras.Model(input_layer, output)

model = make_model()
inputs = tf.constant([[1, 2]])
print('input', inputs)
print('output', model(inputs))

# create nncf dataset for calibration
calibration_dataset = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4], [5, 6]])).batch(1)
print('calibration_dataset', calibration_dataset)

tf_quantized_model = nncf.quantize(model, nncf.Dataset(calibration_dataset))

print('quantized output', tf_quantized_model(inputs))

Output:

output tf.Tensor([[-1.0133656  -0.18999946]], shape=(1, 2), dtype=float32)
quantized output tf.Tensor([[-1.1608311 -1.3819132]], shape=(1, 2), dtype=float32)

There is a very large difference in the scores.

Repeating the exercise several times gives similar results.

Here's a notebook with the two approaches to make repro easier: https://gist.github.com/apoorvu-sharechat/bc3695ead56a98c86518b3dd0a26b5ad

Are you going to submit a PR?

alexsu52 commented 2 months ago

Hello @apoorvu-sharechat,

Thank you for your interest in NNCF and for your contributions to improving NNCF.

Unfortunately, NNCF does not yet support quantization of custom Keras model classes, only the built-in Keras model classes. This means that in case 2 the basic_model is skipped during quantization. As a workaround, you can rewrite the model using built-in Keras model classes.
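
For illustration, a minimal sketch of such a rewrite of the toy model above, using only built-in Keras classes (this mirrors approach 1; the builder name is illustrative):

import tensorflow as tf

# Build the same toy model with only built-in Keras classes, so NNCF can
# traverse and quantize every layer instead of skipping a custom model class.
def make_builtin_model():
    inputs = tf.keras.Input(shape=(2,))
    outputs = tf.keras.layers.Dense(2)(inputs)
    return tf.keras.Model(inputs, outputs)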

In any case, you have highlighted a bug in the model transformer, and I have prepared a fix: https://github.com/openvinotoolkit/nncf/pull/2750

apoorvu-sharechat commented 2 months ago

Thanks @alexsu52, that's good to know. We could probably rewrite our production model to avoid using a custom model class, but we have some custom layers that we use for embeddings, etc. Will that also be a problem or are models with custom layers at least supported?

alexsu52 commented 2 months ago

May I ask whether you are going to use post-training quantization and run the quantized model via OpenVINO? If so, I suggest you convert the model to OpenVINO and then quantize it (docs: https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-tensorflow.html). This will save you from rewriting the model.
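
A minimal sketch of that flow (the full example later in this thread expands on it; 'model' is the original Keras model, and ov.convert_model also accepts a SavedModel path):

import numpy as np
import openvino as ov
import tensorflow as tf

import nncf

# Convert the TensorFlow model to an OpenVINO model first, then quantize it with NNCF.
ov_model = ov.convert_model(model)  # 'model' is the original Keras model
calibration_dataset = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4], [5, 6]])).batch(1)
ov_quantized_model = nncf.quantize(ov_model, nncf.Dataset(calibration_dataset))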

Will that also be a problem or are models with custom layers at least supported?

Custom layers will be skipped during quantization.

Testing showed that #2750 breaks distributed training, so I closed the PR. The only remaining option is to change the model as follows so that it is loaded correctly from its Keras config:

@tf.keras.utils.register_keras_serializable()
class BasicModel(tf.keras.Model):
    def __init__(self, name, dense_config=None, **kwargs):
        super(BasicModel, self).__init__(name=name, **kwargs)
        self.name_ = name
        self.dense = (
            tf.keras.layers.Dense(2) if dense_config is None else tf.keras.layers.Dense.from_config(dense_config)
        )

    def call(self, inputs):
        return self.dense(inputs)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "name": self.name_,
                "dense_config": self.dense.get_config(),
            }
        )
        return config

def make_model():
    input_layer = tf.keras.Input(shape=(2,))
    model = BasicModel(name="basic_model")
    output = model(input_layer)
    return tf.keras.Model(input_layer, output)

apoorvu-sharechat commented 2 months ago

Thanks @alexsu52. I was hoping to quantize the model (post-training) and continue using TensorFlow Serving to serve it.

On that note, even if I don't use a custom model class, I still see issues: I was trying to benchmark a simple model with a few Dense layers before and after quantization, and the quantized model seems to perform very poorly:

import time

import numpy as np
import tensorflow as tf

import nncf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2000, input_shape=(2,)),
    tf.keras.layers.Dense(1000),
    tf.keras.layers.Dense(1)
])

inputs = tf.constant([[1, 2]])
model(inputs)

calibration_dataset = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4], [5, 6]])).batch(1)
quantized_model = nncf.quantize(model, nncf.Dataset(calibration_dataset))

def run_benchmark(model):
  x = tf.constant([[1, 2]])
  N = 10
  for i in range(N):
    y = model(x)

  N = 100
  times = []
  for i in range(N):
    t1 = time.time()
    y = model(x)
    t2 = time.time()
    times.append(t2-t1)
  p50_time = np.percentile(times, 50) * 1000
  print(f"P50 (ms):", p50_time)

run_benchmark(model)
run_benchmark(quantized_model)

Output:

INFO:nncf:Creating compression algorithm: quantization
P50 (ms): 2.215147018432617
P50 (ms): 36.72218322753906

The quantized model is ~18x slower than the baseline :/ If I look at the layers in the model, I can see fake-quantize layers like this: dense_19/fake_quantize.
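
For reference, a minimal sketch of how those layer names can be listed, assuming the quantized model is still a Keras model exposing .layers:

# Print layer names of the quantized model to spot the inserted fake-quantize layers
# (names such as 'dense_19/fake_quantize').
for layer in quantized_model.layers:
    print(layer.name)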

Do I need to convert the quantized model to OpenVINO IR for the quantization to actually work?

Given that our model serving is built on TensorFlow Serving, we would like to continue using it if possible, as migrating to OVMS would require some effort.

alexsu52 commented 2 months ago

Yes, you need to convert the TensorFlow model to an OpenVINO model and compile the OpenVINO model to see the speed-up:

import time

import numpy as np
import openvino as ov
import tensorflow as tf

import nncf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(2000, input_shape=(2,)), tf.keras.layers.Dense(1000), tf.keras.layers.Dense(1)]
)

inputs = tf.constant([[1, 2]])
model(inputs)

calibration_dataset = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4], [5, 6]])).batch(1)
quantized_model = nncf.quantize(model, nncf.Dataset(calibration_dataset))

def run_benchmark(model):
    x = tf.constant([[1, 2]])
    N = 10
    for i in range(N):
        y = model(x)

    N = 100
    times = []
    for i in range(N):
        t1 = time.time()
        y = model(x)
        t2 = time.time()
        times.append(t2 - t1)
    p50_time = np.percentile(times, 50) * 1000
    print(f"P50 (ms):", p50_time)

run_benchmark(model)
run_benchmark(quantized_model)

ov_model = ov.convert_model(model)
ov_quantized_model = ov.convert_model(quantized_model)

ov_compiled_model = ov.compile_model(ov_model)
ov_compiled_quantized_model = ov.compile_model(ov_quantized_model)

run_benchmark(ov_compiled_model)
run_benchmark(ov_compiled_quantized_model)

Output:

P50 (ms): 1.6291141510009766
P50 (ms): 19.118547439575195
P50 (ms): 0.14448165893554688
P50 (ms): 0.12946128845214844

Quantizing the TensorFlow model makes sense if you plan to use Quantization Aware Training (QAT). Otherwise, it is better to use the OpenVINO model as an input for nncf.quantize():

import os
import time

import openvino as ov

# Disable GPUs if any on server as we want to quantize for CPU inference
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import numpy as np
import tensorflow as tf

import nncf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(2000, input_shape=(2,)), tf.keras.layers.Dense(1000), tf.keras.layers.Dense(1)]
)

inputs = tf.constant([[1, 2]])
model(inputs)

# convert model
ov_model = ov.convert_model(model)

calibration_dataset = tf.data.Dataset.from_tensor_slices(np.array([[1, 2], [3, 4], [5, 6]])).batch(1)
ov_quantized_model = nncf.quantize(ov_model, nncf.Dataset(calibration_dataset))

def run_benchmark(model):
    x = tf.constant([[1, 2]])
    N = 10
    for i in range(N):
        y = model(x)

    N = 100
    times = []
    for i in range(N):
        t1 = time.time()
        y = model(x)
        t2 = time.time()
        times.append(t2 - t1)
    p50_time = np.percentile(times, 50) * 1000
    print(f"P50 (ms):", p50_time)

ov_compiled_model = ov.compile_model(ov_model)
ov_compiled_quantized_model = ov.compile_model(ov_quantized_model)

run_benchmark(model)
run_benchmark(ov_compiled_model)
run_benchmark(ov_compiled_quantized_model)

apoorvu-sharechat commented 2 months ago

Thanks a lot @alexsu52. I'll try converting our model first and then quantizing it.

So there is no way to serve models quantized by NNCF using TensorFlow Serving then?

alexsu52 commented 2 months ago

So there is no way to serve models quantized by NNCF using TensorFlow Serving then?

NNCF does not cover this option. Could you share your motivation? Is this necessary for a smooth transition between TensorFlow Serving and OVMS?

apoorvu-sharechat commented 2 months ago

@alexsu52 there are some features that we use in our model that don't seem to be supported in OpenVINO, e.g., TF Transform: https://www.tensorflow.org/tfx/tutorials/transform/census#export_the_model. At least it's not very clear whether they are supported.

To get the benefits of quantization as soon as possible, I was wondering if it was possible to run the quantized models with TF Serving, and since the model output by nncf.quantize was a tf.Module, it looked like that might be possible.

But IIUC the tf.Module returned by the quantize function is a sort of annotated version of the original module that tells OpenVINO how quantization should be done, and so it can't be executed in TF Serving.

alexsu52 commented 2 months ago

@apoorvu-sharechat, thanks for your answer!

@alexsu52 there are some features that we use in our model that don't seem to be supported in OpenVINO, e.g., TF Transform: https://www.tensorflow.org/tfx/tutorials/transform/census#export_the_model. At least it's not very clear whether they are supported.

Take a look at https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/optimize-preprocessing.html for information on how to optimize the preprocessing step using OpenVINO.
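
A minimal sketch of embedding simple preprocessing (e.g. mean/scale normalization) into the OpenVINO model, assuming the PrePostProcessor API from that page; the concrete values are illustrative:

import openvino as ov
from openvino.preprocess import PrePostProcessor

ov_model = ov.convert_model(model)  # 'model' is the converted TensorFlow/Keras model

# Embed normalization into the OpenVINO model so it runs as part of inference.
ppp = PrePostProcessor(ov_model)
ppp.input().preprocess().mean(0.5).scale(2.0)  # illustrative values
ov_model = ppp.build()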

To get the benefits of quantization as soon as possible, I was wondering if it was possible to run the quantized models with TF Serving, and since the model output by nncf.quantize was a tf.Module, it looked like that might be possible.

But IIUC the tf.Module returned by the quantize function is a sort of annotated version of the original module that tells OpenVINO how quantization should be done, and so it can't be executed in TF Serving.

nncf.quantize() quantizes TensorFlow models using the standard TensorFlow fake-quantize operations (https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_vars) and returns a quantized model that can be run in "fake-quantize" mode in floating-point precision. To get a performance benefit, the quantized model with fake-quantize operations should be converted to OpenVINO or another backend that supports TF fake-quantize operations, such as TensorFlow Lite.

You can try adding the following code to the PTQ TensorFlow example to see the speed-up of the quantized model in TensorFlow Lite:

# Note: this extends the PTQ TensorFlow example, so tf_model, calibration_dataset,
# val_dataset and tqdm come from that example. QuantizationParameters is assumed to be
# importable from nncf.quantization.advanced_parameters (the path may differ by NNCF version).
from nncf.quantization.advanced_parameters import QuantizationParameters

tf_quantized_model = nncf.quantize(
    tf_model,
    calibration_dataset,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        activations_quantization_params=QuantizationParameters(per_channel=False)
    ),
)

converter = tf.lite.TFLiteConverter.from_keras_model(tf_quantized_model)
tflite_model = converter.convert()
tflite_file = "/tmp/quantized_mnist.tflite"
open(tflite_file, "wb").write(tflite_model)

interpreter = tf.lite.Interpreter(model_path=tflite_file)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

metric = tf.keras.metrics.CategoricalAccuracy(name="acc@1")
for images, labels in tqdm(val_dataset):
    interpreter.set_tensor(input_index, tf.cast(images, tf.float32))
    interpreter.invoke()
    pred = interpreter.get_tensor(output_index)
    metric.update_state(labels, pred)

print(metric.result())

Disclaimer: NNCF does not quantize the model taking into account the specifics of the TensorFlow Lite runtime. This may result in suboptimal model inference.

apoorvu-sharechat commented 2 months ago

OK, that's very helpful, thanks @alexsu52! I was able to get the basic model above quantized, converted to OpenVINO IR, and running with good performance and minimal quantization error :)
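
For completeness, a minimal sketch of the kind of flow that produces the IR files, assuming the OpenVINO Python API and reusing the model and calibration_dataset names from the examples above; the file name is illustrative:

import openvino as ov

# Convert the Keras model, quantize with NNCF, then serialize to OpenVINO IR (.xml/.bin).
ov_model = ov.convert_model(model)
ov_quantized_model = nncf.quantize(ov_model, nncf.Dataset(calibration_dataset))
ov.save_model(ov_quantized_model, "quantized_model.xml")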

I ran into some further issues trying to convert our production model to OpenVINO IR even without the preprocessing, but I will raise an issue against the appropriate repo.