intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Nano: Model.quantize does not calculate accuracy correctly #5305

Open · y199387 opened this issue 2 years ago

y199387 commented 2 years ago

The tuning result reported by Model.quantize is inconsistent with the actual accuracy when metric=tf.keras.metrics.SparseCategoricalAccuracy() is set. Code:

import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
import numpy as np
from bigdl.nano.tf.keras import Model

model = MobileNetV2(weights=None, input_shape=(40, 40, 3), classes=10)
model = Model(inputs=model.inputs, outputs=model.outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],)

train_examples = np.random.random((100, 40, 40, 3))
train_labels = np.random.randint(0, 10, size=(100,))
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels)).batch(8)

model.evaluate(train_dataset)

# Accuracy-driven post-training quantization: the tuner evaluates both the FP32
# baseline and each INT8 candidate with the metric passed here.
q_model = model.quantize(calib_dataset=train_dataset,
                         metric=tf.keras.metrics.SparseCategoricalAccuracy(),
                         tuning_strategy='basic',
                         accuracy_criterion={'relative': 0.99,
                                             'higher_is_better': True})

# Manually compute the FP32 model's accuracy with the same metric for comparison
m = tf.keras.metrics.SparseCategoricalAccuracy()
for img, label in train_dataset:
    m.update_state(label, model(img))
print('#' * 100)
print("Accuracy: {}".format(m.result().numpy()))

output:

2022-08-04 00:46:04.003972: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
13/13 [==============================] - 2s 24ms/step - loss: 2.3026 - sparse_categorical_accuracy: 0.0800

...

2022-08-04 00:46:30 [INFO] Start to evaluate the TensorFlow model.
2022-08-04 00:46:30 [INFO] Model inference elapsed time: 678.92 ms
2022-08-04 00:46:30 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.0000|0.0000, Duration (seconds) (int8|fp32): 0.6791|0.5165], Best tune result is: [Accuracy: 0.0000, Duration (seconds): 0.6791]
2022-08-04 00:46:30 [INFO] |**********************Tune Result Statistics**********************|
2022-08-04 00:46:30 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:46:30 [INFO] |     Info Type      | Baseline | Tune 1 result | Best tune result |
2022-08-04 00:46:30 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:46:30 [INFO] |      Accuracy      | 0.0000   |    0.0000     |     0.0000       |
2022-08-04 00:46:30 [INFO] | Duration (seconds) | 0.5165   |    0.6791     |     0.6791       |
2022-08-04 00:46:30 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:46:30 [INFO] Save tuning history to /home/projects/BigDL/nc_workspace/2022-08-04_00-46-14/./history.snapshot.
2022-08-04 00:46:30 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2022-08-04 00:46:30 [INFO] Save deploy yaml to /home/projects/BigDL/nc_workspace/2022-08-04_00-46-14/deploy.yaml
####################################################################################################
Accuracy: 0.07999999821186066

The accuracy is computed correctly with other accuracy metrics (e.g. CategoricalAccuracy): in the run below the tuner's reported baseline (0.0900) matches the actual accuracy, whereas above it reported 0.0000 against an actual accuracy of 0.0800.

CategoricalAccuracy:

import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras.utils import to_categorical
import numpy as np
from bigdl.nano.tf.keras import Model

model = MobileNetV2(weights=None, input_shape=(40, 40, 3), classes=10)
model = Model(inputs=model.inputs, outputs=model.outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=[tf.keras.metrics.CategoricalAccuracy()],)

train_examples = np.random.random((100, 40, 40, 3))
train_labels = np.random.randint(0, 10, size=(100,))
train_labels = to_categorical(train_labels, num_classes=10)
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels)).batch(8)

model.evaluate(train_dataset)

q_model = model.quantize(calib_dataset=train_dataset,
                         metric=tf.keras.metrics.CategoricalAccuracy(),
                         tuning_strategy='basic',
                         accuracy_criterion={'relative': 0.99,
                                             'higher_is_better': True})

# Manually compute the FP32 accuracy again for comparison
m = tf.keras.metrics.CategoricalAccuracy()
for img, label in train_dataset:
    m.update_state(label, model(img))
print('#' * 100)
print("Accuracy: {}".format(m.result().numpy()))

output:

2022-08-04 00:50:28.981507: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
13/13 [==============================] - 2s 19ms/step - loss: 2.3026 - categorical_accuracy: 0.0900

...

2022-08-04 00:50:55 [INFO] Start to evaluate the TensorFlow model.
2022-08-04 00:50:56 [INFO] Model inference elapsed time: 653.21 ms
2022-08-04 00:50:56 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.0900|0.0900, Duration (seconds) (int8|fp32): 0.6534|0.4852], Best tune result is: [Accuracy: 0.0900, Duration (seconds): 0.6534]
2022-08-04 00:50:56 [INFO] |**********************Tune Result Statistics**********************|
2022-08-04 00:50:56 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:50:56 [INFO] |     Info Type      | Baseline | Tune 1 result | Best tune result |
2022-08-04 00:50:56 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:50:56 [INFO] |      Accuracy      | 0.0900   |    0.0900     |     0.0900       |
2022-08-04 00:50:56 [INFO] | Duration (seconds) | 0.4852   |    0.6534     |     0.6534       |
2022-08-04 00:50:56 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:50:56 [INFO] Save tuning history to /home/projects/BigDL/nc_workspace/2022-08-04_00-50-40/./history.snapshot.
2022-08-04 00:50:56 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2022-08-04 00:50:56 [INFO] Save deploy yaml to /home/projects/BigDL/nc_workspace/2022-08-04_00-50-40/deploy.yaml
####################################################################################################
Accuracy: 0.09000000357627869
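
As an extra check (a sketch, not part of the original report): the same manual loop can be run against the returned quantized model and compared with the tuner's table. This assumes q_model can be called on a batch the way the original Keras model can.

# Assumption: q_model accepts a batch like the original Keras model does.
m = tf.keras.metrics.CategoricalAccuracy()
for img, label in train_dataset:
    m.update_state(label, q_model(img))
print("Quantized accuracy (manual): {}".format(m.result().numpy()))
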
yangw1234 commented 2 years ago

@zhentaocc could you help take a look at this issue?

zhentaocc commented 2 years ago

The predictions should be passed after the labels: tf.keras metrics expect update_state(y_true, y_pred), i.e. labels first and predictions second, so calling the metric with the arguments swapped produces a wrong result.
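
For context (not from the thread): SparseCategoricalAccuracy is not symmetric in its arguments, while CategoricalAccuracy effectively is, since it compares argmax of both sides. A minimal sketch of why a swapped call zeroes out the sparse metric yet leaves the categorical one looking correct, which would explain both runs above:

import tensorflow as tf

labels = tf.constant([1, 0, 2])           # integer labels, shape (3,)
onehot = tf.one_hot(labels, depth=3)      # one-hot labels, shape (3, 3)
preds = tf.constant([[0.1, 0.8, 0.1],
                     [0.7, 0.2, 0.1],
                     [0.2, 0.2, 0.6]])    # argmax matches every label

m = tf.keras.metrics.SparseCategoricalAccuracy()
m.update_state(labels, preds)             # correct order: update_state(y_true, y_pred)
print(m.result().numpy())                 # 1.0

m = tf.keras.metrics.SparseCategoricalAccuracy()
m.update_state(preds, labels)             # swapped order: no error, but the result is
print(m.result().numpy())                 # meaningless (0.0 here), like the tuner's 0.0000

m = tf.keras.metrics.CategoricalAccuracy()
m.update_state(preds, onehot)             # swapped order, yet argmax-vs-argmax is
print(m.result().numpy())                 # symmetric, so it still prints 1.0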