intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Nano: Model.quantize does not calculate accuracy correctly #5305

Open · y199387 opened this issue 2 years ago

y199387 commented 2 years ago

The tuning result reported by Model.quantize is inconsistent with the actual accuracy when metric=tf.keras.metrics.SparseCategoricalAccuracy() is set. Code:

import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
import numpy as np
from bigdl.nano.tf.keras import Model

model = MobileNetV2(weights=None, input_shape=(40, 40, 3), classes=10)
model = Model(inputs=model.inputs, outputs=model.outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],)

train_examples = np.random.random((100, 40, 40, 3))
train_labels = np.random.randint(0, 10, size=(100,))
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels)).batch(8)

model.evaluate(train_dataset)

# Accuracy-driven post-training quantization: the tuner evaluates both the FP32
# baseline and each INT8 candidate with the metric passed here.
q_model = model.quantize(calib_dataset=train_dataset,
                         metric=tf.keras.metrics.SparseCategoricalAccuracy(),
                         tuning_strategy='basic',
                         accuracy_criterion={'relative': 0.99,
                                             'higher_is_better': True})

# Manually compute the FP32 model's accuracy with the same metric for comparison
m = tf.keras.metrics.SparseCategoricalAccuracy()
for img, label in train_dataset:
    m.update_state(label, model(img))
print('#' * 100)
print("Accuracy: {}".format(m.result().numpy()))

output:

2022-08-04 00:46:04.003972: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
13/13 [==============================] - 2s 24ms/step - loss: 2.3026 - sparse_categorical_accuracy: 0.0800

...

2022-08-04 00:46:30 [INFO] Start to evaluate the TensorFlow model.
2022-08-04 00:46:30 [INFO] Model inference elapsed time: 678.92 ms
2022-08-04 00:46:30 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.0000|0.0000, Duration (seconds) (int8|fp32): 0.6791|0.5165], Best tune result is: [Accuracy: 0.0000, Duration (seconds): 0.6791]
2022-08-04 00:46:30 [INFO] |**********************Tune Result Statistics**********************|
2022-08-04 00:46:30 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:46:30 [INFO] |     Info Type      | Baseline | Tune 1 result | Best tune result |
2022-08-04 00:46:30 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:46:30 [INFO] |      Accuracy      | 0.0000   |    0.0000     |     0.0000       |
2022-08-04 00:46:30 [INFO] | Duration (seconds) | 0.5165   |    0.6791     |     0.6791       |
2022-08-04 00:46:30 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:46:30 [INFO] Save tuning history to /home/projects/BigDL/nc_workspace/2022-08-04_00-46-14/./history.snapshot.
2022-08-04 00:46:30 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2022-08-04 00:46:30 [INFO] Save deploy yaml to /home/projects/BigDL/nc_workspace/2022-08-04_00-46-14/deploy.yaml
####################################################################################################
Accuracy: 0.07999999821186066

The accuracy is computed correctly with other accuracy metrics (e.g. CategoricalAccuracy): in the run below the tuner's reported baseline (0.0900) matches the actual accuracy, whereas above it reported 0.0000 against an actual accuracy of 0.0800.

CategoricalAccuracy:

import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
from tensorflow.keras.utils import to_categorical
import numpy as np
from bigdl.nano.tf.keras import Model

model = MobileNetV2(weights=None, input_shape=(40, 40, 3), classes=10)
model = Model(inputs=model.inputs, outputs=model.outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=[tf.keras.metrics.CategoricalAccuracy()],)

train_examples = np.random.random((100, 40, 40, 3))
train_labels = np.random.randint(0, 10, size=(100,))
train_labels = to_categorical(train_labels, num_classes=10)
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels)).batch(8)

model.evaluate(train_dataset)

q_model = model.quantize(calib_dataset=train_dataset,
                         metric=tf.keras.metrics.CategoricalAccuracy(),
                         tuning_strategy='basic',
                         accuracy_criterion={'relative': 0.99,
                                             'higher_is_better': True})

# Manually compute the FP32 accuracy again for comparison
m = tf.keras.metrics.CategoricalAccuracy()
for img, label in train_dataset:
    m.update_state(label, model(img))
print('#' * 100)
print("Accuracy: {}".format(m.result().numpy()))

output:

2022-08-04 00:50:28.981507: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
13/13 [==============================] - 2s 19ms/step - loss: 2.3026 - categorical_accuracy: 0.0900

...

2022-08-04 00:50:55 [INFO] Start to evaluate the TensorFlow model.
2022-08-04 00:50:56 [INFO] Model inference elapsed time: 653.21 ms
2022-08-04 00:50:56 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.0900|0.0900, Duration (seconds) (int8|fp32): 0.6534|0.4852], Best tune result is: [Accuracy: 0.0900, Duration (seconds): 0.6534]
2022-08-04 00:50:56 [INFO] |**********************Tune Result Statistics**********************|
2022-08-04 00:50:56 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:50:56 [INFO] |     Info Type      | Baseline | Tune 1 result | Best tune result |
2022-08-04 00:50:56 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:50:56 [INFO] |      Accuracy      | 0.0900   |    0.0900     |     0.0900       |
2022-08-04 00:50:56 [INFO] | Duration (seconds) | 0.4852   |    0.6534     |     0.6534       |
2022-08-04 00:50:56 [INFO] +--------------------+----------+---------------+------------------+
2022-08-04 00:50:56 [INFO] Save tuning history to /home/projects/BigDL/nc_workspace/2022-08-04_00-50-40/./history.snapshot.
2022-08-04 00:50:56 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2022-08-04 00:50:56 [INFO] Save deploy yaml to /home/projects/BigDL/nc_workspace/2022-08-04_00-50-40/deploy.yaml
####################################################################################################
Accuracy: 0.09000000357627869
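
As an extra check (a sketch, not part of the original report): the same manual loop can be run against the returned quantized model and compared with the tuner's table. This assumes q_model can be called on a batch the way the original Keras model can.

# Assumption: q_model accepts a batch like the original Keras model does.
m = tf.keras.metrics.CategoricalAccuracy()
for img, label in train_dataset:
    m.update_state(label, q_model(img))
print("Quantized accuracy (manual): {}".format(m.result().numpy()))
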
yangw1234 commented 2 years ago

@zhentaocc could you help take a look at this issue?

zhentaocc commented 2 years ago

The predictions should be passed after the labels: tf.keras metrics expect update_state(y_true, y_pred), i.e. labels first and predictions second, so calling the metric with the arguments swapped produces a wrong result.
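
For context (not from the thread): SparseCategoricalAccuracy is not symmetric in its arguments, while CategoricalAccuracy effectively is, since it compares argmax of both sides. A minimal sketch of why a swapped call zeroes out the sparse metric yet leaves the categorical one looking correct, which would explain both runs above:

import tensorflow as tf

labels = tf.constant([1, 0, 2])           # integer labels, shape (3,)
onehot = tf.one_hot(labels, depth=3)      # one-hot labels, shape (3, 3)
preds = tf.constant([[0.1, 0.8, 0.1],
                     [0.7, 0.2, 0.1],
                     [0.2, 0.2, 0.6]])    # argmax matches every label

m = tf.keras.metrics.SparseCategoricalAccuracy()
m.update_state(labels, preds)             # correct order: update_state(y_true, y_pred)
print(m.result().numpy())                 # 1.0

m = tf.keras.metrics.SparseCategoricalAccuracy()
m.update_state(preds, labels)             # swapped order: no error, but the result is
print(m.result().numpy())                 # meaningless (0.0 here), like the tuner's 0.0000

m = tf.keras.metrics.CategoricalAccuracy()
m.update_state(preds, onehot)             # swapped order, yet argmax-vs-argmax is
print(m.result().numpy())                 # symmetric, so it still prints 1.0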