intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Get different results while trying to reproduce onnxrt benchmark example #819

Closed · SunCrazy closed this issue 1 year ago

SunCrazy commented 1 year ago

@chensuyue I have tried the case in examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static, but cannot reproduce the results shown in https://intel.github.io/neural-compressor/latest/docs/source/validated_model_list.html. The full log is as follows:

2023-04-20 19:22:12 [WARNING] Force convert framework model to neural_compressor model.
2023-04-20 19:22:12 [INFO] Start auto tuning.
2023-04-20 19:22:12 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2023-04-20 19:22:12 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2023-04-20 19:22:12 [INFO] Adaptor has 4 recipes.
2023-04-20 19:22:12 [INFO] 0 recipes specified by user.
2023-04-20 19:22:12 [INFO] 3 recipes require future tuning.
2023-04-20 19:22:12 [INFO] *** Initialize auto tuning
2023-04-20 19:22:12 [INFO] Get FP32 model baseline.
2023-04-20 19:29:57 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-20_19-22-09/./history.snapshot.
2023-04-20 19:29:57 [INFO] FP32 baseline is: [Accuracy: 0.6689, Duration (seconds): 464.4234]
2023-04-20 19:29:57 [INFO] Quantize the model with default config.
2023-04-20 19:30:03 [WARNING] Per-channel support with QDQ format requires opset version >= 13, use per-tensor granularity instead
2023-04-20 19:30:04 [INFO] |********Mixed Precision Statistics*******|
2023-04-20 19:30:04 [INFO] +-------------------+-------+------+------+
2023-04-20 19:30:04 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-20 19:30:04 [INFO] +-------------------+-------+------+------+
2023-04-20 19:30:04 [INFO] |        Conv       |   52  |  52  |  0   |
2023-04-20 19:30:04 [INFO] |       MatMul      |   1   |  1   |  0   |
2023-04-20 19:30:04 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-20 19:30:04 [INFO] |   QuantizeLinear  |   66  |  66  |  0   |
2023-04-20 19:30:04 [INFO] |  DequantizeLinear |  171  | 171  |  0   |
2023-04-20 19:30:04 [INFO] +-------------------+-------+------+------+
2023-04-20 19:30:04 [INFO] Pass quantize model elapsed time: 6976.57 ms
2023-04-20 19:37:43 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.6354|0.6689, Duration (seconds) (int8|fp32): 458.7311|464.4234], Best tune result is: n/a
2023-04-20 19:37:43 [INFO] |***********************Tune Result Statistics**********************|
2023-04-20 19:37:43 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:37:43 [INFO] |     Info Type      |  Baseline | Tune 1 result | Best tune result |
2023-04-20 19:37:43 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:37:43 [INFO] |      Accuracy      |  0.6689   |    0.6354     |       n/a        |
2023-04-20 19:37:43 [INFO] | Duration (seconds) | 464.4234  |   458.7311    |       n/a        |
2023-04-20 19:37:43 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:37:43 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-20_19-22-09/./history.snapshot.
2023-04-20 19:37:43 [INFO] *** Start conservative tuning.
2023-04-20 19:37:43 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2023-04-20 19:37:43 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2023-04-20 19:37:43 [INFO] Adaptor has 4 recipes.
2023-04-20 19:37:43 [INFO] 0 recipes specified by user.
2023-04-20 19:37:43 [INFO] 3 recipes require future tuning.
2023-04-20 19:37:43 [INFO] FP32 baseline is: [Accuracy: 0.6689, Duration (seconds): 464.4234]
2023-04-20 19:37:43 [INFO] *** Try to convert op into lower precision to improve performance.
2023-04-20 19:37:43 [INFO] *** Start to convert op into int8.
2023-04-20 19:37:43 [INFO] *** Try to convert all conv ops into int8.
2023-04-20 19:37:49 [WARNING] Per-channel support with QDQ format requires opset version >= 13, use per-tensor granularity instead
2023-04-20 19:37:49 [INFO] |********Mixed Precision Statistics*******|
2023-04-20 19:37:49 [INFO] +-------------------+-------+------+------+
2023-04-20 19:37:49 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-20 19:37:49 [INFO] +-------------------+-------+------+------+
2023-04-20 19:37:49 [INFO] |        Conv       |   52  |  52  |  0   |
2023-04-20 19:37:49 [INFO] |       MatMul      |   1   |  0   |  1   |
2023-04-20 19:37:49 [INFO] |        Clip       |   35  |  0   |  35  |
2023-04-20 19:37:49 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-20 19:37:49 [INFO] |   QuantizeLinear  |   97  |  97  |  0   |
2023-04-20 19:37:49 [INFO] |  DequantizeLinear |  201  | 201  |  0   |
2023-04-20 19:37:49 [INFO] +-------------------+-------+------+------+
2023-04-20 19:37:49 [INFO] Pass quantize model elapsed time: 6358.3 ms
2023-04-20 19:45:38 [INFO] Tune 2 result is: [Accuracy (int8|fp32): 0.6354|0.6689, Duration (seconds) (int8|fp32): 469.2151|464.4234], Best tune result is: n/a
2023-04-20 19:45:38 [INFO] |***********************Tune Result Statistics**********************|
2023-04-20 19:45:38 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:45:38 [INFO] |     Info Type      |  Baseline | Tune 2 result | Best tune result |
2023-04-20 19:45:38 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:45:38 [INFO] |      Accuracy      |  0.6689   |    0.6354     |       n/a        |
2023-04-20 19:45:38 [INFO] | Duration (seconds) | 464.4234  |   469.2151    |       n/a        |
2023-04-20 19:45:38 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:45:38 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-20_19-22-09/./history.snapshot.
2023-04-20 19:45:38 [INFO] *** Convert all conv ops to int8 but accuracy not meet the requirements
2023-04-20 19:45:38 [INFO] ***Current result dict_items([('conv', 'fp32'), ('matmul', None), ('linear', None)])
2023-04-20 19:45:38 [INFO] *** Try to convert all matmul ops into int8.
2023-04-20 19:45:40 [WARNING] Per-channel support with QDQ format requires opset version >= 13, use per-tensor granularity instead
2023-04-20 19:45:40 [INFO] |********Mixed Precision Statistics*******|
2023-04-20 19:45:40 [INFO] +-------------------+-------+------+------+
2023-04-20 19:45:40 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-20 19:45:40 [INFO] +-------------------+-------+------+------+
2023-04-20 19:45:40 [INFO] |        Conv       |   52  |  0   |  52  |
2023-04-20 19:45:40 [INFO] |       MatMul      |   1   |  1   |  0   |
2023-04-20 19:45:40 [INFO] |        Clip       |   35  |  0   |  35  |
2023-04-20 19:45:40 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-20 19:45:40 [INFO] |   QuantizeLinear  |   2   |  2   |  0   |
2023-04-20 19:45:40 [INFO] |  DequantizeLinear |   3   |  3   |  0   |
2023-04-20 19:45:40 [INFO] +-------------------+-------+------+------+
2023-04-20 19:45:40 [INFO] Pass quantize model elapsed time: 1575.32 ms
2023-04-20 19:53:36 [INFO] Tune 3 result is: [Accuracy (int8|fp32): 0.6684|0.6689, Duration (seconds) (int8|fp32): 475.8282|464.4234], Best tune result is: [Accuracy: 0.6684, Duration (seconds): 475.8282]
2023-04-20 19:53:36 [INFO] |***********************Tune Result Statistics**********************|
2023-04-20 19:53:36 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:53:36 [INFO] |     Info Type      |  Baseline | Tune 3 result | Best tune result |
2023-04-20 19:53:36 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:53:36 [INFO] |      Accuracy      |  0.6689   |    0.6684     |     0.6684       |
2023-04-20 19:53:36 [INFO] | Duration (seconds) | 464.4234  |   475.8282    |    475.8282      |
2023-04-20 19:53:36 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-20 19:53:36 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-20_19-22-09/./history.snapshot.
2023-04-20 19:53:36 [INFO] *** Do not stop the tuning process, re-quantize the ops.
2023-04-20 19:53:36 [INFO] *** Convert all matmul ops to int8 and accuracy still meet the requirements
2023-04-20 19:53:36 [INFO] ***Current result dict_items([('conv', 'fp32'), ('matmul', 'int8'), ('linear', None)])
2023-04-20 19:53:36 [INFO] *** Ending tuning process due to no quantifiable op left.
2023-04-20 19:53:36 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.

Although the final accuracy is 0.6684, only the MatMul op is quantized to INT8 while all the other ops stay in FP32, which is not what we actually need.
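
For reference, a quick way to check which op types actually end up quantized is to count the node types in the tuned model with the onnx package. A minimal sketch; the filename is a placeholder for whatever --output_model was set to:

import collections

import onnx

# Load the tuned model (filename is hypothetical).
model = onnx.load("mobilenet_v2_qdq.onnx")

# Count node types; the result should mirror the "Mixed Precision Statistics" table.
op_counts = collections.Counter(node.op_type for node in model.graph.node)
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type:>18}: {count}")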


chensuyue commented 1 year ago

The results shown in https://intel.github.io/neural-compressor/latest/docs/source/validated_model_list.html were measured with INC v2.0; we will soon update them with data measured with v2.1.
Did you get these results with INC 2.1 + ONNXRT 1.13.1?
Please also let me know the torch and torchvision versions used to export the ONNX model, so I can try to reproduce your results.

In our test with INC 2.1 + ONNXRT 1.13.1 it shows:

2023-04-09 18:52:30 [INFO] |********Mixed Precision Statistics*******|
2023-04-09 18:52:30 [INFO] +-------------------+-------+------+------+
2023-04-09 18:52:30 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-09 18:52:30 [INFO] +-------------------+-------+------+------+
2023-04-09 18:52:30 [INFO] |        Conv       |   52  |  52  |  0   |
2023-04-09 18:52:30 [INFO] |       Gather      |   1   |  0   |  1   |
2023-04-09 18:52:30 [INFO] |       MatMul      |   1   |  1   |  0   |
2023-04-09 18:52:30 [INFO] | GlobalAveragePool |   1   |  1   |  0   |
2023-04-09 18:52:30 [INFO] |        Add        |   11  |  11  |  0   |
2023-04-09 18:52:30 [INFO] |      Reshape      |   1   |  1   |  0   |
2023-04-09 18:52:30 [INFO] |       Concat      |   1   |  0   |  1   |
2023-04-09 18:52:30 [INFO] |     Unsqueeze     |   1   |  0   |  1   |
2023-04-09 18:52:30 [INFO] |   QuantizeLinear  |   1   |  1   |  0   |
2023-04-09 18:52:30 [INFO] |  DequantizeLinear |   2   |  2   |  0   |
2023-04-09 18:52:30 [INFO] +-------------------+-------+------+------+
2023-04-09 18:52:30 [INFO] Pass quantize model elapsed time: 11602.36 ms
2023-04-09 18:59:50 [DEBUG] Best acc is 0.65492.
2023-04-09 18:59:50 [DEBUG] *** Update the best qmodel with the result (0.65492, [440.14830350875854])
2023-04-09 18:59:50 [DEBUG] *** Accuracy not meets the requirements, do not update the best qmodel.
2023-04-09 18:59:50 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.6549|0.6689, Duration (seconds) (int8|fp32): 440.1483|584.6360], Best tune result is: [Accuracy: 0.6549, Duration (seconds): 440.1483]
2023-04-09 18:59:50 [INFO] |***********************Tune Result Statistics**********************|
2023-04-09 18:59:50 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-09 18:59:50 [INFO] |     Info Type      |  Baseline | Tune 1 result | Best tune result |
2023-04-09 18:59:50 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-09 18:59:50 [INFO] |      Accuracy      |  0.6689   |    0.6549     |     0.6549       |
2023-04-09 18:59:50 [INFO] | Duration (seconds) | 584.6360  |   440.1483    |    440.1483      |
2023-04-09 18:59:50 [INFO] +--------------------+-----------+---------------+------------------+
SunCrazy commented 1 year ago

Hi @chensuyue, thanks for your response! Here is my environment:

>>> import neural_compressor as inc
>>> inc.__version__
'2.1'
>>> import onnxruntime
>>> onnxruntime.__version__
'1.14.1'
>>> import torch
>>> torch.__version__
'2.0.0+cu117'
>>> import torchvision
>>> torchvision.__version__
'0.15.1+cu117'
>>>
SunCrazy commented 1 year ago

Hi @chensuyue, I noticed some ops (Gather/Concat/Unsqueeze) in your statistics above that are not present in my ONNX model. Are you sure you ran the ONNX MobileNetV2 model?

I converted the PyTorch model to ONNX mobilenet_v2 with the following code:

import torch
import torchvision
batch_size = 1
model = torchvision.models.mobilenet_v2(pretrained=True)
x = torch.randn(batch_size, 3, 224, 224)

# Export the model
torch.onnx.export(model,               # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "mobilenet_v2.onnx",           # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=11,          # the ONNX version to export the model to, please ensure at least 11.
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                                'output' : {0 : 'batch_size'}})
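
As a side note, the exported opset version can be verified with the onnx package (a minimal sketch):

import onnx

model = onnx.load("mobilenet_v2.onnx")
onnx.checker.check_model(model)  # raises if the model is structurally invalid

# The entry with an empty domain is the default ONNX opset; it should print 11
# for the export above (or 13 after re-exporting with opset_version=13).
for opset in model.opset_import:
    print(opset.domain or "ai.onnx", opset.version)
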
SunCrazy commented 1 year ago

Hi @chensuyue, I have tested the same case with INC 2.1 + onnxruntime 1.13.1 + torch 1.13.0, and I still get different results from yours.

In addition, to rule out problems caused by the model conversion, I have tested another case in examples/onnxrt/image_recognition/onnx_model_zoo/mobilenet/quantization/ptq_static, which uses an ONNX model provided by another repo, but I was also unable to obtain the expected results.

chensuyue commented 1 year ago

Hi @SunCrazy, I have reproduced your results and found the root cause.

1. Accuracy gap

You must have exported the model with opset_version=11 and run quantization with --quant_format=QDQ, so your model cannot use per-channel quantization; you can find the warning in your log: 2023-04-20 19:30:03 [WARNING] Per-channel support with QDQ format requires opset version >= 13. The log I sent you is also from a model with opset_version=11, but it uses the default quant_format.
There are 2 solutions: 1. export the model with opset_version=13, or 2. quantize with the default quant_format (see the config sketch at the end of this comment).

2. Op list difference

This should be caused by a torch or torchvision version difference: my test machine converted and stored the FP32 model a long time ago, so it presumably used a quite old version. I also tried with a new version; it gets the same accuracy, although the op list is a little different.

AR for me

We will update the README (https://github.com/intel/neural-compressor/blob/master/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/README.md#2-prepare-model) to set the default opset_version=13, so that either quant_format works as expected.
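
For reference, the two quant_format options map onto the PostTrainingQuantConfig that the example's main.py builds from the --quant_format flag and passes to neural_compressor.quantization.fit along with its calibration dataloader and eval function. A minimal sketch of the two configurations, assuming the INC 2.x API:

from neural_compressor.config import PostTrainingQuantConfig

# Solution 2: keep the opset-11 model and use the library's default
# quant_format (the QOperator representation in these examples), which avoids
# the "per-channel requires opset >= 13" restriction.
config_default = PostTrainingQuantConfig(approach="static")

# Solution 1: after re-exporting the model with opset_version=13, the QDQ
# format can apply per-channel granularity as intended.
config_qdq = PostTrainingQuantConfig(approach="static", quant_format="QDQ")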

SunCrazy commented 1 year ago

Hi @chensuyue, even after converting the ONNX model with opset 13, I still don't get the expected results.

My environment is as follows:

Python packages:

Package                      Version
---------------------------- ----------
absl-py                      1.4.0
alembic                      1.7.7
astunparse                   1.6.3
bidict                       0.22.1
cachetools                   5.3.0
certifi                      2022.12.7
cffi                         1.15.1
charset-normalizer           3.0.1
click                        8.1.3
cmake                        3.26.3
coloredlogs                  15.0.1
contextlib2                  21.6.0
contourpy                    1.0.7
cryptography                 39.0.1
cycler                       0.11.0
Deprecated                   1.2.13
filelock                     3.12.0
Flask                        2.2.3
Flask-Cors                   3.0.10
Flask-SocketIO               5.3.2
flatbuffers                  23.1.21
fonttools                    4.38.0
gast                         0.4.0
gevent                       22.10.2
gevent-websocket             0.10.1
google-auth                  2.16.1
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
greenlet                     2.0.2
grpcio                       1.51.3
h5py                         3.8.0
humanfriendly                10.0
idna                         3.4
importlib-metadata           6.0.0
importlib-resources          5.12.0
itsdangerous                 2.1.2
Jinja2                       3.1.2
joblib                       1.2.0
keras                        2.11.0
kiwisolver                   1.4.4
libclang                     15.0.6.1
lit                          16.0.1
Mako                         1.2.4
Markdown                     3.4.1
MarkupSafe                   2.1.2
matplotlib                   3.7.0
mpmath                       1.3.0
networkx                     3.1
neural-compressor            2.1
numpy                        1.24.2
nvidia-cublas-cu11           11.10.3.66
nvidia-cuda-cupti-cu11       11.7.101
nvidia-cuda-nvrtc-cu11       11.7.99
nvidia-cuda-runtime-cu11     11.7.99
nvidia-cudnn-cu11            8.5.0.96
nvidia-cufft-cu11            10.9.0.58
nvidia-curand-cu11           10.2.10.91
nvidia-cusolver-cu11         11.4.0.1
nvidia-cusparse-cu11         11.7.4.91
nvidia-nccl-cu11             2.14.3
nvidia-nvtx-cu11             11.7.91
oauthlib                     3.2.2
onnx                         1.13.1
onnxruntime                  1.13.1
onnxruntime-extensions       0.7.0
opencv-python                4.7.0.72
opt-einsum                   3.3.0
packaging                    23.0
pandas                       1.5.3
Pillow                       9.4.0
pip                          23.0.1
prettytable                  3.6.0
protobuf                     3.20.3
psutil                       5.9.4
py-cpuinfo                   9.0.0
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycocotools                  2.0.6
pycparser                    2.21
pyparsing                    3.0.9
python-dateutil              2.8.2
python-engineio              4.3.4
python-socketio              5.7.2
pytz                         2022.7.1
PyYAML                       6.0
requests                     2.28.2
requests-oauthlib            1.3.1
rsa                          4.9
schema                       0.7.5
scikit-learn                 1.2.1
scipy                        1.10.1
setuptools                   67.1.0
six                          1.16.0
SQLAlchemy                   1.4.27
sympy                        1.11.1
tensorboard                  2.11.2
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.11.0
tensorflow-estimator         2.11.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor                    2.2.0
threadpoolctl                3.1.0
torch                        1.13.0
torchvision                  0.14.0
triton                       2.0.0
typing_extensions            4.5.0
urllib3                      1.26.14
wcwidth                      0.2.6
Werkzeug                     2.2.3
wheel                        0.38.4
wrapt                        1.15.0
zipp                         3.15.0
zope.event                   4.6
zope.interface               5.5.2

Steps to quantize and tune the model:

1. Convert to ONNX:
import torch
import torchvision
batch_size = 1
model = torchvision.models.mobilenet_v2(pretrained=True)
x = torch.randn(batch_size, 3, 224, 224)

# Export the model
torch.onnx.export(model,               # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "mobilenet_v2_tv0.14_op13.onnx",           # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=13,          # the ONNX version to export the model to, please ensure at least 11.
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                                'output' : {0 : 'batch_size'}})

Some log output from the conversion:

/mnt/ssd/chenf/software/pyenv/neural-compressor/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/mnt/ssd/chenf/software/pyenv/neural-compressor/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MobileNet_V2_Weights.IMAGENET1K_V1`. You can also use `weights=MobileNet_V2_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
2. Tune the model:
bash run_tuning.sh --input_model=mobilenet_v2_tv0.14_op13.onnx --dataset_location=/mnt/ssd/share/ILSVRC2012_img_val/ --output_model=mobilenet_v2_tv0.14_op13_qdq.onnx --label_path=val.txt --quant_format=QDQ

Run log:

+ main --input_model=mobilenet_v2_tv0.14_op13.onnx --dataset_location=/mnt/ssd/share/ILSVRC2012_img_val/ --output_model=mobilenet_v2_tv0.14_op13_qdq.onnx --label_path=val.txt --quant_format=QDQ
+ init_params --input_model=mobilenet_v2_tv0.14_op13.onnx --dataset_location=/mnt/ssd/share/ILSVRC2012_img_val/ --output_model=mobilenet_v2_tv0.14_op13_qdq.onnx --label_path=val.txt --quant_format=QDQ
+ for var in "$@"
+ case $var in
++ echo --input_model=mobilenet_v2_tv0.14_op13.onnx
++ cut -f2 -d=
+ input_model=mobilenet_v2_tv0.14_op13.onnx
+ for var in "$@"
+ case $var in
++ echo --dataset_location=/mnt/ssd/share/ILSVRC2012_img_val/
++ cut -f2 -d=
+ dataset_location=/mnt/ssd/share/ILSVRC2012_img_val/
+ for var in "$@"
+ case $var in
++ echo --output_model=mobilenet_v2_tv0.14_op13_qdq.onnx
++ cut -f2 -d=
+ output_model=mobilenet_v2_tv0.14_op13_qdq.onnx
+ for var in "$@"
+ case $var in
++ echo --label_path=val.txt
++ cut -f2 -d=
+ label_path=val.txt
+ for var in "$@"
+ case $var in
++ echo --quant_format=QDQ
++ cut -f2 -d=
+ quant_format=QDQ
+ run_tuning
+ python main.py --model_path mobilenet_v2_tv0.14_op13.onnx --dataset_location /mnt/ssd/share/ILSVRC2012_img_val/ --label_path val.txt --output_model mobilenet_v2_tv0.14_op13_qdq.onnx --quant_format QDQ --tune
2023-04-25 10:58:22.764579: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-25 10:58:23.793518: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /mnt/ssd/chenf/software/pyenv/neural-compressor/lib/python3.8/site-packages/cv2/../../lib64:/mnt/ssd/chenf/software/cuda11.7/lib64:
2023-04-25 10:58:23.793649: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /mnt/ssd/chenf/software/pyenv/neural-compressor/lib/python3.8/site-packages/cv2/../../lib64:/mnt/ssd/chenf/software/cuda11.7/lib64:
2023-04-25 10:58:23.793666: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-04-25 10:58:24 [WARNING] Force convert framework model to neural_compressor model.
2023-04-25 10:58:25 [INFO] Start auto tuning.
2023-04-25 10:58:25 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2023-04-25 10:58:25 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2023-04-25 10:58:25 [INFO] Adaptor has 4 recipes.
2023-04-25 10:58:25 [INFO] 0 recipes specified by user.
2023-04-25 10:58:25 [INFO] 3 recipes require future tuning.
2023-04-25 10:58:25 [INFO] *** Initialize auto tuning
2023-04-25 10:58:25 [INFO] Get FP32 model baseline.
2023-04-25 11:06:25 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-25_10-58-21/./history.snapshot.
2023-04-25 11:06:25 [INFO] FP32 baseline is: [Accuracy: 0.6689, Duration (seconds): 480.2326]
2023-04-25 11:06:25 [INFO] Quantize the model with default config.
2023-04-25 11:06:32 [INFO] |********Mixed Precision Statistics*******|
2023-04-25 11:06:32 [INFO] +-------------------+-------+------+------+
2023-04-25 11:06:32 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-25 11:06:32 [INFO] +-------------------+-------+------+------+
2023-04-25 11:06:32 [INFO] |        Conv       |   52  |  52  |  0   |
2023-04-25 11:06:32 [INFO] |       MatMul      |   1   |  1   |  0   |
2023-04-25 11:06:32 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-25 11:06:32 [INFO] |   QuantizeLinear  |   66  |  66  |  0   |
2023-04-25 11:06:32 [INFO] |  DequantizeLinear |  171  | 171  |  0   |
2023-04-25 11:06:32 [INFO] +-------------------+-------+------+------+
2023-04-25 11:06:32 [INFO] Pass quantize model elapsed time: 7174.15 ms
2023-04-25 11:14:32 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.5866|0.6689, Duration (seconds) (int8|fp32): 479.3143|480.2326], Best tune result is: n/a
2023-04-25 11:14:32 [INFO] |***********************Tune Result Statistics**********************|
2023-04-25 11:14:32 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:14:32 [INFO] |     Info Type      |  Baseline | Tune 1 result | Best tune result |
2023-04-25 11:14:32 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:14:32 [INFO] |      Accuracy      |  0.6689   |    0.5866     |       n/a        |
2023-04-25 11:14:32 [INFO] | Duration (seconds) | 480.2326  |   479.3143    |       n/a        |
2023-04-25 11:14:32 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:14:32 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-25_10-58-21/./history.snapshot.
2023-04-25 11:14:32 [INFO] *** Start conservative tuning.
2023-04-25 11:14:32 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2023-04-25 11:14:32 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
2023-04-25 11:14:32 [INFO] Adaptor has 4 recipes.
2023-04-25 11:14:32 [INFO] 0 recipes specified by user.
2023-04-25 11:14:32 [INFO] 3 recipes require future tuning.
2023-04-25 11:14:32 [INFO] FP32 baseline is: [Accuracy: 0.6689, Duration (seconds): 480.2326]
2023-04-25 11:14:32 [INFO] *** Try to convert op into lower precision to improve performance.
2023-04-25 11:14:32 [INFO] *** Start to convert op into int8.
2023-04-25 11:14:32 [INFO] *** Try to convert all conv ops into int8.
2023-04-25 11:14:39 [INFO] |********Mixed Precision Statistics*******|
2023-04-25 11:14:39 [INFO] +-------------------+-------+------+------+
2023-04-25 11:14:39 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-25 11:14:39 [INFO] +-------------------+-------+------+------+
2023-04-25 11:14:39 [INFO] |        Conv       |   52  |  52  |  0   |
2023-04-25 11:14:39 [INFO] |       MatMul      |   1   |  0   |  1   |
2023-04-25 11:14:39 [INFO] |        Clip       |   35  |  0   |  35  |
2023-04-25 11:14:39 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-25 11:14:39 [INFO] |   QuantizeLinear  |   97  |  97  |  0   |
2023-04-25 11:14:39 [INFO] |  DequantizeLinear |  201  | 201  |  0   |
2023-04-25 11:14:39 [INFO] +-------------------+-------+------+------+
2023-04-25 11:14:39 [INFO] Pass quantize model elapsed time: 6998.1 ms
2023-04-25 11:22:46 [INFO] Tune 2 result is: [Accuracy (int8|fp32): 0.5869|0.6689, Duration (seconds) (int8|fp32): 486.7922|480.2326], Best tune result is: n/a
2023-04-25 11:22:46 [INFO] |***********************Tune Result Statistics**********************|
2023-04-25 11:22:46 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:22:46 [INFO] |     Info Type      |  Baseline | Tune 2 result | Best tune result |
2023-04-25 11:22:46 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:22:46 [INFO] |      Accuracy      |  0.6689   |    0.5869     |       n/a        |
2023-04-25 11:22:46 [INFO] | Duration (seconds) | 480.2326  |   486.7922    |       n/a        |
2023-04-25 11:22:46 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:22:46 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-25_10-58-21/./history.snapshot.
2023-04-25 11:22:46 [INFO] *** Convert all conv ops to int8 but accuracy not meet the requirements
2023-04-25 11:22:46 [INFO] ***Current result dict_items([('conv', 'fp32'), ('matmul', None), ('linear', None)])
2023-04-25 11:22:46 [INFO] *** Try to convert all matmul ops into int8.
2023-04-25 11:22:47 [INFO] |********Mixed Precision Statistics*******|
2023-04-25 11:22:47 [INFO] +-------------------+-------+------+------+
2023-04-25 11:22:47 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-25 11:22:47 [INFO] +-------------------+-------+------+------+
2023-04-25 11:22:47 [INFO] |        Conv       |   52  |  0   |  52  |
2023-04-25 11:22:47 [INFO] |       MatMul      |   1   |  1   |  0   |
2023-04-25 11:22:47 [INFO] |        Clip       |   35  |  0   |  35  |
2023-04-25 11:22:47 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-25 11:22:47 [INFO] |   QuantizeLinear  |   2   |  2   |  0   |
2023-04-25 11:22:47 [INFO] |  DequantizeLinear |   3   |  3   |  0   |
2023-04-25 11:22:47 [INFO] +-------------------+-------+------+------+
2023-04-25 11:22:47 [INFO] Pass quantize model elapsed time: 1484.04 ms
2023-04-25 11:30:54 [INFO] Tune 3 result is: [Accuracy (int8|fp32): 0.6685|0.6689, Duration (seconds) (int8|fp32): 486.5853|480.2326], Best tune result is: [Accuracy: 0.6685, Duration (seconds): 486.5853]
2023-04-25 11:30:54 [INFO] |***********************Tune Result Statistics**********************|
2023-04-25 11:30:54 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:30:54 [INFO] |     Info Type      |  Baseline | Tune 3 result | Best tune result |
2023-04-25 11:30:54 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:30:54 [INFO] |      Accuracy      |  0.6689   |    0.6685     |     0.6685       |
2023-04-25 11:30:54 [INFO] | Duration (seconds) | 480.2326  |   486.5853    |    486.5853      |
2023-04-25 11:30:54 [INFO] +--------------------+-----------+---------------+------------------+
2023-04-25 11:30:54 [INFO] Save tuning history to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-25_10-58-21/./history.snapshot.
2023-04-25 11:30:54 [INFO] *** Do not stop the tuning process, re-quantize the ops.
2023-04-25 11:30:54 [INFO] *** Convert all matmul ops to int8 and accuracy still meet the requirements
2023-04-25 11:30:54 [INFO] ***Current result dict_items([('conv', 'fp32'), ('matmul', 'int8'), ('linear', None)])
2023-04-25 11:30:54 [INFO] *** Ending tuning process due to no quantifiable op left.
2023-04-25 11:30:54 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2023-04-25 11:30:54 [INFO] Save deploy yaml to /mnt/ssd/chenf/opensource/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-25_10-58-21/deploy.yaml

The final quantization config and model accuracy are still not consistent with yours.

chensuyue commented 1 year ago

Hi @SunCrazy, I couldn't reproduce your results with exactly the same config. Could you share the FP32 model and the quantized model with me? I want to verify your models in my environment.

My local env:

(onnxrt-1.13.1-3.8-clx070-8280) [tensorflow@mlt-clx070 ptq_static]$ pip list | grep torch
torch                        1.13.0
torchvision                  0.14.0
(onnxrt-1.13.1-3.8-clx070-8280) [tensorflow@mlt-clx070 ptq_static]$ pip list | grep onnx
onnx                         1.13.1
onnxruntime                  1.13.1
onnxruntime-extensions       0.7.0
(onnxrt-1.13.1-3.8-clx070-8280) [tensorflow@mlt-clx070 ptq_static]$ pip list | grep neural
neural-compressor            2.1

Quantize command:

bash run_tuning.sh --dataset_location=/tf_dataset2/datasets/imagenet/ImagenetRaw/ILSVRC2012_img_val --input_model=mobilenet_v2_13.onnx --output_model=onnxrt-mobilenet_v2_13-tune.onnx --quant_format=QDQ

Result:

2023-04-26 10:59:27 [INFO] *** Initialize auto tuning
2023-04-26 10:59:27 [INFO] Get FP32 model baseline.
2023-04-26 11:28:59 [INFO] Save tuning history to /home/tensorflow/suyue/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-26_10-59-21/./history.snapshot.
2023-04-26 11:28:59 [INFO] FP32 baseline is: [Accuracy: 0.6689, Duration (seconds): 1771.0930]
2023-04-26 11:28:59 [INFO] Quantize the model with default config.
2023-04-26 11:29:05 [INFO] |********Mixed Precision Statistics*******|
2023-04-26 11:29:05 [INFO] +-------------------+-------+------+------+
2023-04-26 11:29:05 [INFO] |      Op Type      | Total | INT8 | FP32 |
2023-04-26 11:29:05 [INFO] +-------------------+-------+------+------+
2023-04-26 11:29:05 [INFO] |        Conv       |   52  |  52  |  0   |
2023-04-26 11:29:05 [INFO] |       MatMul      |   1   |  1   |  0   |
2023-04-26 11:29:05 [INFO] | GlobalAveragePool |   1   |  0   |  1   |
2023-04-26 11:29:05 [INFO] |   QuantizeLinear  |   66  |  66  |  0   |
2023-04-26 11:29:05 [INFO] |  DequantizeLinear |  171  | 171  |  0   |
2023-04-26 11:29:05 [INFO] +-------------------+-------+------+------+
2023-04-26 11:29:05 [INFO] Pass quantize model elapsed time: 6908.29 ms
2023-04-26 11:39:53 [INFO] Tune 1 result is: [Accuracy (int8|fp32): 0.6549|0.6689, Duration (seconds) (int8|fp32): 647.3306|1771.0930], Best tune result is: [Accuracy: 0.6549, Duration (seconds): 647.3306]
2023-04-26 11:39:53 [INFO] |***********************Tune Result Statistics***********************|
2023-04-26 11:39:53 [INFO] +--------------------+------------+---------------+------------------+
2023-04-26 11:39:53 [INFO] |     Info Type      |  Baseline  | Tune 1 result | Best tune result |
2023-04-26 11:39:53 [INFO] +--------------------+------------+---------------+------------------+
2023-04-26 11:39:53 [INFO] |      Accuracy      |  0.6689    |    0.6549     |     0.6549       |
2023-04-26 11:39:53 [INFO] | Duration (seconds) | 1771.0930  |   647.3306    |    647.3306      |
2023-04-26 11:39:53 [INFO] +--------------------+------------+---------------+------------------+
2023-04-26 11:39:53 [INFO] Save tuning history to /home/tensorflow/suyue/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-26_10-59-21/./history.snapshot.
2023-04-26 11:39:53 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2023-04-26 11:39:53 [INFO] Save deploy yaml to /home/tensorflow/suyue/neural-compressor/examples/onnxrt/image_recognition/mobilenet_v2/quantization/ptq_static/nc_workspace/2023-04-26_10-59-21/deploy.yaml
chensuyue commented 1 year ago

Another question: how many images are in /mnt/ssd/share/ILSVRC2012_img_val? Is it the standard validation dataset with 50,000 images?

SunCrazy commented 1 year ago

Another question: how many images are in /mnt/ssd/share/ILSVRC2012_img_val? Is it the standard validation dataset with 50,000 images?

Yes, it is the standard validation dataset.
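
A quick standard-library check of the image count (a minimal sketch; the extension filter assumes the usual .JPEG files):

import pathlib

# The standard ILSVRC2012 validation set contains 50,000 images.
val_dir = pathlib.Path("/mnt/ssd/share/ILSVRC2012_img_val")
num_images = sum(1 for p in val_dir.rglob("*") if p.suffix.lower() in {".jpeg", ".jpg"})
print(num_images)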

It's so strange! Even when I run the case in Docker (built with the Dockerfile provided by the repo), I still get the wrong results. I'm not sure where the problem is anymore.

In addition, I cannot give you the FP32 ONNX model directly right now.

Converted ONNX model MD5: be295389a0fd682f60c8d2a9554010e7

If we use the same torchvision version, the checksums should be identical.
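
A minimal sketch for computing the checksum on either side, using only the standard library:

import hashlib

def md5_of(path: str) -> str:
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of("mobilenet_v2_tv0.14_op13.onnx"))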

I will give you the fp32 onnx model later.

chensuyue commented 1 year ago

Sorry for the late reply; are you still working on this model?

SunCrazy commented 1 year ago

Sorry for the late reply; are you still working on this model?

Sorry, I gave up after running the case in Docker (built with the Dockerfile provided by the repo). Maybe I will try again next time.

Thanks