intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Quantization Not Fusing Pad to Conv2D #83

Closed · matthew-olson-intel closed this issue 2 years ago

matthew-olson-intel commented 2 years ago

Recently, I quantized a pre-trained ResNet50 model from fp32 to int8, and the performance isn't what I expected: only about a 2x speedup over the equivalent fp32 model. Investigating further, I noticed in Netron that INC didn't seem to fuse Pad with any later operations.

I've got two int8 models:

  1. Model A: Model that I downloaded from the Intel Model Zoo (https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50/inference/int8/README.md).
  2. Model B: Model that I quantized using INC, using the example instructions (https://github.com/intel/neural-compressor/tree/master/examples/tensorflow/object_detection/tensorflow_models/quantization/ptq).

Model A is available from the above link, but here's Model B. GitHub wouldn't allow me to upload PB files, so I changed the file extension to .zip.

My question is this: does INC support fusing Pad operations, or am I running the scripts incorrectly?
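
For reference, here's roughly how I understand the PTQ flow maps onto INC's Python API (a rough sketch using the INC 1.x experimental API; the yaml and pb names are placeholders from my setup, and I'm assuming the yaml supplies the calibration dataloader):

# Rough sketch of the post-training quantization flow (INC 1.x experimental API).
# File names are placeholders; the yaml is assumed to define the calibration dataset.
from neural_compressor.experimental import Quantization, common

quantizer = Quantization('resnet50_v1_5.yaml')    # tuning/calibration config
quantizer.model = common.Model('resnet50_v1.pb')  # fp32 frozen graph
q_model = quantizer.fit()                         # calibrate, quantize, tune
q_model.save('int8.pb')                           # write out the int8 graph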

matthew-olson-intel commented 2 years ago

To help visualize the issue I'm facing, here's the best I could do in a single screenshot. The graph on the left is Model B, and the one on the right is Model A.

[Screenshot: Netron graphs of Model B (left) and Model A (right)]
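
For anyone who wants to reproduce the comparison without Netron, a rough sketch like the following (plain TensorFlow; the two file names are placeholders for Model A and Model B) counts the op types in each frozen graph, which makes the standalone Pad nodes in Model B easy to spot:

# Rough sketch: count op types in two frozen graphs to compare how much got fused.
# File names are placeholders for Model A (Model Zoo) and Model B (my INC output).
from collections import Counter
import tensorflow as tf

def op_histogram(pb_path):
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    return Counter(node.op for node in graph_def.node)

hist_a = op_histogram('model_a_int8.pb')
hist_b = op_histogram('model_b_int8.pb')
for op in sorted(set(hist_a) | set(hist_b)):
    print(f'{op:45s} A={hist_a[op]:4d}  B={hist_b[op]:4d}')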

matthew-olson-intel commented 2 years ago

Oh, I forgot to include the script that I'm using to quantize the model. Here it is in its entirety.

#!/bin/bash
# Resolve the directory this script lives in.
export BASEDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

PYTHON="python3"

# Create a fresh virtual environment with Intel TensorFlow and INC installed.
cd "${BASEDIR}/scripts"
rm -rf inc_env
${PYTHON} -m venv inc_env
source "${BASEDIR}/scripts/inc_env/bin/activate"
pip install intel-tensorflow
pip install neural-compressor

# Run the ResNet50 post-training quantization example shipped with INC.
cd "${BASEDIR}/deps/neural-compressor/examples/tensorflow/image_recognition/tensorflow_models/quantization/ptq/"

bash run_tuning.sh \
    --config=resnet50_v1_5.yaml \
    --input_model="${BASEDIR}/models/resnet50_v1.pb" \
    --output_model="${BASEDIR}/models/int8.pb"

ftian1 commented 2 years ago

Thanks for raising this issue.

INC supports Pad + Conv fusion. The problem is that if the Pad op's padding values are not all zero, the fusion may trigger TensorFlow's int8 op shape inference check and break. The fix in the TF kernel was developed by the Intel TensorFlow team and was only merged into 1.15up3, not TF 2.x. If you use the Intel TensorFlow 1.15up3 release, you should see the fused ops in the resulting pb.

Please note that INC targets broad model coverage rather than model-specific optimization.
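
If it helps to confirm you are hitting this case, a rough sketch like the one below (the input file name is a placeholder for your fp32 graph) lists each Pad node and whether its padding values are all zero:

# Rough sketch: list Pad nodes and whether their padding values are all zero.
# Non-zero paddings are the case described above that breaks the fusion.
# The input file name is a placeholder for the fp32 frozen graph.
import tensorflow as tf
from tensorflow.python.framework import tensor_util

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('resnet50_v1.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

consts = {n.name: n for n in graph_def.node if n.op == 'Const'}
for node in graph_def.node:
    if node.op == 'Pad':
        paddings_name = node.input[1].split(':')[0]  # second input holds the paddings tensor
        const = consts.get(paddings_name)
        if const is None:
            print(node.name, 'paddings are not a Const node')
            continue
        paddings = tensor_util.MakeNdarray(const.attr['value'].tensor)
        print(node.name, 'all-zero paddings:', bool((paddings == 0).all()))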

matthew-olson-intel commented 2 years ago

Thanks for the response!

For this specific use-case, we're unable to use 1.x; we've got to use the latest possible TensorFlow, so we're using 2.8.0.

Do you have a timeline for this fix to be upstreamed to 2.x? If not, can you point us to the PR that went into 1.15up3?

matthew-olson-intel commented 2 years ago

Is this the PR that should enable this feature?

https://github.com/tensorflow/tensorflow/pull/53480

lvliang-intel commented 1 year ago

@matthew-olson-intel The PR https://github.com/tensorflow/tensorflow/pull/53480 is related to Pad + Conv3D fusion, not Pad + Conv2D. The good news is that this issue has already been fixed in TF SPR-Base and will be upstreamed to stock TF soon. You can try the quantization with SPR-Base, as aligned in Teams. Please let us know if you run into any issues.
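
Once you requantize with a build that includes the fix, a quick sanity check is to see whether any standalone Pad nodes remain in the int8 graph. A rough sketch (the file name is a placeholder for your newly generated pb):

# Rough sketch: count the standalone Pad nodes left in the quantized graph.
# The file name is a placeholder for the newly generated int8 pb.
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('int8.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

pad_nodes = [n.name for n in graph_def.node if n.op == 'Pad']
print('remaining standalone Pad nodes:', len(pad_nodes))
for name in pad_nodes:
    print('  ', name)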