intel / tools


RetinaNet Quantization #1

Open · felipheggaliza opened this issue 5 years ago

felipheggaliza commented 5 years ago

Hi,

I am trying to quantize the RetinaNet topology trained on TensorFlow, but I am getting an error. These are the steps I followed based on these instructions https://github.com/IntelAI/tools/tree/master/tensorflow_quantization:

1) I was able to generate an optimized_graph.pb using the command:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=/workspace/quantization/frozen_inference_graph.pb --out_graph=/workspace/quantization/optimized_graph.pb --inputs="input_1" --outputs="bboxes,scores,classes" --transforms="fold_batch_norms"

2) But when I tried to run the quantization using this command:

python tensorflow/tools/quantization/quantize_graph.py --input=/workspace/quantization/optimized_graph.pb --output=/workspace/quantization/quantized_dynamic_range_graph.pb --output_node_names="bboxes,scores,classes" --mode=eightbit --intel_cpu_eightbitize=True

I got this error:

W0422 17:10:22.236689 140385778120448 deprecation.py:323] From tensorflow/tools/quantization/quantize_graph.py:540: remove_training_nodes (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.remove_training_nodes
2019-04-22 17:10:22.323616: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: AVX512F
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-04-22 17:10:22.345101: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2095090000 Hz
2019-04-22 17:10:22.360829: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x1d7bfdd0 executing computations on platform Host. Devices:
2019-04-22 17:10:22.360862: I tensorflow/compiler/xla/service/service.cc:169]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-22 17:10:22.367186: I tensorflow/core/common_runtime/process_util.cc:92] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
W0422 17:10:22.368036 140385778120448 deprecation.py:323] From tensorflow/tools/quantization/quantize_graph.py:406: quantize_v2 (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.quantize_v2` is deprecated, please use `tf.quantization.quantize` instead.

Traceback (most recent call last):
  File "tensorflow/tools/quantization/quantize_graph.py", line 1951, in <module>
    app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tensorflow/tools/quantization/quantize_graph.py", line 1937, in main
    output_graph = rewriter.rewrite(FLAGS.output_node_names.split(","))
  File "tensorflow/tools/quantization/quantize_graph.py", line 583, in rewrite
    self.output_graph)
  File "tensorflow/tools/quantization/quantize_graph.py", line 1733, in remove_redundant_quantization
    old_nodes_map = self.create_nodes_map(old_graph)
  File "tensorflow/tools/quantization/quantize_graph.py", line 506, in create_nodes_map
    raise ValueError("Duplicate node names detected.")
ValueError: Duplicate node names detected.
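(For reference, the duplicate names that trigger this error can be listed directly from the frozen graph with a short diagnostic script along these lines — a sketch using the TF 1.x API and the same graph path as the commands above:)

import collections
import tensorflow as tf

# Load the optimized frozen graph and count how often each node name appears.
graph_def = tf.GraphDef()
with tf.gfile.GFile("/workspace/quantization/optimized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

counts = collections.Counter(node.name for node in graph_def.node)
for name, count in sorted(counts.items()):
    if count > 1:
        print(name, count)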

tonyreina commented 5 years ago

Thanks for the report.

Could you try adding "merge_duplicate_nodes" to the transform_graph in the first step? I'm guessing that might correct it.

bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=/workspace/quantization/frozen_inference_graph.pb --out_graph=/workspace/quantization/optimized_graph.pb --inputs="input_1" --outputs="bboxes,scores,classes" --transforms="fold_batch_norms merge_duplicate_nodes"

Here's the information about all of the graph transformations you can do: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms#merge_duplicate_nodes

Best. Very respectfully, -Tony

nammbash commented 5 years ago

Yes, add the "merge_duplicate_nodes" option and give it another try.

But here are some questions: after transform_graph, did you run the FP32 graph? Did it work? After transform_graph, did you actually see the batch norms get folded? I do not see this, hence the question. quantize_graph.py is not very robust, but you can still try to quantize specific portions of the graph and see if that helps.
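(For a quick FP32 sanity check after transform_graph, something along these lines can be used — a sketch that assumes the tensor names input_1:0, bboxes:0, scores:0, classes:0 from the commands above and feeds a dummy image batch:)

import numpy as np
import tensorflow as tf

# Import the transformed FP32 graph and run one dummy inference to confirm it still loads and executes.
graph_def = tf.GraphDef()
with tf.gfile.GFile("/workspace/quantization/optimized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    dummy = np.random.rand(1, 800, 960, 3).astype(np.float32)  # assumed input size
    bboxes, scores, classes = sess.run(
        ["bboxes:0", "scores:0", "classes:0"],
        feed_dict={"input_1:0": dummy})
    print(bboxes.shape, scores.shape, classes.shape)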

felipheggaliza commented 5 years ago

Hi @tonyreina, thank you for your support!

I was able to generate the optimized_graph.pb by following your instructions:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=/workspace/quantization/frozen_inference_graph.pb --out_graph=/workspace/quantization/optimized_graph.pb --inputs="input_1" --outputs="bboxes,scores,classes" --transforms="fold_batch_norms merge_duplicate_nodes"
2019-04-23 11:49:48.351521: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying fold_batch_norms
2019-04-23 11:49:48.666155: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying merge_duplicate_nodes

But when I run:

python tensorflow/tools/quantization/quantize_graph.py --input=/workspace/quantization/optimized_graph.pb --output=/workspace/quantization/quantized_dynamic_range_graph.pb --output_node_names="bboxes,scores,classes" --mode=eightbit --intel_cpu_eightbitize=True

I got the following error:

Traceback (most recent call last):
  File "tensorflow/tools/quantization/quantize_graph.py", line 1951, in <module>
    app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tensorflow/tools/quantization/quantize_graph.py", line 1908, in main
    importer.import_graph_def(tf_graph, input_map={}, name="")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 431, in import_graph_def
    raise ValueError(str(e))
ValueError: Node 'bn3a_branch1/beta/read' expects to be colocated with unknown node 'bn3a_branch1/beta'
felipheggaliza commented 5 years ago

Hi @nammbash, thank you for your support!

When I use: --transforms="fold_batch_norms merge_duplicate_nodes" in the transform_graph script, the generated FP32 graph throws this error when loading:

ValueError: Node 'bn3a_branch1/beta/read' expects to be colocated with unknown node 'bn3a_branch1/beta'

When I use --transforms="fold_batch_norms" in the transform_graph script, the generated FP32 graph works properly and I am able to use it for inference.
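(One workaround that is sometimes used for this kind of error — not verified on this model — is to strip the stale colocation constraints that merge_duplicate_nodes can leave behind, since they live in the "_class" node attribute, and then re-save the graph:)

import tensorflow as tf

# Remove colocation attributes that point at nodes which no longer exist after merging.
graph_def = tf.GraphDef()
with tf.gfile.GFile("/workspace/quantization/optimized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if "_class" in node.attr:
        del node.attr["_class"]

with tf.gfile.GFile("/workspace/quantization/optimized_graph_nocoloc.pb", "wb") as f:
    f.write(graph_def.SerializeToString())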

mdfaijul commented 5 years ago

@felipheggaliza would you mind sharing your FP32 graph? We can take a look. We are still in the development stage of the tools. Thanks for your feedback.

nammbash commented 5 years ago

@felipheggaliza Use the following command (with and without merge_duplicate_nodes) and you should have your optimized FP32 graph. Your other FP32 graphs have the following problems: batch norms were not folded, Identity nodes were not removed, and duplicates existed.

Hoping that helps.

bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=/workspace/quantization/frozen_inference_graph.pb --out_graph=/workspace/quantization/optimized_graph.pb --inputs="input_1" --outputs="bboxes,scores,classes" --transforms='remove_nodes(op=Identity, op=CheckNumerics, op=StopGradient) fold_old_batch_norms strip_unused_nodes merge_duplicate_nodes'

felipheggaliza commented 5 years ago

Hi @nammbash,

I was able to generate the optimized_graph.pb using your instructions:

root@569f1de3e047:/workspace/tensorflow# bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=/workspace/quantization/frozen_inference_graph.pb --out_graph=/workspace/quantization/optimized_graph.pb --inputs="input_1" --outputs="bboxes,scores,classes" --transforms='remove_nodes(op=Identity, op=CheckNumerics, op=StopGradient) fold_old_batch_norms strip_unused_nodes merge_duplicate_nodes'
2019-04-24 15:58:49.959464: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying remove_nodes
2019-04-24 15:58:50.440454: I tensorflow/tools/graph_transforms/remove_nodes.cc:100] Skipping replacement for bboxes
2019-04-24 15:58:50.440549: I tensorflow/tools/graph_transforms/remove_nodes.cc:100] Skipping replacement for scores
2019-04-24 15:58:50.440598: I tensorflow/tools/graph_transforms/remove_nodes.cc:100] Skipping replacement for classes
2019-04-24 15:58:50.608337: I tensorflow/tools/graph_transforms/remove_nodes.cc:100] Skipping replacement for bboxes
2019-04-24 15:58:50.608426: I tensorflow/tools/graph_transforms/remove_nodes.cc:100] Skipping replacement for scores
2019-04-24 15:58:50.608469: I tensorflow/tools/graph_transforms/remove_nodes.cc:100] Skipping replacement for classes
2019-04-24 15:58:51.037415: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying fold_old_batch_norms
2019-04-24 15:58:52.291227: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying strip_unused_nodes
2019-04-24 15:58:52.416275: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying merge_duplicate_nodes

But I got the following error again (ValueError: Duplicate node names detected.) when trying to run the quantize_graph.py script:

root@569f1de3e047:/workspace/tensorflow# python tensorflow/tools/quantization/quantize_graph.py \
>      --input=/workspace/quantization/optimized_graph.pb \
>      --output=/workspace/quantization/quantized_dynamic_range_graph.pb \
>      --output_node_names="bboxes,scores,classes" \
>      --mode=eightbit \
>      --intel_cpu_eightbitize=True
W0424 15:59:27.347879 139780600755968 deprecation.py:323] From tensorflow/tools/quantization/quantize_graph.py:540: remove_training_nodes (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.remove_training_nodes
2019-04-24 15:59:27.397977: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX512F
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-04-24 15:59:27.418221: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2095090000 Hz
2019-04-24 15:59:27.433305: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x195a7ab0 executing computations on platform Host. Devices:
2019-04-24 15:59:27.433341: I tensorflow/compiler/xla/service/service.cc:169]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-24 15:59:27.439504: I tensorflow/core/common_runtime/process_util.cc:92] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
W0424 15:59:27.440515 139780600755968 deprecation.py:323] From tensorflow/tools/quantization/quantize_graph.py:406: quantize_v2 (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.quantize_v2` is deprecated, please use `tf.quantization.quantize` instead.

Traceback (most recent call last):
  File "tensorflow/tools/quantization/quantize_graph.py", line 1951, in <module>
    app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tensorflow/tools/quantization/quantize_graph.py", line 1937, in main
    output_graph = rewriter.rewrite(FLAGS.output_node_names.split(","))
  File "tensorflow/tools/quantization/quantize_graph.py", line 583, in rewrite
    self.output_graph)
  File "tensorflow/tools/quantization/quantize_graph.py", line 1733, in remove_redundant_quantization
    old_nodes_map = self.create_nodes_map(old_graph)
  File "tensorflow/tools/quantization/quantize_graph.py", line 506, in create_nodes_map
    raise ValueError("Duplicate node names detected.")
ValueError: Duplicate node names detected.
felipheggaliza commented 5 years ago

Just as an update. I was able to generate the quantized_dynamic_range_graph.pb using an updated version of quantize_graph.py provided by @mdfaijul. The command I used was:

python /tmp/amin/intel-tools/tensorflow_quantization/quantization/quantize_graph.py --input fp32_retinanet_frozen_inference_graph.pb --output quantized_dynamic_range_graph.pb --output_node_names='bboxes,scores,classes' --mode=eightbit --intel_cpu_eightbitize=True --per_channel=True

I was able to generate the logged_quantized_graph.pb using the command:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph --in_graph=/workspace/quantization/quantized_dynamic_range_graph.pb --out_graph=/workspace/quantization/logged_quantized_graph.pb --transforms='insert_logging(op=RequantizationRange, show_name=true, message="__requant_min_max:")'
2019-04-30 17:53:13.204935: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying insert_logging

And according to the steps listed here: https://github.com/IntelAI/tools/blob/master/tensorflow_quantization/README.md#quantization-tools I am at step 6 and need to run inference on a subset of images in order to get the min and max log output.

Unfortunately, I am not able to run inference using the logged_quantized_graph.pb graph, because when I load the graph I get this error:

  File "/home/builder/feliphe/<>/retinanet_clx_experiment/retinanet.py", line 67, in _check_load_model
    tf.import_graph_def(od_graph_def, name='')
  File "/home/builder/anaconda3/envs/feliphe/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
 File "/home/builder/anaconda3/envs/feliphe/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 427, in import_graph_def
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'QuantizedConv2DPerChannel' in binary running on Sys1. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

It seems QuantizedConv2DPerChannel is not present in my current TensorFlow installation. I have tested it using TensorFlow 1.13.1-mkl_py36h27d456a_0 from the Anaconda channel and TensorFlow 2.0.0-alpha0 (pip install tensorflow==2.0.0-alpha0), and both resulted in the same error mentioned above.
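(To check whether a given op is registered in the installed build, the TF 1.x op registry can be inspected — a sketch:)

from tensorflow.python.framework import op_def_registry

# Which Quantized* conv ops does this TensorFlow build know about?
registered = op_def_registry.get_registered_ops()
print("QuantizedConv2DPerChannel" in registered)
print(sorted(name for name in registered if name.startswith("QuantizedConv2D")))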

I am looking forward to getting help and working with you until we get the quantization working.

Best Regards,

Feliphe Galiza

karthikvadla commented 5 years ago

@felipheggaliza how are you generating the calibration data using logged_quantized_graph.pb? Are you using https://github.com/IntelAI/models/tree/master/benchmarks to run inference?

@mdfaijul should we use your latest PR for this fix? Or do we already have an image with these fixes on intelai/models? Do you have any other model in mind which uses an image with the fixes?

@WafaaT / @dmsuehir have you seen this error before? Any docker image you suggest to use?

felipheggaliza commented 5 years ago

Hi @karthikvadla,

I am generating the calibration data with logged_quantized_graph.pb, using a subset of 875 images.

I've shared more information about this case in a separate email.

Best Regards,

Feliphe Galiza

felipheggaliza commented 5 years ago

Hi all,

First of all, I would like to thank you all for the help you have been giving me on this journey of enabling INT8 and VNNI inference with RetinaNet on a Cascade Lake server. Special thanks to @nammbash and @mdfaijul, who gave me most of the instructions and even a script which saved me from getting stuck in one of the steps.

That being said, I have good and bad news.

• The good news is that I was able to convert the FP32 graph to INT8 using the Intel AI Quantization tools (https://github.com/IntelAI/tools), and it also seems VNNI is enabled (I checked it using Intel XED).
• The bad news is that in my performance experiments, INT8 is about 1.7x slower than FP32.

One possible reason for these results is that something went wrong with the graph during the INT8 conversion.

Flags config.         FP32 (sec)   INT8 (sec)
Default parameters    0.8051       1.4189
Intel BKMs            0.2411       0.4262

See more details below:

CPU Info

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Stepping:              6
CPU MHz:               1032.937
CPU max MHz:           3700.0000
CPU min MHz:           1000.0000
BogoMIPS:              4191.58
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req

Installing Bazel and TensorFlow from source using MKL-DNN

wget https://github.com/bazelbuild/bazel/releases/download/0.25.2/bazel-0.25.2-installer-linux-x86_64.sh
chmod +x bazel-0.25.2-installer-linux-x86_64.sh
./bazel-0.25.2-installer-linux-x86_64.sh --user
source $HOME/.bazel/bin/bazel-complete.bash
git clone https://github.com/tensorflow/tensorflow
cd tensorflow/
source activate <my_env> # entering my anaconda environment
./configure
bazel build --config=mkl -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mavx512f --copt=-mavx512pf --copt=-mavx512cd --copt=-mavx512er --verbose_failures //tensorflow/tools/pip_package:build_pip_package
mkdir $HOME/intel-tf-builds
bazel-bin/tensorflow/tools/pip_package/build_pip_package $HOME/intel-tf-builds
cd $HOME/intel-tf-builds
pip install --upgrade --user tensorflow-1.13.1-cp36-cp36m-linux_x86_64.whl

Some outputs related to the TensorFlow installation:

python -V
Python 3.6.5 :: Intel Corporation
>>> import tensorflow as tf
>>> tf.__version__
'1.13.1'
 python -c "import tensorflow; print(tensorflow.pywrap_tensorflow.IsMklEnabled())"
True

Running Intel AI Quantization tool (Attempting int8 Quantization)

# Exporting some variables which will be used in the next commands
export PB_GRAPH_DIR=<path_to_dir_with_pb_file>

Steps for FP32 Optimized Frozen Graph

  1. Find out the possible input and output node names of the graph
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph \
     --in_graph=$PB_GRAPH_DIR/fp32_inference_graph.pb \
     --print_structure=false

Output:

INFO: Elapsed time: 8.548s, Critical Path: 2.72s
INFO: 2 processes: 2 local.
INFO: Build completed successfully, 7 total actions
Found 1 possible inputs: (name=input_1, type=float(1), shape=[?,?,?,3])
No variables spotted.
Found 3 possible outputs: (name=bboxes, op=Identity) (name=scores, op=Identity) (name=classes, op=Identity)
Found 36529264 (36.53M) const parameters, 0 (0) variable parameters, and 179 control_edges
Op types used: 899 Const, 313 Identity, 111 Conv2D, 108 StridedSlice, 90 Relu, 66 Pack, 58 BiasAdd, 55 Reshape, 53 FusedBatchNorm, 42 Add, 41 Mul, 36 Shape, 20 Range, 19 GatherV2, 17 GatherNd, 17 Pad, 13 Sub, 13 Fill, 12 Enter, 12 Cast, 10 Size, 8 NonMaxSuppressionV2, 8 Greater, 8 Where, 5 TensorArrayV3, 5 ExpandDims, 5 Sigmoid, 5 Rank, 5 Transpose, 5 Tile, 4 ConcatV2, 4 Switch, 4 ClipByValue, 4 NextIteration, 4 Merge, 3 TensorArrayGatherV3, 3 TensorArraySizeV3, 3 TensorArrayWriteV3, 3 Exit, 3 PadV2, 2 ResizeNearestNeighbor, 2 TensorArrayReadV3, 2 TensorArrayScatterV3, 1 Placeholder, 1 Minimum, 1 Maximum, 1 MaxPool, 1 TopKV2, 1 LoopCond, 1 Less

Therefore:

  2. Optimize the model frozen graph:
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
     --in_graph=$PB_GRAPH_DIR/fp32_inference_graph.pb \
     --out_graph=$PB_GRAPH_DIR/fp32_optimized_graph.pb \
     --inputs=input_1 \
     --outputs=bboxes,scores,classes \
     --transforms='fold_batch_norms'

Output:

INFO: Elapsed time: 1.236s, Critical Path: 0.56s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
2019-05-23 12:57:28.241998: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying fold_batch_norms
  3. Run inference using the optimized graph fp32_optimized_graph.pb and check the model accuracy.
python testing_retinanet.py
Warming-up
image_batch.shape before RetinaNet detector: (1, 800, 960, 3)
Raw input image shape: (1080, 1920, 3)
image_batch.shape before RetinaNet detector: (1, 800, 960, 3)
Predicted in (sec.):    1.25

Raw input image shape: (1080, 1920, 3)
image_batch.shape before RetinaNet detector: (1, 800, 960, 3)
Predicted in (sec.):    0.73

Raw input image shape: (1080, 1920, 3)
image_batch.shape before RetinaNet detector: (1, 800, 960, 3)
Predicted in (sec.):    0.84

Raw input image shape: (1080, 1920, 3)
image_batch.shape before RetinaNet detector: (1, 800, 960, 3)
Predicted in (sec.):    0.94

The above output shows the FP32 inference is working properly.

Steps for Int8 Quantization

  4. Quantize the optimized graph (from step 2) to lower precision using the output node names (from step 1).

Since I was having issues with the quantize_graph.py from master, FAIJUL (@mdfaijul) helped me with another version of quantize_graph.py in which he fixed some things. It seems this version of the code can also be found at https://github.com/NervanaSystems/tools/blob/amin/perchannel/tensorflow_quantization/quantization/quantize_graph.py (the update was made around April 29th, 2019).

python tools-amin-perchannel/tensorflow_quantization/quantization/quantize_graph.py \
--input=$PB_GRAPH_DIR/fp32_optimized_graph.pb \
--output=$PB_GRAPH_DIR/int8_quantized_dynamic_range_graph.pb \
--output_node_names=bboxes,scores,classes \
--mode=eightbit \
--intel_cpu_eightbitize=True

The int8_quantized_dynamic_range_graph.pb is generated successfully.

  5. Convert the quantized graph from dynamic to static re-quantization range. The following steps are to freeze the re-quantization range (also known as calibration):
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
--in_graph=$PB_GRAPH_DIR/int8_quantized_dynamic_range_graph.pb \
--out_graph=$PB_GRAPH_DIR/int8_logged_quantized_graph.pb \
--transforms='insert_logging(op=RequantizationRange, show_name=true, message="__requant_min_max:")'

The int8_logged_quantized_graph.pb is generated successfully.
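(For reference, the inference.py used for calibration is roughly the following — a sketch; load_calibration_images is a placeholder for the real image loading/preprocessing, and the min/max values are printed to stderr by the inserted logging ops, hence the 2> redirection below:)

import numpy as np
import tensorflow as tf

def load_calibration_images():
    # Placeholder: replace with real loading/preprocessing of the calibration images.
    yield np.random.rand(1, 800, 960, 3).astype(np.float32)

# Run the logged INT8 graph so the inserted Print nodes emit "__requant_min_max:" lines on stderr.
graph_def = tf.GraphDef()
with tf.gfile.GFile("int8_logged_quantized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    for image in load_calibration_images():
        sess.run(["bboxes:0", "scores:0", "classes:0"],
                 feed_dict={"input_1:0": image})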

$ python inference.py 2> min_max_log.txt

Output:

$ head min_max_log.txt
;res2a_branch2a/convolution_eightbit_requant_range__print__;__requant_min_max:[-8.52348137][10.9996948]
;res2a_branch2c/convolution_eightbit_requant_range__print__;__requant_min_max:[-2.56934357][1.80438507]
;res2a_branch1/convolution_eightbit_requant_range__print__;__requant_min_max:[-9.26658058][14.7511406]
;res2b_branch2a/convolution_eightbit_requant_range__print__;__requant_min_max:[-2.35450339][1.34344149]
;res2b_branch2c/convolution_eightbit_requant_range__print__;__requant_min_max:[-2.11334538][1.54288733]
;res2c_branch2a/convolution_eightbit_requant_range__print__;__requant_min_max:[-4.84791231][4.80448294]
;res2c_branch2c/convolution_eightbit_requant_range__print__;__requant_min_max:[-1.70012259][1.6017065]
;res3a_branch2a/convolution_eightbit_requant_range__print__;__requant_min_max:[-4.49989033][2.14688396]
;res3a_branch1/convolution_eightbit_requant_range__print__;__requant_min_max:[-4.16272354][3.04384089]
;res3a_branch2c/convolution_eightbit_requant_range__print__;__requant_min_max:[-3.66644692][4.87620258]
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
--in_graph=$PB_GRAPH_DIR/int8_quantized_dynamic_range_graph.pb \
--out_graph=$PB_GRAPH_DIR/int8_freezed_range_graph.pb \
--transforms='freeze_requantization_ranges(min_max_log_file="<path_to_file>/min_max_log.txt")'

Output:

2019-05-24 06:31:38.146691: I tensorflow/tools/graph_transforms/transform_graph.cc:317] Applying freeze_requantization_ranges
  6. Optimize the quantized graph if needed.

I did not perform this step since I am still not sure what further optimizations could be applied.

Finally, verifying the quantized model performance and accuracy:

Typically, the accuracy target is the optimized FP32 model accuracy values. The quantized Int8 graph accuracy should not drop more than ~0.5-1%.

Performance Experiments

python testing_retinanet.py --model_path=models/fp32_optimized_graph.pb --dataset_dir=dataset

Output:

Intra_op_paralellism_threads:
Inter_op_paralellism_threads:
Warming-up...
Warm-up done!
Starting inferences.
Elapsed time for one inference: 0.9161639213562012 sec
Elapsed time for one inference: 1.0868263244628906 sec
Elapsed time for one inference: 0.772759199142456 sec
Elapsed time for one inference: 0.6836576461791992 sec
Elapsed time for one inference: 0.7702972888946533 sec
Elapsed time for one inference: 0.8953936100006104 sec
Elapsed time for one inference: 0.8389678001403809 sec
Elapsed time for one inference: 0.6823713779449463 sec
Elapsed time for one inference: 0.6510412693023682 sec
Elapsed time for one inference: 0.7451212406158447 sec
Elapsed time for one inference: 0.6484372615814209 sec
Elapsed time for one inference: 0.8537521362304688 sec
Elapsed time for one inference: 0.6685740947723389 sec
Elapsed time for one inference: 0.7109897136688232 sec
Elapsed time for one inference: 1.0561983585357666 sec
Elapsed time for one inference: 0.6658413410186768 sec
Elapsed time for one inference: 0.9002900123596191 sec
Elapsed time for one inference: 0.6972308158874512 sec
Elapsed time for one inference: 0.6385571956634521 sec
Elapsed time for one inference: 1.000755786895752 sec
Elapsed time for one inference: 0.6200711727142334 sec
Elapsed time for one inference: 1.2095487117767334 sec
Average Inference time: 0.8051293763247404 sec
python testing_retinanet.py --model_path=models/int8_freezed_range_graph.pb --dataset_dir=dataset

Output:

Intra_op_paralellism_threads:
Inter_op_paralellism_threads:
Warming-up...
Warm-up done!
Starting inferences.
Elapsed time for one inference: 1.2494022846221924 sec
Elapsed time for one inference: 1.3705193996429443 sec
Elapsed time for one inference: 1.5473253726959229 sec
Elapsed time for one inference: 1.250992774963379 sec
Elapsed time for one inference: 1.4270391464233398 sec
Elapsed time for one inference: 1.5693747997283936 sec
Elapsed time for one inference: 1.5273106098175049 sec
Elapsed time for one inference: 1.3177094459533691 sec
Elapsed time for one inference: 1.5213563442230225 sec
Elapsed time for one inference: 1.744492530822754 sec
Elapsed time for one inference: 1.7106380462646484 sec
Elapsed time for one inference: 1.330054521560669 sec
Elapsed time for one inference: 1.387904405593872 sec
Elapsed time for one inference: 1.4401581287384033 sec
Elapsed time for one inference: 1.10369873046875 sec
Elapsed time for one inference: 1.448305368423462 sec
Elapsed time for one inference: 1.5787458419799805 sec
Elapsed time for one inference: 1.286379337310791 sec
Elapsed time for one inference: 1.2260711193084717 sec
Elapsed time for one inference: 1.4659216403961182 sec
Elapsed time for one inference: 1.2928223609924316 sec
Elapsed time for one inference: 1.4199872016906738 sec
Average Inference time: 1.4189186096191406 sec

Reference: https://www.tensorflow.org/guide/performance/overview#tuning_mkl_for_the_best_performance

#!/bin/bash
export OMP_NUM_THREADS=48
export MKL_NUM_THREADS=48
export KMP_BLOCKTIME=0
export KMP_SETTINGS=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
export MKLDNN_VERBOSE=0

python testing_retinanet.py --model_path=models/fp32_optimized_graph.pb --dataset_dir=dataset --intra_op=$OMP_NUM_THREADS --inter_op=2

Output:

Intra_op_paralellism_threads: 48
Inter_op_paralellism_threads: 2
Warming-up...
Warm-up done!
Starting inferences.
Elapsed time for one inference: 0.23707938194274902 sec
Elapsed time for one inference: 0.23137187957763672 sec
Elapsed time for one inference: 0.2373485565185547 sec
Elapsed time for one inference: 0.22789597511291504 sec
Elapsed time for one inference: 0.23593854904174805 sec
Elapsed time for one inference: 0.3500180244445801 sec
Elapsed time for one inference: 0.23568296432495117 sec
Elapsed time for one inference: 0.2248241901397705 sec
Elapsed time for one inference: 0.23473024368286133 sec
Elapsed time for one inference: 0.23890209197998047 sec
Elapsed time for one inference: 0.2355351448059082 sec
Elapsed time for one inference: 0.23885583877563477 sec
Elapsed time for one inference: 0.23862934112548828 sec
Elapsed time for one inference: 0.23747849464416504 sec
Elapsed time for one inference: 0.23449015617370605 sec
Elapsed time for one inference: 0.2413957118988037 sec
Elapsed time for one inference: 0.23110103607177734 sec
Elapsed time for one inference: 0.23862242698669434 sec
Elapsed time for one inference: 0.23967623710632324 sec
Elapsed time for one inference: 0.23865604400634766 sec
Elapsed time for one inference: 0.2381117343902588 sec
Elapsed time for one inference: 0.23932194709777832 sec
Average Inference time: 0.24116663499311966 sec
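(For reference, the --intra_op/--inter_op flags of testing_retinanet.py presumably end up in the TensorFlow session config roughly like this — a sketch, since the script itself is not shown here:)

import tensorflow as tf

# Thread-pool sizes as typically set for MKL-DNN inference on a 2-socket machine.
config = tf.ConfigProto()
config.intra_op_parallelism_threads = 48  # matches OMP_NUM_THREADS below
config.inter_op_parallelism_threads = 2
sess = tf.Session(config=config)  # graph import omitted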


#!/bin/bash
export OMP_NUM_THREADS=48
export MKL_NUM_THREADS=48
export KMP_BLOCKTIME=0
export KMP_SETTINGS=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
export MKLDNN_VERBOSE=0

python testing_retinanet.py --model_path=models/int8_freezed_range_graph.pb --dataset_dir=dataset --intra_op=$OMP_NUM_THREADS --inter_op=2
Intra_op_paralellism_threads: 48
Inter_op_paralellism_threads: 2
Warming-up...
Warm-up done!
Starting inferences.
Elapsed time for one inference: 0.4379603862762451 sec
Elapsed time for one inference: 0.42247939109802246 sec
Elapsed time for one inference: 0.4269547462463379 sec
Elapsed time for one inference: 0.4161355495452881 sec
Elapsed time for one inference: 0.43010449409484863 sec
Elapsed time for one inference: 0.4226717948913574 sec
Elapsed time for one inference: 0.4171106815338135 sec
Elapsed time for one inference: 0.4259819984436035 sec
Elapsed time for one inference: 0.4207015037536621 sec
Elapsed time for one inference: 0.42037010192871094 sec
Elapsed time for one inference: 0.4402291774749756 sec
Elapsed time for one inference: 0.432023286819458 sec
Elapsed time for one inference: 0.4221925735473633 sec
Elapsed time for one inference: 0.41687941551208496 sec
Elapsed time for one inference: 0.4418454170227051 sec
Elapsed time for one inference: 0.4108130931854248 sec
Elapsed time for one inference: 0.4242119789123535 sec
Elapsed time for one inference: 0.4561805725097656 sec
Elapsed time for one inference: 0.4332873821258545 sec
Elapsed time for one inference: 0.4249701499938965 sec
Elapsed time for one inference: 0.42217087745666504 sec
Elapsed time for one inference: 0.4132516384124756 sec
Average Inference time: 0.42629664594476874 sec

Is VNNI being used?

Checking it using MKLDNN_JIT_DUMP.

#!/bin/bash
export OMP_NUM_THREADS=48
export MKL_NUM_THREADS=48
export KMP_BLOCKTIME=0
export KMP_SETTINGS=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
export MKLDNN_VERBOSE=1
export MKLDNN_JIT_DUMP=1

python testing_retinanet.py --model_path=models/int8_freezed_range_graph.pb --dataset_dir=dataset --intra_op=$OMP_NUM_THREADS --inter_op=2

Running this produces a bunch of dump files, e.g. mkldnn_dump_jit_uni_reorder_kernel_f32.1250.bin, mkldnn_dump_jit_uni_reorder_kernel_f32.987.bin, etc.

Reference: https://intel.github.io/mkl-dnn/perf_profile.html
Reference: https://github.com/intelxed/xed

xed -64 -ir mkldnn_dump_jit_avx512_core_x8s8s32x_1x1_conv_fwd_ker_t.9.bin | grep vpdpbusd

Output:

...
XDIS 1c64: AVX512    AVX512EVEX 62F2054050D6             vpdpbusd zmm2, zmm31, zmm6
XDIS 1c71: AVX512    AVX512EVEX 62F2054050DE             vpdpbusd zmm3, zmm31, zmm6
XDIS 1c7e: AVX512    AVX512EVEX 62F2054050E6             vpdpbusd zmm4, zmm31, zmm6
XDIS 1c8b: AVX512    AVX512EVEX 62F2054050EE             vpdpbusd zmm5, zmm31, zmm6
XDIS 1c9f: AVX512    AVX512EVEX 62F2054050C6             vpdpbusd zmm0, zmm31, zmm6
XDIS 1cac: AVX512    AVX512EVEX 62F2054050CE             vpdpbusd zmm1, zmm31, zmm6
XDIS 1cb9: AVX512    AVX512EVEX 62F2054050D6             vpdpbusd zmm2, zmm31, zmm6
XDIS 1cc6: AVX512    AVX512EVEX 62F2054050DE             vpdpbusd zmm3, zmm31, zmm6
XDIS 1cd3: AVX512    AVX512EVEX 62F2054050E6             vpdpbusd zmm4, zmm31, zmm6
XDIS 1ce0: AVX512    AVX512EVEX 62F2054050EE             vpdpbusd zmm5, zmm31, zmm6
...
xed -64 -ir mkldnn_dump_jit_avx512_core_x8s8s32x_conv_fwd_ker_t.391.bin | grep vpdpbusd

Output:

XDIS b5e: AVX512    AVX512EVEX 62123D4050E7             vpdpbusd zmm12, zmm24, zmm31
XDIS b64: AVX512    AVX512EVEX 6212354050EF             vpdpbusd zmm13, zmm25, zmm31
XDIS b6a: AVX512    AVX512EVEX 62122D4050F7             vpdpbusd zmm14, zmm26, zmm31
XDIS b70: AVX512    AVX512EVEX 6212254050FF             vpdpbusd zmm15, zmm27, zmm31
XDIS b76: AVX512    AVX512EVEX 62821D4050C7             vpdpbusd zmm16, zmm28, zmm31
XDIS b7c: AVX512    AVX512EVEX 6282154050CF             vpdpbusd zmm17, zmm29, zmm31
XDIS b8d: AVX512    AVX512EVEX 62823D4050D7             vpdpbusd zmm18, zmm24, zmm31
XDIS b93: AVX512    AVX512EVEX 6282354050DF             vpdpbusd zmm19, zmm25, zmm31
XDIS b99: AVX512    AVX512EVEX 62822D4050E7             vpdpbusd zmm20, zmm26, zmm31
XDIS b9f: AVX512    AVX512EVEX 6282254050EF             vpdpbusd zmm21, zmm27, zmm31
XDIS ba5: AVX512    AVX512EVEX 62821D4050F7             vpdpbusd zmm22, zmm28, zmm31
XDIS bab: AVX512    AVX512EVEX 6282154050FF             vpdpbusd zmm23, zmm29, zmm31

It seems VNNI is enabled for at least some operations.
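(To scan all of the generated dump files at once instead of disassembling them one by one, a small helper like this can be used — a sketch that assumes xed is on the PATH:)

import glob
import subprocess

# Count VNNI (vpdpbusd) instructions in every MKL-DNN JIT dump in the current directory.
for path in sorted(glob.glob("mkldnn_dump_jit_*.bin")):
    result = subprocess.run(["xed", "-64", "-ir", path],
                            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    hits = result.stdout.decode(errors="ignore").count("vpdpbusd")
    if hits:
        print(path, hits)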

Do you have suggestions on how I could do further investigation in order to make INT8 inference faster than FP32 for this RetinaNet neural network?

Thanks in advance!

Regards,

Feliphe Galiza

nammbash commented 5 years ago

@felipheggaliza Pleasure to be of help.

@mdfaijul is in the process of updating and cleaning up quantize_graph.py, and I am in the process of integrating the regex feature into this quantize_graph.py. After this integration, you will be able to exclude specific nodes, or any nodes matching a regex.

Nevertheless, here is the reason: not all ops present in all models are quantize- and quantize-fuse-ready. So if there is an op which is quantized and the following op cannot be, we need to insert a dequantize operation, which at runtime creates a lot of performance loss.

Example: for Faster R-CNN with FPN, there are three networks: ResNet50 + RPN + FPN. For quantization I exclude the RPN and FPN portions so that it is faster, and I get 1.27x. For your model, you might have to do something like that for now, i.e. exclude certain nodes from quantization.
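(As a starting point for picking what to exclude, the node names and op types in, say, the FPN/head portion of the graph can be listed with a regex — a sketch; the pattern below is only an example and would need to match the actual RetinaNet node naming:)

import re
import tensorflow as tf

# Print op type and name for every node whose name matches the pattern,
# e.g. to build a list of nodes to exclude from quantization.
graph_def = tf.GraphDef()
with tf.gfile.GFile("fp32_optimized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

pattern = re.compile(r"(P[3-7]|regression|classification)")  # example pattern only
for node in graph_def.node:
    if pattern.search(node.name):
        print(node.op, node.name)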

Hope this helps.

Now, here is a follow-up question: as a user of the quantization tool, how would you want this kind of information presented so that it is intuitive to the user? Suggestions are welcome.

felipheggaliza commented 5 years ago

Hi @nammbash,

I am excited about the release of the new features you and @mdfaijul are working on!

Thank you for the explanation; now the possible reason is clear to me. I will try to study the graph and see if I can use the same approach you used on Faster R-CNN with FPN. The challenge for me is that I don't know exactly which operations are optimized, and I also don't have deep knowledge of all the operations present in RetinaNet; the graph is very big, and it will probably take a lot of time until I figure out which operations I have to exclude from quantization.

Do you know where I can get a list of all operations which are already quantize- and quantize-fuse-ready?

Regarding your follow-up question, I would say that users are interested in answering questions like:

We could try answering these questions by providing a tool which receives an FP32 model as input and outputs the answers to the above questions. The output format could be stdout, a txt file, or an HTML file showing it as a graph.

Hope my suggestion is useful, somehow.

Regards,

Feliphe Galiza

nammbash commented 5 years ago

Updated quantize_graph.py: https://github.com/NervanaSystems/tools/tree/niroop/perchannel_padfusion_fasterrcnnfpn
