neurospin / HSF

Hippocampal Segmentation Factory (HSF). A one-liner Deep Learning tool to segment raw T2w (and T1w) MRIs into hippocampal subfields in a few seconds. Trained on 700+ manually segmented hippocampi.
https://hsf.rtfd.io/

Sparse-quantized model runs without VNNI acceleration #1

Open ylep opened 11 months ago

ylep commented 11 months ago

Describe the bug

Hi Dr @clementpoiret! Now that you have graduated :tada: here is a technical issue to keep you busy :wink:

On a workstation with AVX512 and VNNI CPU capabilities, I am getting the following message:

DeepSparse Optimization Status (minimal: AVX2 | partial: AVX512 | full: AVX512 VNNI): full
[nm_ort 7fda254c7280 >WARN<  is_supported_graph src/onnxruntime_neuralmagic/supported/ops.cc:134] Warning: Optimized runtime disabled - Detected dynamic input input dim 2. Set inputs to static shapes to enable optimal performance.

The performance is indeed worse than with the non-sparse model (although I am not sure how CPU time is counted here with respect to HyperThreading):
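As a side note, the warning itself suggests the fix: export the graph with static input shapes. A minimal sketch of how the dynamic dimension could be pinned with the `onnx` package (the file names and the fixed size are assumptions, not HSF's actual values):

```python
import onnx

# Hypothetical path to the sparse-quantized model shipped with HSF
model = onnx.load("hippocampus_sparse.onnx")

# Pin every named (dynamic) dimension of the graph inputs to a fixed value
for inp in model.graph.input:
    for dim in inp.type.tensor_type.shape.dim:
        if dim.HasField("dim_param"):
            dim.ClearField("dim_param")
            dim.dim_value = 16  # assumed size for the dynamic axis

# Re-run shape inference so downstream nodes see static shapes too
model = onnx.shape_inference.infer_shapes(model)
onnx.checker.check_model(model)
onnx.save(model, "hippocampus_sparse_static.onnx")
```

Of course, this only helps if the inference code always feeds crops of that exact size.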

Environment

clementpoiret commented 11 months ago

Dear Dr. Leprince, it is, I believe, linked to: https://github.com/neuralmagic/sparseml/issues/733

I'll have to check whether they added support for TConv. After that, I'll check if I can update the training code (and publish it on GitHub) 👍

ylep commented 11 months ago

Ohhh so this is a duplicate of https://github.com/clementpoiret/HSF/issues/22, silly me :frowning_face:. Feel free to close this issue, or the previous one, so that we have a single place for tracking the progress.

Anyway, thanks for the reply! In the meantime, I will deploy the non-sparse models as the default in NeuroSpin.

clementpoiret commented 11 months ago

Np :) It's always a pleasure to read a message from Dr. Leprince 😁

Anyway, in all apps, I think sparse/optimized networks should always be optional, as they rely on very recent hardware that most users do not have...
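A rough sketch of what that fallback could look like, using the `py-cpuinfo` package to inspect CPU flags at runtime (the model file names are made up for the example, and flag names can differ across platforms):

```python
import cpuinfo  # pip install py-cpuinfo

def pick_model() -> str:
    """Choose the model variant this CPU can actually accelerate."""
    flags = set(cpuinfo.get_cpu_info().get("flags", []))
    if "avx512_vnni" in flags:
        return "hippocampus_sparse_quantized.onnx"  # full DeepSparse speed-up
    if "avx512f" in flags:
        return "hippocampus_sparse.onnx"            # partial acceleration
    return "hippocampus_dense.onnx"                 # safe default (AVX2 or older)
```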

clementpoiret commented 10 months ago

A little update on the issue. I still have to test it, but I made an easy way to do Quantization-Aware Training and Neural Pruning using Intel(R) Neural Compressor. This should work out of the box:

https://github.com/clementpoiret/lightning-nc/
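For reference, a rough sketch of what the Quantization-Aware Training setup looks like with the Intel Neural Compressor 2.x API (the tiny stand-in network and the default config are illustrative only; the actual lightning-nc code wires this into PyTorch Lightning):

```python
import torch
from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.training import prepare_compression

# Stand-in for the real segmentation network
model = torch.nn.Sequential(torch.nn.Conv3d(1, 8, 3), torch.nn.ReLU())

conf = QuantizationAwareTrainingConfig()
compression_manager = prepare_compression(model, conf)
compression_manager.callbacks.on_train_begin()
model = compression_manager.model

# ... run the usual training loop here: forward/backward passes go
# through fake-quantized ops so the weights adapt to int8 ...

compression_manager.callbacks.on_train_end()
```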

clementpoiret commented 10 months ago

Also, to quote ONNXRuntime:

The quantized values are 8 bits wide and can be either signed (int8) or unsigned (uint8). We can choose the signedness of the activations and the weights separately, so the data format can be (activations: uint8, weights: uint8), (activations: uint8, weights: int8), etc. Let’s use U8U8 as a shorthand for (activations: uint8, weights: uint8), U8S8 for (activations: uint8, weights: int8), and similarly S8U8 and S8S8 for the remaining two formats.

ONNX Runtime quantization on CPU can run U8U8, U8S8 and S8S8. S8S8 with QDQ is the default setting and balances performance and accuracy. It should be the first choice. Only in cases that the accuracy drops a lot, you can try U8U8. Note that S8S8 with QOperator will be slow on x86-64 CPUs and should be avoided in general. ONNX Runtime quantization on GPU only supports S8S8.

WHEN AND WHY DO I NEED TO TRY U8U8? On x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a big issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try reduce_range or the U8U8 format which doesn’t have saturation issues.

There is no such issue on other CPU architectures (x64 with VNNI and ARM).
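In code, the format is selected through the `activation_type` / `weight_type` arguments of ONNX Runtime's `quantize_static`. A minimal sketch (the input name, shapes, and random calibration data are placeholders; real code would feed actual MRI crops):

```python
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Placeholder calibration data ("input" and its shape are assumptions)."""
    def __init__(self, n: int = 8):
        self._it = iter(
            {"input": np.random.rand(1, 1, 16, 16, 16).astype(np.float32)}
            for _ in range(n)
        )

    def get_next(self):
        return next(self._it, None)

# S8S8 with QDQ: the recommended default quoted above
quantize_static(
    "model.onnx",
    "model_s8s8.onnx",
    RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)

# U8U8 fallback if S8S8 accuracy drops (VPMADDUBSW saturation);
# `reduce_range=True` is the other option mentioned in the docs.
quantize_static(
    "model.onnx",
    "model_u8u8.onnx",
    RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
)
```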