ylep opened this issue 11 months ago
Dear Dr. Leprince, It is, I believe, linked to: https://github.com/neuralmagic/sparseml/issues/733
I'll have to check whether they added support for TConv. After that, I'll check if I can update the training code (and publish it on GitHub) 👍
Ohhh so this is a duplicate of https://github.com/clementpoiret/HSF/issues/22, silly me :frowning_face:. Feel free to close this issue, or the previous one, so that we have a single place for tracking the progress.
Anyway, thanks for the reply! In the meantime, I will deploy the non-sparse models as the default in NeuroSpin.
Np :) It's always a pleasure to read a message from Dr. Leprince 😁
Anyway, in all apps I think sparse/optimized networks should always be optional, as they rely on very recent hardware that most users do not have...
A little update on the issue. I still have to test it, but I made an easy way to do Quantization-Aware Training and Neural Pruning using Intel(R) Neural Compressor. This should work out of the box:
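A minimal sketch of that workflow, assuming the Intel Neural Compressor 2.x training API (`model`, the optimizer, and the training loop are placeholders to be replaced by the actual HSF training code):

```python
import copy

import torch
from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.training import prepare_compression

# Placeholder model; substitute the actual segmentation network.
model = torch.nn.Sequential(torch.nn.Conv3d(1, 8, 3), torch.nn.ReLU())

# Wrap the model for Quantization-Aware Training with fake-quant ops.
conf = QuantizationAwareTrainingConfig()
compression_manager = prepare_compression(copy.deepcopy(model), conf)

compression_manager.callbacks.on_train_begin()
model = compression_manager.model
# ... regular training loop on `model` goes here ...
compression_manager.callbacks.on_train_end()
```

This is a configuration sketch, not a tested drop-in: pruning would additionally need a `WeightPruningConfig`, and the exact calibration/training settings depend on the HSF pipeline.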
Also, to quote the ONNX Runtime docs:

> The quantized values are 8 bits wide and can be either signed (int8) or unsigned (uint8). We can choose the signedness of the activations and the weights separately, so the data format can be (activations: uint8, weights: uint8), (activations: uint8, weights: int8), etc. Let's use U8U8 as a shorthand for (activations: uint8, weights: uint8), U8S8 for (activations: uint8, weights: int8), and similarly S8U8 and S8S8 for the remaining two formats.
>
> ONNX Runtime quantization on CPU can run U8U8, U8S8 and S8S8. S8S8 with QDQ is the default setting and balances performance and accuracy. It should be the first choice. Only in cases where the accuracy drops a lot, you can try U8U8. Note that S8S8 with QOperator will be slow on x86-64 CPUs and should be avoided in general. ONNX Runtime quantization on GPU only supports S8S8.
>
> **When and why do I need to try U8U8?** On x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a big issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try `reduce_range` or the U8U8 format which doesn't have saturation issues.
>
> There is no such issue on other CPU architectures (x64 with VNNI and ARM).
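To illustrate the saturation issue described in that quote, here is a small pure-Python simulation of one int16 lane of `VPMADDUBSW` (the helper name is mine; the instruction itself processes many lanes at once):

```python
def vpmaddubsw_pair(u0: int, u1: int, s0: int, s1: int) -> int:
    """Simulate one int16 lane of VPMADDUBSW: two u8*s8 products
    are summed with *signed 16-bit saturation* (clamped to int16)."""
    total = u0 * s0 + u1 * s1
    return max(-32768, min(32767, total))

# Worst case: 255*127 + 255*127 = 64770, which does not fit in int16,
# so the hardware clamps it to 32767 and the extra magnitude is lost.
exact = 255 * 127 + 255 * 127
saturated = vpmaddubsw_pair(255, 255, 127, 127)
print(exact, saturated)  # 64770 32767
```

This is exactly why `reduce_range` (which uses 7-bit weights, halving the worst-case product) or U8U8 can recover accuracy on AVX2/AVX512 machines without VNNI.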
**Describe the bug**
Hi Dr @clementpoiret! Now that you have graduated :tada: here is a technical issue to keep you busy :wink:
On a workstation with AVX512 and VNNI CPU capabilities, I am getting the following message:
The performance is indeed worse than the non-sparse model (although I am not sure how it is counting CPU-time here w.r.t. HyperThreading):
```shell
segmentation=bagging_sq hardware=deepsparse
hardware=onnxruntime model=bagging_accurate hardware.engine_settings.execution_providers="['CPUExecutionProvider']"
```
**Environment**

```shell
segmentation=bagging_sq hardware=deepsparse
```