microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Documentation] Int4 Blockwise Quantization Documentation #17122

Open zjc664656505 opened 1 year ago

zjc664656505 commented 1 year ago

Describe the documentation issue

I recently saw that onnxruntime added the int4 blockwise quantization feature, and I would like to know whether there is any associated documentation for this new feature so that we can build and test it.

Page / URL

No response
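
For context on the feature being asked about: int4 blockwise quantization splits each weight tensor into fixed-size blocks and stores one scale per block, so every weight is represented in 4 bits. The snippet below is a minimal numpy illustration of the idea, not ONNX Runtime's actual kernel or packing format; the block size of 32 and the symmetric [-8, 7] range are assumptions.

    # Illustrative blockwise int4 quantization: one fp32 scale per block of weights,
    # values rounded into the signed 4-bit range [-8, 7]. Not the ORT implementation.
    import numpy as np

    def quantize_int4_blockwise(weights, block_size=32):
        flat = weights.astype(np.float32).ravel()
        pad = (-flat.size) % block_size
        flat = np.pad(flat, (0, pad))                    # pad so blocks divide evenly
        blocks = flat.reshape(-1, block_size)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # symmetric scale per block
        scales[scales == 0] = 1.0                        # guard against all-zero blocks
        q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
        return q, scales.squeeze(1)

    def dequantize_int4_blockwise(q, scales):
        return q.astype(np.float32) * scales[:, None]

    w = np.random.randn(4, 64).astype(np.float32)
    q, s = quantize_int4_blockwise(w)
    w_hat = dequantize_int4_blockwise(q, s).ravel()[: w.size].reshape(w.shape)
    print("max abs error:", np.abs(w - w_hat).max())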

trajepl commented 1 year ago


Gentle ping. I tried with Intel/bert-base-uncased-mrpc (418 MB). I can get the quantized model (145 MB), but I cannot initialize an inference session with it. The failure shows: "[ShapeInferenceError] 4b quantization not yet supported on this hardware platform!"

Which hardware platforms support 4-bit quantization? Could you share more details?
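
For reference, the error above is raised while the inference session is being created; a minimal reproduction (the model path is just an example) looks like this:

    import onnxruntime as ort

    # Loading the int4-quantized model; on CPUs without the required instruction
    # set this raises the "[ShapeInferenceError] 4b quantization not yet supported
    # on this hardware platform!" error quoted above.
    try:
        sess = ort.InferenceSession("bert_int4.onnx", providers=["CPUExecutionProvider"])
        print("Session created; the 4-bit MatMul kernel is available on this machine.")
    except Exception as e:
        print("Failed to create session:", e)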

trajepl commented 1 year ago

Reproduce steps:

  1. Install the latest ort-nightly package.
  2. pip install git+https://github.com/microsoft/Olive#egg=olive-ai[cpu]
  3. git clone https://github.com/microsoft/Olive; cd Olive; git checkout ort_passes/matmul_4bit_quant
  4. cd examples/bert
  5. Run python -m olive.workflows.run --config bert_ptq_cpu_4bit_quant.json, or debug with the following code:
    from olive.workflows import run as olive_run
    olive_run("./bert_ptq_cpu_4bit_quant.json")
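
For comparison, the Olive pass driven by that config wraps ONNX Runtime's MatMul 4-bit quantization tooling. A rough standalone equivalent is sketched below, assuming the MatMul4BitsQuantizer class shipped with recent ort-nightly builds; constructor arguments may differ between versions, and the model paths are examples.

    import onnx
    from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

    # Load the fp32 model exported to ONNX (path is an example).
    model = onnx.load("bert-base-uncased-mrpc.onnx")

    # Quantize MatMul weights to int4 with a block size of 32, symmetric scheme.
    quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
    quantizer.process()

    # Save the quantized model; large initializers go to an external data file.
    quantizer.model.save_model_to_file("bert_int4.onnx", use_external_data_format=True)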
Jay19751103 commented 11 months ago

I also have the same problem when running inference with a 4-bit ONNX model. ONNX Runtime checks processor features to select the kernel implementation:

// Check if the processor supports AVX512 core features
// (AVX512BW/AVX512DQ/AVX512VL).

https://github.com/microsoft/onnxruntime/blob/209b6dbd975efbc792b5ca9ae1dd74b828559148/onnxruntime/core/mlas/lib/platform.cpp#L401
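
To check whether a given machine exposes the AVX512 core features that this code path requires, one option is the third-party py-cpuinfo package (not part of ONNX Runtime); the flag names below follow the Linux /proc/cpuinfo convention:

    # Sketch: report whether the CPU advertises the AVX512 core features
    # (AVX512F/BW/DQ/VL) that the int4 MatMul kernel checks for on x86-64.
    # Requires `pip install py-cpuinfo`.
    import cpuinfo

    flags = set(cpuinfo.get_cpu_info().get("flags", []))
    required = {"avx512f", "avx512bw", "avx512dq", "avx512vl"}
    missing = required - flags

    if missing:
        print("Missing AVX512 core features:", sorted(missing))
    else:
        print("AVX512 core features present; the 4-bit kernel path should be selected.")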