huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Add `torch.ops.aten._weight_int8pack_mm` for W8A16 inference #219

Closed by dacorvo 4 days ago

dacorvo commented 1 week ago

Since PyTorch 2.3, a new `_weight_int8pack_mm` operation is available to perform a matrix multiplication between float16 / bfloat16 inputs and int8 weights symmetrically quantized along the output features axis (which is exactly what quanto does).

This built-in kernel could be used instead of `dqmm`.

Note: for now it is only available on CPU and MPS devices.
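
For reference, here is a minimal sketch (not quanto's actual code) of how the kernel could replace a dequantize-then-matmul path. It assumes the aten signature `_weight_int8pack_mm(input, int8_weight, scales)` with `input` of shape `[M, K]` in float16/bfloat16, an int8 weight of shape `[N, K]` quantized symmetrically per output feature, and one scale per output feature; the tensor names and shapes below are illustrative only.

```python
import torch

M, K, N = 8, 64, 32
dtype = torch.bfloat16

x = torch.randn(M, K, dtype=dtype)   # activations
w = torch.randn(N, K, dtype=dtype)   # full-precision weight

# Symmetric int8 quantization along the output features axis (one scale per row).
scales = w.abs().amax(dim=1) / 127.0
w_int8 = torch.round(w / scales[:, None]).to(torch.int8)

# Reference path: dequantize the weight, then matmul.
y_ref = x @ (w_int8.to(dtype) * scales[:, None]).T

# Built-in kernel path (CPU / MPS only as of PyTorch 2.3).
y = torch.ops.aten._weight_int8pack_mm(x, w_int8, scales)

print(torch.allclose(y, y_ref, atol=1e-2, rtol=1e-2))
```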