huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Use `torch.ops.aten._weight_int4pack_mm` for W4A16 inference #218

Open dacorvo opened 1 week ago

dacorvo commented 1 week ago

Since PyTorch 2.2, a new `_weight_int4pack_mm` operation is available to perform a matrix multiplication between float16 / bfloat16 inputs and int4 weights quantized group-wise along the output features axis (which is exactly what GPTQ, AWQ and quanto do).

This built-in kernel could be used instead of the custom AWQ kernel, which has several restrictions (a fixed group_size of 128 in particular).

Note: this new operation requires a specific packing of the int4 data and uses a floating-point zero-point.
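
For reference, here is a minimal sketch of how the op might be invoked. The companion packing helper `torch._convert_weight_to_int4pack`, the argument order, the bfloat16/CUDA constraints and the `[in_features // group_size, out_features, 2]` scale/zero-point layout reflect my reading of the aten signatures and would need to be checked against the actual kernel:

```python
import torch

# Sketch only: shapes and dtypes below are assumptions about the kernel's contract.
out_features, in_features, group_size = 4096, 4096, 128

# bfloat16 activations (the CUDA kernel appears to accept bfloat16 inputs only).
x = torch.randn(8, in_features, dtype=torch.bfloat16, device="cuda")

# Random unsigned int4 values (stored as int32 in [0, 15]) packed into the
# tiled layout the kernel expects, via the companion aten op.
w_int32 = torch.randint(0, 16, (out_features, in_features), dtype=torch.int32, device="cuda")
w_packed = torch._convert_weight_to_int4pack(w_int32, 8)  # innerKTiles=8

# One (scale, zero-point) pair per group, interleaved on the last dimension:
# shape [in_features // group_size, out_features, 2], floating point.
scales_and_zeros = torch.randn(in_features // group_size, out_features, 2,
                               dtype=torch.bfloat16, device="cuda")

y = torch.ops.aten._weight_int4pack_mm(x, w_packed, group_size, scales_and_zeros)
print(y.shape)  # torch.Size([8, 4096])
```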

dacorvo commented 1 week ago

Starting from a `QBitsTensor` `qweight`, the formulas to obtain the data, scales and zero-point seem to be:

data = optimum.quanto.ungroup(qweight._data.unpack(), axis=0, orig_shape=qweight.shape)
scale = qweight._scale.view(qweight.shape[0], -1)
zeropoint = ((2**(4 - 1) - qweight._zeropoint) * qweight._scale).view(qweight.shape[0], -1)
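
If those formulas hold, a small (untested) helper could assemble the arguments expected by `torch.ops.aten._weight_int4pack_mm` from a `QBitsTensor`. The int32 conversion before packing and the interleaved `[in_features // group_size, out_features, 2]` layout of the scales and zero-points are assumptions on my side:

```python
import torch
from optimum.quanto import ungroup


def to_int4pack_args(qweight, inner_k_tiles=8):
    # Hypothetical glue code applying the formulas above to a QBitsTensor.
    out_features, in_features = qweight.shape

    # Recover one int4 value per element in the original (out, in) layout.
    data = ungroup(qweight._data.unpack(), axis=0, orig_shape=qweight.shape)
    # One scale / floating-point zero-point per (output feature, group) pair.
    scale = qweight._scale.view(out_features, -1)
    zeropoint = ((2 ** (4 - 1) - qweight._zeropoint) * qweight._scale).view(out_features, -1)

    # Pack the int4 data into the kernel's tiled layout (int32 input assumed).
    w_packed = torch._convert_weight_to_int4pack(data.to(torch.int32), inner_k_tiles)

    # Interleave scales and zero-points as [in_features // group_size, out_features, 2].
    scales_and_zeros = (
        torch.stack([scale, zeropoint], dim=-1).permute(1, 0, 2).contiguous().to(torch.bfloat16)
    )
    group_size = in_features // scale.shape[1]
    return w_packed, group_size, scales_and_zeros
```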