huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Use `torch.ops.aten._weight_int4pack_mm` for W4A16 inference #218

Open dacorvo opened 1 week ago

dacorvo commented 1 week ago

Since PyTorch 2.2, a new `_weight_int4pack_mm` operation is available to perform a matrix multiplication between float16 / bfloat16 inputs and int4 weights quantized group-wise along the output features axis (which is exactly what GPTQ, AWQ and quanto do).

This built-in kernel could be used instead of the custom AWQ kernel, which has several restrictions (a fixed group_size of 128 in particular).

Note: this new operation requires a specific packing of the int4 data and uses a floating-point zero-point.
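
For reference, here is a minimal sketch of how the op might be invoked. The companion packing helper `torch._convert_weight_to_int4pack`, the argument order, the bfloat16/CUDA constraints and the `[in_features // group_size, out_features, 2]` scale/zero-point layout reflect my reading of the aten signatures and would need to be checked against the actual kernel:

```python
import torch

# Sketch only: shapes and dtypes below are assumptions about the kernel's contract.
out_features, in_features, group_size = 4096, 4096, 128

# bfloat16 activations (the CUDA kernel appears to accept bfloat16 inputs only).
x = torch.randn(8, in_features, dtype=torch.bfloat16, device="cuda")

# Random unsigned int4 values (stored as int32 in [0, 15]) packed into the
# tiled layout the kernel expects, via the companion aten op.
w_int32 = torch.randint(0, 16, (out_features, in_features), dtype=torch.int32, device="cuda")
w_packed = torch._convert_weight_to_int4pack(w_int32, 8)  # innerKTiles=8

# One (scale, zero-point) pair per group, interleaved on the last dimension:
# shape [in_features // group_size, out_features, 2], floating point.
scales_and_zeros = torch.randn(in_features // group_size, out_features, 2,
                               dtype=torch.bfloat16, device="cuda")

y = torch.ops.aten._weight_int4pack_mm(x, w_packed, group_size, scales_and_zeros)
print(y.shape)  # torch.Size([8, 4096])
```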

dacorvo commented 1 week ago

Starting from a `QBitsTensor` `qweight`, the formulas to obtain the data, scales and zero-point seem to be:

data = optimum.quanto.ungroup(qweight._data.unpack(), axis=0, orig_shape=qweight.shape)
scale = qweight._scale.view(qweight.shape[0], -1)
zeropoint = ((2**(4 - 1) - qweight._zeropoint) * qweight._scale).view(qweight.shape[0], -1)
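
If those formulas hold, a small (untested) helper could assemble the arguments expected by `torch.ops.aten._weight_int4pack_mm` from a `QBitsTensor`. The int32 conversion before packing and the interleaved `[in_features // group_size, out_features, 2]` layout of the scales and zero-points are assumptions on my side:

```python
import torch
from optimum.quanto import ungroup


def to_int4pack_args(qweight, inner_k_tiles=8):
    # Hypothetical glue code applying the formulas above to a QBitsTensor.
    out_features, in_features = qweight.shape

    # Recover one int4 value per element in the original (out, in) layout.
    data = ungroup(qweight._data.unpack(), axis=0, orig_shape=qweight.shape)
    # One scale / floating-point zero-point per (output feature, group) pair.
    scale = qweight._scale.view(out_features, -1)
    zeropoint = ((2 ** (4 - 1) - qweight._zeropoint) * qweight._scale).view(out_features, -1)

    # Pack the int4 data into the kernel's tiled layout (int32 input assumed).
    w_packed = torch._convert_weight_to_int4pack(data.to(torch.int32), inner_k_tiles)

    # Interleave scales and zero-points as [in_features // group_size, out_features, 2].
    scales_and_zeros = (
        torch.stack([scale, zeropoint], dim=-1).permute(1, 0, 2).contiguous().to(torch.bfloat16)
    )
    group_size = in_features // scale.shape[1]
    return w_packed, group_size, scales_and_zeros
```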