dacorvo opened 1 week ago
Starting from a `QBitsTensor` `qweight`, the formulas to obtain the data, scale and zero-point seem to be:

```python
data = optimum.quanto.ungroup(qweight._data.unpack(), axis=0, orig_shape=qweight.shape)
scale = qweight._scale.view(qweight.shape[0], -1)
zeropoint = ((2**(4 - 1) - qweight._zeropoint) * qweight._scale).view(qweight.shape[0], -1)
```
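A quick sanity check of the zeropoint formula, as a minimal sketch with scalar values: assuming quanto dequantizes as `(data - zeropoint) * scale` with an integer zero-point, the floating-point zero-point `(2**(4 - 1) - zeropoint) * scale` derived above makes a torch-style dequantization `(data - 2**(4 - 1)) * scale + float_zeropoint` produce the same value. The concrete numbers below are illustrative, not taken from a real checkpoint.

```python
# Illustrative scalar values (not from a real checkpoint)
data = 11     # an unpacked int4 value in [0, 15]
int_zp = 7    # quanto-style integer zero-point
scale = 0.25  # per-group scale

# quanto-style dequantization: (data - zeropoint) * scale
w_quanto = (data - int_zp) * scale

# torch-style dequantization with a floating-point zero-point,
# using float_zp = (2**(4 - 1) - int_zp) * scale as in the formula above
float_zp = (2 ** (4 - 1) - int_zp) * scale
w_torch = (data - 2 ** (4 - 1)) * scale + float_zp

assert abs(w_quanto - w_torch) < 1e-9
```

The two conventions agree term-by-term: `(data - 8) * scale + (8 - int_zp) * scale == (data - int_zp) * scale`.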
Since PyTorch 2.2, a new `torch._weight_int4pack_mm` operation is available to perform a matrix multiplication between `float16`/`bfloat16` inputs and `int4` weights quantized group-wise along the output features axis (which is exactly what `gptq`, `awq` and `quanto` do). This built-in kernel could be used instead of the custom AWQ kernel, which has several restrictions (a `group_size` of 128 in particular).
Note: this new operation requires a specific packing of the `int4` data, and uses a floating-point zeropoint.
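To illustrate why repacking is needed at all, here is the naive nibble packing (two unsigned int4 values per byte). Note this is NOT the layout `_weight_int4pack_mm` expects: PyTorch produces its own tiled layout via `torch._convert_weight_to_int4pack`, so `QBitsTensor` data would have to be unpacked and then repacked through that routine. The helpers below are purely illustrative.

```python
def pack_int4(values):
    """Naive packing: two unsigned int4 values per byte (high nibble first)."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes((values[i] << 4) | values[i + 1] for i in range(0, len(values), 2))

def unpack_int4(packed):
    """Inverse of pack_int4: split each byte back into two int4 values."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out

vals = [3, 12, 0, 15, 7, 8]
assert unpack_int4(pack_int4(vals)) == vals
```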