Since PyTorch 2.3, a new `_weight_int8pack_mm` operation is available to perform a matrix multiplication between `float16`/`bfloat16` inputs and `int8` weights symmetrically quantized along the output features axis (which is exactly what quanto is doing).
This built-in kernel could be used instead of `dqmm`.
Note: for now it is only available on CPU and MPS devices.
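As a rough illustration, here is a minimal sketch calling the kernel directly on a weight quantized symmetrically per output channel. This is not quanto's actual integration; the shapes, the scale computation, and the tolerances are illustrative, and it assumes PyTorch >= 2.3 running on a CPU or MPS device.

```python
import torch

# Illustrative shapes: m tokens, k input features, n output features.
m, k, n = 4, 64, 32
x = torch.randn(m, k, dtype=torch.bfloat16)
w = torch.randn(n, k, dtype=torch.bfloat16)

# Symmetric quantization along the output features axis:
# one scale per row of the (n, k) weight, mapping max |w| to 127.
scales = w.abs().amax(dim=1) / 127.0
w_int8 = torch.round(w / scales[:, None]).to(torch.int8)

# Built-in kernel: computes x @ (w_int8 * scales[:, None]).T
# without materializing the dequantized weight.
out = torch._weight_int8pack_mm(x, w_int8, scales)

# Reference result obtained by dequantizing the weight explicitly,
# i.e. what a dqmm-style path would compute.
ref = x @ (w_int8.to(x.dtype) * scales[:, None]).T
print(torch.allclose(out.float(), ref.float(), atol=1e-1, rtol=1e-2))
```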