huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Add `torch.ops.aten._weight_int8pack_mm` for W8A16 inference #219

Closed by dacorvo 4 days ago

dacorvo commented 1 week ago

Since PyTorch 2.3, a new `_weight_int8pack_mm` operation is available to perform a matrix multiplication between float16 / bfloat16 inputs and int8 weights symmetrically quantized along the output features axis (which is exactly what quanto does).

This built-in kernel could be used instead of `dqmm`.

Note: for now it is only available on CPU and MPS devices.
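
For reference, here is a minimal sketch (not quanto's actual code) of how the kernel could replace a dequantize-then-matmul path. It assumes the aten signature `_weight_int8pack_mm(input, int8_weight, scales)` with `input` of shape `[M, K]` in float16/bfloat16, an int8 weight of shape `[N, K]` quantized symmetrically per output feature, and one scale per output feature; the tensor names and shapes below are illustrative only.

```python
import torch

M, K, N = 8, 64, 32
dtype = torch.bfloat16

x = torch.randn(M, K, dtype=dtype)   # activations
w = torch.randn(N, K, dtype=dtype)   # full-precision weight

# Symmetric int8 quantization along the output features axis (one scale per row).
scales = w.abs().amax(dim=1) / 127.0
w_int8 = torch.round(w / scales[:, None]).to(torch.int8)

# Reference path: dequantize the weight, then matmul.
y_ref = x @ (w_int8.to(dtype) * scales[:, None]).T

# Built-in kernel path (CPU / MPS only as of PyTorch 2.3).
y = torch.ops.aten._weight_int8pack_mm(x, w_int8, scales)

print(torch.allclose(y, y_ref, atol=1e-2, rtol=1e-2))
```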