Ideally, I think we should unify them. This might complicate the operator pattern used by backends like XNNPACK, but the code sharing and the simplified representation it brings will be beneficial in the long term.
We defined three functions: choose_qparams_affine_per_block, quantize_affine_per_block, and dequantize_affine_per_block. Please check out the docstrings of these functions in the PR for the definitions.
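To make the intended semantics concrete, here is a minimal per-tensor sketch of the affine scheme in plain Python. The helper names and the per-tensor simplification are my own for illustration; the real ops in the PR operate on torch.Tensors per block (see their docstrings for the actual signatures):

```python
# Simplified, hypothetical per-tensor sketch of affine quantization;
# the real ops in the PR work per block on torch.Tensors.

def choose_qparams_affine(values, quant_min, quant_max):
    """Pick scale/zero_point so [min(values), max(values)] maps onto
    [quant_min, quant_max] (asymmetric/affine)."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # range must include 0.0
    scale = (hi - lo) / (quant_max - quant_min)
    zero_point = round(quant_min - lo / scale)
    return scale, zero_point

def quantize_affine(values, scale, zero_point, quant_min, quant_max):
    # q = clamp(round(x / scale) + zero_point, quant_min, quant_max)
    return [
        max(quant_min, min(quant_max, round(x / scale) + zero_point))
        for x in values
    ]

def dequantize_affine(quants, scale, zero_point):
    # x ~= (q - zero_point) * scale
    return [(q - zero_point) * scale for q in quants]
```

For example, quantizing `[-1.0, 0.0, 0.5, 2.0]` to int8 range `[-128, 127]` and dequantizing recovers each value to within one `scale` of the original, and `0.0` round-trips exactly because `zero_point` is an integer.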
Some Questions
For input and scale/zero_point, what do we do when they have different dtypes, e.g. when the input is fp16 but scales and zero_points are fp32? Do we always convert to fp32 and then do the computation?
Are there concerns about using torch.Tensor for per_tensor quantization instead of plain float/int numbers? It may run slower; are there any concerns on perf?
Are there other ways to choose qparams apart from symmetric and asymmetric?
Clamping to quant_min/quant_max: should we include this in the quantize op or leave it out?
I'm also thinking about the API for end users. I think we could provide a util function to get the block size, e.g. get_block_size(input, {"quant_type": "per_channel_group", "group_size": 32, "axis": -1})
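A rough sketch of what such a util could look like, operating on the input's shape. This is a hypothetical illustration, not an implemented torchao API; the quant_type names and dict keys are taken from the example above, and the block-size conventions (1 along dims that get separate qparams) are assumptions:

```python
# Hypothetical sketch of the proposed get_block_size util.
# Conventions assumed here: the returned tuple is the shape of one
# block that shares a single set of qparams.

def get_block_size(shape, quant_spec):
    """Map a high-level quant spec to the block_size tuple that the
    per_block ops would consume. `shape` is the input tensor's shape."""
    quant_type = quant_spec["quant_type"]
    if quant_type == "per_tensor":
        # one block covering the whole tensor
        return tuple(shape)
    if quant_type == "per_channel":
        # one block per slice along `axis`
        axis = quant_spec["axis"] % len(shape)
        block = list(shape)
        block[axis] = 1
        return tuple(block)
    if quant_type == "per_channel_group":
        # groups of `group_size` along `axis`, separate per other dims
        axis = quant_spec["axis"] % len(shape)
        block = [1] * len(shape)
        block[axis] = quant_spec["group_size"]
        return tuple(block)
    raise ValueError(f"unsupported quant_type: {quant_type}")
```

For a weight of shape (4, 64), the example spec from the text would yield a block size of (1, 32): each row is split into groups of 32 values, each with its own scale/zero_point.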
The PR is here; please feel free to comment on the PR directly: https://github.com/pytorch-labs/ao/pull/159
Context
Currently there are many quantize/dequantize functions in torchao and PyTorch; they mainly differ along a few dimensions.