Ideally, I think we should unify them. This might complicate the operator pattern used by backends like XNNPACK, but the code sharing and the simplified representation it brings will be beneficial in the long term.
We defined three functions: choose_qparams_affine_per_block, quantize_affine_per_block, and dequantize_affine_per_block. Please check out the docstrings of these functions in the PR for the definitions.
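To make the intended semantics concrete, here is a minimal per-tensor sketch of the affine scheme in plain Python. The helper names and the per-tensor simplification are my own for illustration; the real ops in the PR operate on torch.Tensors per block (see their docstrings for the actual signatures):

```python
# Simplified, hypothetical per-tensor sketch of affine quantization;
# the real ops in the PR work per block on torch.Tensors.

def choose_qparams_affine(values, quant_min, quant_max):
    """Pick scale/zero_point so [min(values), max(values)] maps onto
    [quant_min, quant_max] (asymmetric/affine)."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # range must include 0.0
    scale = (hi - lo) / (quant_max - quant_min)
    zero_point = round(quant_min - lo / scale)
    return scale, zero_point

def quantize_affine(values, scale, zero_point, quant_min, quant_max):
    # q = clamp(round(x / scale) + zero_point, quant_min, quant_max)
    return [
        max(quant_min, min(quant_max, round(x / scale) + zero_point))
        for x in values
    ]

def dequantize_affine(quants, scale, zero_point):
    # x ~= (q - zero_point) * scale
    return [(q - zero_point) * scale for q in quants]
```

For example, quantizing `[-1.0, 0.0, 0.5, 2.0]` to int8 range `[-128, 127]` and dequantizing recovers each value to within one `scale` of the original, and `0.0` round-trips exactly because `zero_point` is an integer.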
Some Questions
For input and scale/zero_point, what do we do when they have different dtypes, e.g. when the input is fp16 but scales and zero_points are fp32? Do we always convert to fp32 and then do the computation?
Are there concerns about using torch.Tensor for per_tensor quantization instead of plain float/int numbers? It may run slower; are there any concerns on perf?
Are there other ways to choose qparams apart from symmetric and asymmetric?
Clamping to quant_min/quant_max: should we include this in the quantize op or leave it out?
I'm also thinking about the API for end users. I think we could provide a util function to get the block size, e.g. get_block_size(input, {"quant_type": "per_channel_group", "group_size": 32, "axis": -1})
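A rough sketch of what such a util could look like, operating on the input's shape. This is a hypothetical illustration, not an implemented torchao API; the quant_type names and dict keys are taken from the example above, and the block-size conventions (1 along dims that get separate qparams) are assumptions:

```python
# Hypothetical sketch of the proposed get_block_size util.
# Conventions assumed here: the returned tuple is the shape of one
# block that shares a single set of qparams.

def get_block_size(shape, quant_spec):
    """Map a high-level quant spec to the block_size tuple that the
    per_block ops would consume. `shape` is the input tensor's shape."""
    quant_type = quant_spec["quant_type"]
    if quant_type == "per_tensor":
        # one block covering the whole tensor
        return tuple(shape)
    if quant_type == "per_channel":
        # one block per slice along `axis`
        axis = quant_spec["axis"] % len(shape)
        block = list(shape)
        block[axis] = 1
        return tuple(block)
    if quant_type == "per_channel_group":
        # groups of `group_size` along `axis`, separate per other dims
        axis = quant_spec["axis"] % len(shape)
        block = [1] * len(shape)
        block[axis] = quant_spec["group_size"]
        return tuple(block)
    raise ValueError(f"unsupported quant_type: {quant_type}")
```

For a weight of shape (4, 64), the example spec from the text would yield a block size of (1, 32): each row is split into groups of 32 values, each with its own scale/zero_point.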
The PR is here; please feel free to comment on the PR directly: https://github.com/pytorch-labs/ao/pull/159
Context
Currently there are many quantize/dequantize functions in torchao and PyTorch; they mainly differ along a few dimensions.