Open Jerry-Ge opened 11 months ago
Request to enable a simple way to support fused quantized operators
Not sure if this fits with the existing PT2 quant flow. Can you do such fusion post-partitioning in preprocess()?
The support of running PyTorch models natively in quantized representation can really solve the precision issue between PyTorch and TOSA references
And for the precision issue, where do you expect such a PyTorch model to run when comparing against TOSA? A PyTorch eager model would be doing something like this, from a quant-arithmetic point of view, by running qadd on a backend like QNNPACK.
CC @jerryzh168, @kimishpatel
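To make the quant arithmetic mentioned above concrete, here is a minimal, hedged sketch of int8 qadd in plain Python. All names and the rounding choice are illustrative assumptions, not the actual QNNPACK kernel; the point is only to show where the scale/zero-point arithmetic lives.

```python
# Hypothetical sketch of int8 quantized add (qadd) arithmetic, similar in
# spirit to what an eager-mode backend computes. Illustrative only.

def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to int8 using round-to-nearest, then clamp."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

def qadd(qa, a_scale, a_zp, qb, b_scale, b_zp, out_scale, out_zp):
    """Reference qadd: dequantize both inputs, add in float, requantize.

    A real backend folds this into integer arithmetic, which is exactly
    where small numeric differences between implementations creep in.
    """
    acc = dequantize(qa, a_scale, a_zp) + dequantize(qb, b_scale, b_zp)
    return quantize(acc, out_scale, out_zp)

# 1.0 and 2.0 with scale 0.1 and zero-point 0 quantize to 10 and 20;
# their sum requantized with scale 0.1 is 30.
print(qadd(10, 0.1, 0, 20, 0.1, 0, 0.1, 0))  # → 30
```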
@Jerry-Ge are you requesting this for partitioning purposes? You can add quantization_tag during annotation in the quantizer. For example if you add node.meta["quantization_tag"] = "my_q_add" on fp32 add then all three nodes in [dq -> fp32 add -> q] will have that quantization tag on it.
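A schematic sketch of the tagging idea above, using plain dicts to stand in for torch.fx nodes. The real mechanism lives inside the PT2E quantizer during annotation; this toy only illustrates how one tag ends up on all three nodes of the [dq -> fp32 add -> q] pattern.

```python
# Schematic sketch (plain dicts standing in for fx nodes) of setting
# node.meta["quantization_tag"] on a [dq -> fp32 add -> q] pattern.
# Op names are placeholders, not the real edge-dialect schema.

def annotate_add(nodes, tag="my_q_add"):
    """Tag the fp32 add and the q/dq nodes surrounding it."""
    for node in nodes:
        if node["op"] in ("dequantize_per_tensor", "add", "quantize_per_tensor"):
            node.setdefault("meta", {})["quantization_tag"] = tag
    return nodes

pattern = [
    {"op": "dequantize_per_tensor"},
    {"op": "add"},
    {"op": "quantize_per_tensor"},
]
annotate_add(pattern)
tags = [n["meta"]["quantization_tag"] for n in pattern]
print(tags)  # → ['my_q_add', 'my_q_add', 'my_q_add']
```

A partitioner can then group nodes by that tag instead of re-matching the q/dq structure.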
The support of running PyTorch models natively in quantized representation can really solve the precision issue between PyTorch and TOSA references
For the reference representation, there are some operators that have integer decompositions, and if those align with the numerics of TOSA, you can generate a quantized model during the "convert" step to use the reference implementation. https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantize_pt2e.py#L203
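For intuition on what such an integer decomposition looks like, here is a hedged, torch-free sketch of integer-only quantized add: each input is rescaled into the output's quantization domain with a fixed-point multiplier and shift, which is the style of arithmetic TOSA's RESCALE op uses. The exact multiplier derivation and rounding below are illustrative assumptions, not the PyTorch reference decomposition.

```python
# Hedged sketch of an integer-only decomposition of quantized add.
# No float arithmetic appears on the accumulation path.

def to_fixed_point(scale: float, bits: int = 30):
    """Approximate a float scale as multiplier * 2**-bits (assumption)."""
    return round(scale * (1 << bits)), bits

def rescale(acc: int, multiplier: int, shift: int) -> int:
    """Integer rescale with round-half-up on the discarded bits."""
    rounding = 1 << (shift - 1)
    return (acc * multiplier + rounding) >> shift

def int_qadd(qa, a_scale, a_zp, qb, b_scale, b_zp, out_scale, out_zp):
    m_a, shift = to_fixed_point(a_scale / out_scale)
    m_b, _ = to_fixed_point(b_scale / out_scale)
    acc = rescale(qa - a_zp, m_a, shift) + rescale(qb - b_zp, m_b, shift)
    return max(-128, min(127, acc + out_zp))

print(int_qadd(10, 0.1, 0, 20, 0.1, 0, 0.1, 0))  # → 30
```

Whether this matches TOSA bit-exactly depends on the rounding mode and multiplier width, which is precisely the alignment question raised above.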
I have a very similar problem where I need to match fused operators with quantization enabled. Typically, I need to match the following pattern in the edge graph, between the brackets ...
x -> q -> { dq -> op1 -> q -> dq -> op2 -> q } -> dq -> y
... and do two things with it:
- Replace it with torch.ops.<my_hw>.my_op, which will execute a bit-exact quantized reference version of the fused op on x86 in a torch environment. I would use it to study quantization performance against the fused float version and also the overall model accuracy. The dq and q nodes at the border have the necessary scale/zero-point data that I can use in my own reference op. The torch reference quantized version won't work for me, as it won't match the numerics of the actual HW.
- Delegate the same pattern to my backend and fused op that will run on real HW.
Is building a custom quantizer the right way to go about it? With the customized quantizer, I'll have a dq -> op1 -> op2 -> q pattern, and I can easily match the op1 -> op2 pattern via get_source_partitions. Otherwise, the intervening dq and q nodes make pattern matching of fused ops quite complicated.
If there is a simpler/canonical way to accomplish this, please let me know.
Thank you!
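The matching problem described above can be illustrated with a toy sketch: a graph modeled as a flat list of op names, and a fused pattern matched with a sliding window. The op names are placeholders, not real edge-dialect ops.

```python
# Toy sketch of why intervening q/dq nodes complicate fused-op matching.

def match(ops, pattern):
    """Return start indices where `pattern` occurs contiguously in `ops`."""
    n = len(pattern)
    return [i for i in range(len(ops) - n + 1) if ops[i:i + n] == pattern]

# With per-op quantization, q/dq nodes sit between op1 and op2, so the
# fused pattern never matches contiguously:
noisy = ["q", "dq", "op1", "q", "dq", "op2", "q", "dq"]
print(match(noisy, ["op1", "op2"]))  # → []

# After a custom quantizer treats op1->op2 as one quantized unit, the
# intervening q/dq pair disappears and the match is trivial:
fused = ["q", "dq", "op1", "op2", "q", "dq"]
print(match(fused, ["op1", "op2"]))  # → [2]
```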
Thanks Digant!
Not sure if this fits with the existing PT2 quant flow. Can you do such fusion post-partitioning in preprocess()?
I think I can. Everything is working right now. The goal of this ticket is to have something simpler that we can use directly : )
And for precision issue, where do you expect such PyTorch model to run when comparing against TOSA?
@Jerry-Ge are you requesting this for partitioning purposes? You can add quantization_tag during annotation in the quantizer. For example if you add node.meta["quantization_tag"] = "my_q_add" on fp32 add then all three nodes in [dq -> fp32 add -> q] will have that quantization tag on it.
Thanks Kimish! Not really, my very simple goal is to directly read the scale and zp info when visiting the quant_add node so I don't need to walk back.
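The "walk back" being avoided here can be sketched with plain tuples standing in for graph nodes: the scale/zero-point of a quantized add's input must be recovered from the dequantize node feeding it. The node layout is an illustrative assumption, not the real edge-dialect schema.

```python
# Sketch of walking back from an op to its dq producer to read qparams.
# Node layout: (op, input_node, scale, zero_point); None where unused.

q_node = ("quantize_per_tensor", None, 0.1, 0)
dq_node = ("dequantize_per_tensor", q_node, 0.1, 0)
add_node = ("add", dq_node, None, None)

def input_qparams(node):
    """Walk back to the dq node feeding `node` and read (scale, zp)."""
    producer = node[1]
    assert producer[0] == "dequantize_per_tensor"
    return producer[2], producer[3]

print(input_qparams(add_node))  # → (0.1, 0)
```

A fused quantized op such as QUANT_ADD would carry these qparams as its own arguments, so no walk-back would be needed.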
The support of running PyTorch models natively in quantized representation can really solve the precision issue between PyTorch and TOSA references
For the reference representation, there are some operators that have integer decompositions, and if those align with the numerics of TOSA, you can generate a quantized model during the "convert" step to use the reference implementation. https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantize_pt2e.py#L203
Ah, I don't know about this. Let me try it.
For matching arbitrary pattern, you can use
- get_source_partition and friends
- quantization_tag as I mentioned above (please let me know if that's not clear)
- Use your own subgraph matcher. SubgraphMatcher is a util in PyTorch. To do this you will write a python module or callable with the pattern you expect, then quantize it and export it. This gives you a graph that you can match, and replace it with the pattern that you want. See some examples here and let us know if it is not clear: https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/qat_utils.py#L554
Each has its pros and cons.
Thanks. I ended up using SubgraphMatcher (using https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/representation/rewrite.py as an example). It worked fine except for the need to rewrite remove_tensor_overload_for_qdq_ops using edge ops (exir_ops.edge.quantized_decomposed.quantize_per_tensor instead of torch.ops.quantized_decomposed.quantize_per_tensor).
The pattern I am matching is: q -> { dq -> op1 -> op2 -> q } -> dq, with the pattern inside {} replaced by my fixed-point custom HW op model.
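The replacement step can be sketched on the same toy flat-list representation of a graph: once the bracketed span { dq -> op1 -> op2 -> q } is located, it is swapped for a single custom fused op. The names here are placeholders for the real HW op, not actual registered operators.

```python
# Toy sketch of replacing a matched quantized span with one fused op.

def replace_span(ops, pattern, fused_name):
    """Replace the first contiguous occurrence of `pattern` with one node."""
    n = len(pattern)
    for i in range(len(ops) - n + 1):
        if ops[i:i + n] == pattern:
            return ops[:i] + [fused_name] + ops[i + n:]
    return ops  # no match: graph unchanged

graph = ["q", "dq", "op1", "op2", "q", "dq"]
print(replace_span(graph, ["dq", "op1", "op2", "q"], "my_hw.my_op"))
# → ['q', 'my_hw.my_op', 'dq']
```

The surviving border q/dq nodes are exactly where the scale/zero-point data for the fused op comes from.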
The quantization tag is also clear to me. I'll try this as well to see if this works better.
Orthogonally, the question I have is: why do you quantize op1 and op2 separately when you have a fused quantized op that is op1->op2? Can you not quantize them together, which would be more accurate?
Yes. Originally, I didn't plan to write my own quantizer (I was using XNNPACK earlier). I ended up deciding to write a customized quantizer that correctly matches our custom fusing and this works well.
Can we close this?
Can we close this?
Thanks Digant for bringing this up. I haven't touched this in a while. Let me go back to it and get back to you later.
So the short conclusion is: let's keep it open for now.
@Jerry-Ge Any action item on this issue? If not, can we close?
Background and Request:
In the current quantization flow (e.g., using the XNNPACK Quantizer), a quantized operator in edge dialect is represented as a [dq -> fp32 op -> q] pattern.
Request to enable a simple way to support fused quantized operators like
QUANT_ADD(input_A_int8 (scale, zp), input_B_int8 (scale, zp))
to get rid of those Q and DQ nodes. Two benefits: