pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Support of Fused Quantized Operators #1230

Open Jerry-Ge opened 11 months ago

Jerry-Ge commented 11 months ago

Background and Request:

In the current quantization flow (e.g., using XNNPack Quantizer), a quantized operator in edge dialect is something like

Q->DQ->[ADD_FP32]->Q->DQ

Request to enable a simple way to support fused quantized operators like QUANT_ADD(input_A_int8 (scale, zp), input_B_int8 (scale, zp)) to get rid of those Q and DQ nodes.
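For illustration, here is a rough sketch of the integer arithmetic such a fused op would encapsulate (the quant_add name and signature below are hypothetical, not an existing op):

```python
import torch

def quant_add(a_int8, a_scale, a_zp, b_int8, b_scale, b_zp, out_scale, out_zp):
    # Hypothetical fused quantized add: consumes two int8 tensors plus their
    # qparams and produces a requantized int8 result in a single node,
    # instead of materializing Q/DQ nodes around an fp32 add.
    a_fp = (a_int8.to(torch.float32) - a_zp) * a_scale
    b_fp = (b_int8.to(torch.float32) - b_zp) * b_scale
    out = torch.round((a_fp + b_fp) / out_scale) + out_zp
    return out.clamp(-128, 127).to(torch.int8)
```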

Two benefits:

digantdesai commented 11 months ago

Request to enable a simple way to support fused quantized operators

Not sure if this fits with the existing PT2 quant flow. Can you do such fusion post-partitioning in preprocess()?

Support for running PyTorch models natively in quantized representation could really solve the precision issue between PyTorch and the TOSA reference

And for the precision issue, where do you expect such a PyTorch model to run when comparing against TOSA? A PyTorch eager model would be doing something like this, from a quant-arithmetic point of view, by running qadd on a backend like QNNPACK.

CC @jerryzh168, @kimishpatel

kimishpatel commented 11 months ago

@Jerry-Ge are you requesting this for partitioning purposes? You can add quantization_tag during annotation in the quantizer. For example, if you add node.meta["quantization_tag"] = "my_q_add" on the fp32 add, then all three nodes in [dq -> fp32 add -> q] will carry that quantization tag.
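A minimal sketch of where that annotation would go, assuming a custom quantizer (the class name and tag value are placeholders):

```python
import torch
from torch.ao.quantization.quantizer import Quantizer

class MyQuantizer(Quantizer):
    def annotate(self, model: torch.fx.GraphModule) -> torch.fx.GraphModule:
        for node in model.graph.nodes:
            if node.op == "call_function" and node.target is torch.ops.aten.add.Tensor:
                # ... set node.meta["quantization_annotation"] here as usual ...
                # Tag the fp32 add; the dq/q nodes inserted around it will
                # carry the same tag, which can then drive partitioning.
                node.meta["quantization_tag"] = "my_q_add"
        return model

    def validate(self, model: torch.fx.GraphModule) -> None:
        pass
```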

kimishpatel commented 11 months ago

Support for running PyTorch models natively in quantized representation could really solve the precision issue between PyTorch and the TOSA reference

For the reference representation, there are some operators that have an integer decomposition, and if those align with the numerics of TOSA, you can generate a quantized model during the "convert" step that uses the reference implementation. https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantize_pt2e.py#L203
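A minimal sketch of that flow as I understand it; exported_model, example_inputs, and the XNNPACKQuantizer choice are placeholders:

```python
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

# exported_model: the pre-autograd exported GraphModule from the standard
# PT2E flow; example_inputs: calibration inputs. Both are placeholders.
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(exported_model, quantizer)
prepared(*example_inputs)  # calibration

# use_reference_representation=True asks convert_pt2e to emit the integer
# ("reference") decompositions instead of q/dq nodes around fp32 ops.
converted = convert_pt2e(prepared, use_reference_representation=True)
```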

rvijayc commented 11 months ago

I have a very similar problem where I need to match fused operators with quantization enabled. Typically, I need to match the following pattern in the edge graph, between the brackets ...

x -> q -> { dq -> op1 -> q -> dq -> op2 -> q } -> dq  -> y

... and do two things with it:

  1. Replace it with torch.ops.<my_hw>.my_op which will execute a bit-exact quantized reference version of the fused op on x86 in a torch environment. I would use it to study quantization performance against the fused float version and also the overall model accuracy. The dq and q nodes at the border have the necessary scale/zero-point data that I can use in my own reference op. The torch reference quantized version won't work for me as it won't match the numerics of the actual HW.
  2. Delegate the same pattern to my backend and fused op that will run on real HW.

Is building a custom quantizer the right way to go about it? With the customized quantizer, I'll have a dq -> op1 -> op2 -> q pattern, and I can easily match the op1 -> op2 pattern via get_source_partitions. Otherwise, the intervening dq and q nodes make pattern matching of fused ops quite complicated.
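For context, this is roughly the matching I have in mind once the intervening q/dq are gone (a sketch; Conv2d/ReLU stand in for op1/op2, and graph_module is a placeholder for the quantized graph):

```python
import torch
from torch.fx.passes.utils.source_matcher_utils import get_source_partitions

# Conv2d and ReLU stand in for op1 and op2; graph_module is a placeholder
# for the exported, quantized GraphModule.
partitions = get_source_partitions(graph_module.graph, [torch.nn.Conv2d, torch.nn.ReLU])

conv_parts = partitions.get(torch.nn.Conv2d, [])
relu_parts = partitions.get(torch.nn.ReLU, [])
for conv in conv_parts:
    for relu in relu_parts:
        # Candidate for fusion: the conv partition's output feeds the relu partition.
        if set(conv.output_nodes) & set(relu.input_nodes):
            print("fusable pair:", conv.nodes, relu.nodes)
```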

If there is a simpler/canonical way to accomplish this, please let me know.

Thank You!

kimishpatel commented 11 months ago

I have a very similar problem where I need to match fused operators with quantization enabled. Typically, I need to match the following pattern in the edge graph, between the brackets ...

x -> q -> { dq -> op1 -> q -> dq -> op2 -> q } -> dq  -> y

... and do two things with it:

  1. Replace it with torch.ops.<my_hw>.my_op which will execute a bit-exact quantized reference version of the fused op on x86 in a torch environment. I would use it to study quantization performance against the fused float version and also the overall model accuracy. The dq and q nodes at the border have the necessary scale/zero-point data that I can use in my own reference op. The torch reference quantized version won't work for me as it won't match the numerics of the actual HW.
  2. Delegate the same pattern to my backend and fused op that will run on real HW.

Is building a custom quantizer the right way to go about it? With the customized quantizer, I'll have a dq -> op1 -> op2 -> q pattern, and I can easily match the op1 -> op2 pattern via get_source_partitions. Otherwise, the intervening dq and q nodes make pattern matching of fused ops quite complicated.

If there is a simpler/canonical way to accomplish this, please let me know.

Thank You!

For matching an arbitrary pattern, you can use:

  1. get_source_partitions and friends
  2. quantization_tag, as I mentioned above (please let me know if that's not clear)
  3. Your own subgraph matcher. SubgraphMatcher is a utility in PyTorch. To do this, you write a Python module or callable with the pattern you expect, then quantize and export it. This gives you a graph that you can match and then replace with the pattern you want. See some examples here and let us know if it is not clear: https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/qat_utils.py#L554

Each has its pros and cons.
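A rough sketch of option 3, assuming the pattern is authored as a callable and traced (the names and the relu example are illustrative; in practice you would quantize/export the pattern the same way as the model so the op overloads line up):

```python
import torch
from torch.fx.passes.utils.matcher_utils import SubgraphMatcher

def pattern(x, scale, zp):
    # Example pattern: dq -> relu -> q. Swap in the ops you actually fuse.
    x = torch.ops.quantized_decomposed.dequantize_per_tensor(x, scale, zp, -128, 127, torch.int8)
    x = torch.ops.aten.relu.default(x)
    return torch.ops.quantized_decomposed.quantize_per_tensor(x, scale, zp, -128, 127, torch.int8)

pattern_gm = torch.fx.symbolic_trace(pattern)
matcher = SubgraphMatcher(pattern_gm.graph, ignore_literals=True)
matches = matcher.match(model_gm.graph)  # model_gm: your quantized GraphModule (placeholder)
for m in matches:
    # m.nodes_map maps pattern nodes to the matched nodes in the target graph;
    # this is where you would splice in your fused op.
    ...
```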

Orthogonally, the question I have is: why do you quantize op1 and op2 separately when you have a fused quantized op that is op1->op2? Can you not quantize them together, which would be more accurate?

Jerry-Ge commented 11 months ago

Request to enable a simple way to support fused quantized operators

Not sure if this fits with the existing PT2 quant flow. Can you do such fusion post-partitioning in preprocess()?

Support for running PyTorch models natively in quantized representation could really solve the precision issue between PyTorch and the TOSA reference

And for the precision issue, where do you expect such a PyTorch model to run when comparing against TOSA? A PyTorch eager model would be doing something like this, from a quant-arithmetic point of view, by running qadd on a backend like QNNPACK.

CC @jerryzh168, @kimishpatel

Thanks Digant!

Not sure if this fits with the existing PT2 quant flow. Can you do such fusion post-partitioning in preprocess()?

I think I can. Everything is working right now. My goal with this ticket is to have something simpler that we can use directly :)

And for the precision issue, where do you expect such a PyTorch model to run when comparing against TOSA?

here? https://github.com/pytorch/executorch/blob/9a580315613bdfda9d3d4d16d5ddb969b29fbfad/backends/arm/test/test_tosa.py#L99

Jerry-Ge commented 11 months ago

@Jerry-Ge are you requesting this for partitioning purposes? You can add quantization_tag during annotation in the quantizer. For example, if you add node.meta["quantization_tag"] = "my_q_add" on the fp32 add, then all three nodes in [dq -> fp32 add -> q] will carry that quantization tag.

Thanks Kimish! Not really; my very simple goal is to read the scale and zp info directly when visiting the quant_add node, so I don't need to walk back through the graph.
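For reference, this is the walk-back I would like to avoid; with the current representation the scale/zp live on the surrounding dq/q nodes, so a visitor for the fp32 add has to do something like the sketch below (torch-level quantized_decomposed ops; the edge-dialect version is analogous):

```python
import torch

def get_input_qparams(add_node: torch.fx.Node):
    # With Q->DQ->[ADD_FP32]->Q->DQ, the add node itself carries no qparams;
    # walk back to the dequantize nodes feeding it.
    qparams = []
    for inp in add_node.args[:2]:
        assert isinstance(inp, torch.fx.Node)
        assert inp.target is torch.ops.quantized_decomposed.dequantize_per_tensor.default
        # Schema: dequantize_per_tensor(input, scale, zero_point, quant_min, quant_max, dtype)
        scale, zero_point = inp.args[1], inp.args[2]
        qparams.append((scale, zero_point))
    return qparams
```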

Jerry-Ge commented 11 months ago

Support for running PyTorch models natively in quantized representation could really solve the precision issue between PyTorch and the TOSA reference

For the reference representation, there are some operators that have an integer decomposition, and if those align with the numerics of TOSA, you can generate a quantized model during the "convert" step that uses the reference implementation. https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantize_pt2e.py#L203

Ah, I didn't know about this. Let me try it.

rvijayc commented 11 months ago

For matching an arbitrary pattern, you can use:

  1. get_source_partitions and friends
  2. quantization_tag, as I mentioned above (please let me know if that's not clear)
  3. Your own subgraph matcher. SubgraphMatcher is a utility in PyTorch. To do this, you write a Python module or callable with the pattern you expect, then quantize and export it. This gives you a graph that you can match and then replace with the pattern you want. See some examples here and let us know if it is not clear: https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/qat_utils.py#L554

Each has its pros and cons.

Thanks. I ended up using SubgraphMatcher (using https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/representation/rewrite.py as an example). It worked fine except for the need to rewrite remove_tensor_overload_for_qdq_ops using edge ops (exir_ops.edge.quantized_decomposed.quantize_per_tensor instead of torch.ops.quantized_decomposed.quantize_per_tensor).

The pattern I am matching is: q -> { dq -> op1 -> op2 -> q } -> dq, with the pattern inside {} replaced by my fixed-point custom HW op model.
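In case it helps others, the tweak amounted to swapping the q/dq op targets in the pattern for their edge-dialect overloads, along these lines (a sketch, not the exact code I used):

```python
import torch
from executorch.exir.dialects._ops import ops as exir_ops

# Map torch-level quantized_decomposed ops to their edge-dialect counterparts
# so the pattern matches an edge-dialect graph.
_QDQ_TORCH_TO_EDGE = {
    torch.ops.quantized_decomposed.quantize_per_tensor.default:
        exir_ops.edge.quantized_decomposed.quantize_per_tensor.default,
    torch.ops.quantized_decomposed.dequantize_per_tensor.default:
        exir_ops.edge.quantized_decomposed.dequantize_per_tensor.default,
}

def to_edge_overloads(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in _QDQ_TORCH_TO_EDGE:
            node.target = _QDQ_TORCH_TO_EDGE[node.target]
    gm.recompile()
    return gm
```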

The quantization tag is also clear to me. I'll try this as well to see if this works better.

Orthogonally, the question I have is: why do you quantize op1 and op2 separately when you have a fused quantized op that is op1->op2? Can you not quantize them together, which would be more accurate?

Yes. Originally, I didn't plan to write my own quantizer (I was using XNNPACK earlier). I ended up deciding to write a customized quantizer that correctly matches our custom fusing and this works well.

digantdesai commented 9 months ago

Can we close this?

Jerry-Ge commented 9 months ago

Can we close this?

Thanks Digant for bringing this up. I haven't touched this for a while. Let me go back to it and get back to you later.

So the short conclusion is: let's keep it open for now.

mergennachin commented 5 months ago

@Jerry-Ge Any action item on this issue? If not, can we close?