pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Questions on deploying Quantized models ... #1141

Open rvijayc opened 10 months ago

rvijayc commented 10 months ago

Hi,

This is more of a question than an issue, but I couldn't find documentation or source code examples that address it. We have a backend that only supports fixed point operators, and I am trying to evaluate using executorch to deploy to our platform. I am new to using PyTorch as a deployment platform, so please bear with me if my question is too basic.

When I use PyTorch quantization, I see that it creates a graph in the following format, where each operator is sandwiched between dequant and quant ops:

  ... -> dequant -> opX -> quant -> dequant -> opY -> quant -> ...
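For reference, a minimal sketch of the PT2E flow that produces this kind of graph; the quantizer choice and the capture API (export_for_training below, capture_pre_autograd_graph in older releases) are assumptions and have changed across releases:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = MyModel().eval()                         # hypothetical eager-mode model
example_inputs = (torch.randn(1, 3, 224, 224),)  # hypothetical input shape

# Capture the model as an FX graph (API name varies by torch release).
m = torch.export.export_for_training(model, example_inputs).module()

# Annotate, calibrate, then convert. convert_pt2e emits the graph with each
# op sandwiched between dequantize/quantize nodes as shown above.
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)   # calibration
m = convert_pt2e(m)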

So, when I use executorch partitioning, is the expectation that we pattern match dequant -> opX -> quant for lowering into some fixed point primitive supported on the backend?

Suppose I have a Python model of each fixed point op. Is there any straightforward way I can run the executorch program directly in Python by substituting the Python model for the corresponding lowered module? Since the graph schema is known, it should be possible to do this myself, but I am wondering if someone has already solved this problem.

If I lower the entire graph onto the backend as a single lowered module, I suppose that the memory planning doesn't apply inside the lowered module - i.e., the lowered module needs to take care of memory planning of tensors inside the module?

Finally, is there an example that shows how I can pass already quantized inputs to the executorch program? For example, if I use fixed quantization for inputs and outputs, clients can directly pass quantized inputs and outputs without the need to deal with floating point data. Is this possible with executorch?

Appreciate your help with my questions. This is an impressive platform!

Thanks, Vijay.

kimishpatel commented 10 months ago

So, when I use executorch partitioning, is the expectation that we pattern match dequant -> opX -> quant for lowering into some fixed point primitive supported on the backend?

That is correct. However, there is some WIP to represent quantized ops via integer compute instead of "dq -> op -> q". See here: https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html#convert-the-calibrated-model-to-a-quantized-model
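As a rough illustration of that pattern matching (not the actual partitioner API), a sketch that walks the exported FX graph and collects dq -> op -> q triples, assuming the decomposed q/dq ops that PT2E emits (torch.ops.quantized_decomposed.*) and a hypothetical set of backend-supported ops:

import torch

# Assumed op identities for the q/dq nodes emitted by PT2E.
Q = torch.ops.quantized_decomposed.quantize_per_tensor.default
DQ = torch.ops.quantized_decomposed.dequantize_per_tensor.default

# Hypothetical set of ops the fixed point backend can lower.
SUPPORTED = {torch.ops.aten.convolution.default, torch.ops.aten.linear.default}

def find_quantized_patterns(gm: torch.fx.GraphModule):
    """Yield (dq_inputs, op, q_users) triples that could be fused and lowered."""
    for node in gm.graph.nodes:
        if node.op != "call_function" or node.target not in SUPPORTED:
            continue
        dq_inputs = [a for a in node.args
                     if isinstance(a, torch.fx.Node) and a.target == DQ]
        q_users = [u for u in node.users if u.target == Q]
        if dq_inputs and q_users:
            yield dq_inputs, node, q_users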

Suppose I have a Python model of each fixed point op. Is there any straightforward way I can run the executorch program directly in Python by substituting the Python model for the corresponding lowered module? Since the graph schema is known, it should be possible to do this myself, but I am wondering if someone has already solved this problem.

Are you trying to use the export pipeline to generate an executorch model (.pte file) and run it in a Python environment? If so, yes, but this requires Python bindings, which are being enabled (or may already be). @larryliu0820 I saw you land some diffs for this.
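A hedged sketch of what running a .pte through those bindings could look like; the module path and method names below are assumptions and may not match the final API:

import torch
# Assumed import path for the pybindings being discussed here.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

et_module = _load_for_executorch("model.pte")               # hypothetical .pte file
outputs = et_module.forward([torch.randn(1, 3, 224, 224)])  # list of input tensors
print(outputs[0].shape)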

If I lower the entire graph onto the backend as a single lowered module, I suppose that the memory planning doesn't apply inside the lowered module - i.e., the lowered module needs to take care of memory planning of tensors inside the module?

This is correct. It might be possible to leverage memory planning, just to get an idea of the arena that needs to be allocated and the tensor offsets. You might want to file a feature request if you need this.

Finally, is there an example that shows how I can pass already quantized inputs to the executorch program? For example, if I use fixed quantization for inputs and outputs, clients can directly pass quantized inputs and outputs without the need to deal with floating point data. Is this possible with executorch?

The answer to this is yes. Imagine you have a quantized model like:

def my_model(input):
    q = quantize(input)
    dq = dequantize(q)
    c = conv(dq)
    cq = quantize(c)
    cdq = dequantize(cq)
    return cdq

Now say you delegate the quantized conv to your backend, so you have:

def my_model(input):
    q = quantize(input)
    cq = call_delegate(q)
    cdq = dequantize(cq)
    return cdq

This hasn't been tested, but in theory it should be possible for you to rewrite this graph to remove the q/dq nodes, so that what you have is

def my_model(input_q):
    cq = call_delegate(input_q)
    return cq

Since we are changing dtypes for the input and outputs, two things need to be considered:

  1. You may have to inform the ExportedProgram, the structure that contains the exported graph, that the input dtypes have changed. I am not 100% certain, but this should be possible. I will try to write up an example if I can.
  2. You are losing the quantization information attached to the q/dq nodes, so you need to make sure this is "saved" if you need it.
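An untested sketch of what that boundary rewrite could look like on the exported FX graph, reusing the assumed quantized_decomposed q/dq op identities from above; updating the ExportedProgram's input/output specs (point 1) is not shown:

import torch

Q = torch.ops.quantized_decomposed.quantize_per_tensor.default
DQ = torch.ops.quantized_decomposed.dequantize_per_tensor.default

def strip_boundary_qdq(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    """Drop quantize nodes fed by placeholders and dequantize nodes feeding the
    output, so the program takes and returns already-quantized tensors."""
    for node in list(gm.graph.nodes):
        if node.target == Q and node.args[0].op == "placeholder":
            # node.args[1:] carry scale/zero_point; record them before dropping.
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
        elif node.target == DQ and all(u.op == "output" for u in node.users):
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm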
rvijayc commented 10 months ago

Thank you so much for this information. This is very helpful.

Are you trying to use the export pipeline to generate an executorch model (.pte file) and run it in a Python environment? If so, yes, but this requires Python bindings, which are being enabled (or may already be). @larryliu0820 I saw you land some diffs for this.

Yes. This is one way of doing it, and I think it should work for me (I was probably thinking of running it at an earlier stage of the compilation process, but running a .pte in Python using bindings is good as well).

This is correct. It might be possible to leverage memory planning, just to get an idea of the arena that needs to be allocated and the tensor offsets. You might want to file a feature request if you need this.

I'll try to play around with this to see if there is any way I can take advantage of the existing memory planning code. I'll raise a feature request if necessary.

You are losing the quantization information attached to the q/dq nodes, so you need to make sure this is "saved" if you need it.

Correct. My plan for this would be to use "fixed" quantization (for example, Q15) for the input and output Q/DQ, with the quantization scales and biases implicitly known. This way the entire inference is executed purely using integers.

larryliu0820 commented 10 months ago

Yes, I'm trying to land PR #1006 to add pybind support. Currently running into some errors on macOS; still debugging.

kimishpatel commented 10 months ago

Correct. My plan for this would be to use "fixed" quantization (for example, Q15) for the input and output Q/DQ, with the quantization scales and biases implicitly known. This way the entire inference is executed purely using integers.

That sounds good, although do note that we don't really have a fixed point dtype the way you have specified, so I would like to learn how you would leverage PT2 quantization to achieve your objectives. Maybe it is best to create a PyTorch forums post here https://discuss.pytorch.org/c/executorch/42 for further discussion on fixed point quantization.

@larryliu0820 once you have the PR landed, we can close this.

rvijayc commented 10 months ago

Maybe it is best to create a PyTorch forums post here https://discuss.pytorch.org/c/executorch/42 for further discussion on fixed point quantization.

Yes. I'll start a discussion on this once I have a solidified proposal. The Q-formats are simply special cases of (scale, zero point, dtype) based affine quantization, where the zero point is always 0 and the scale is a power of 2 (i.e., Q8 in affine representation is (scale=2^7, zero-point=0, dtype=int8)). So the existing PyTorch quantization framework will still work, with FixedQParamsQuantizationSpec allowing me to explicitly define quant params for inputs and outputs.

kimishpatel commented 10 months ago

scale is a power of 2

PyTorch quantization framework will still work, with FixedQParamsQuantizationSpec

That's great.

Although, I would like to understand how 2^7 will translate into Q8, or did you mean Q15? For Q15 it makes sense, as the fractional part is really dividing by 2^7?

rvijayc commented 10 months ago

Although, I would like to understand how 2^7 will translate into Q8, or did you mean Q15? For Q15 it makes sense, as the fractional part is really dividing by 2^7?

I think I goofed. What I meant was Q0.7 - where there are 7 fractional bits, 0 integer bits, and 1 sign bit. This corresponds to (2^7, 0, int8). There are different variations of the notation and we have played fast and loose with it - for example, Q7 is sometimes used as shorthand for Q0.7.
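Concretely, the Q0.7 round trip is just the following (a small illustrative sketch; the helper names are made up):

SCALE = 2 ** 7   # Q0.7: 7 fractional bits, 1 sign bit

def quantize_q0_7(x: float) -> int:
    # e.g. 0.5 -> 64, -1.0 -> -128
    return max(-128, min(127, round(x * SCALE)))

def dequantize_q0_7(q: int) -> float:
    # representable range is [-1.0, 127/128]
    return q / SCALE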

The key thing is that we'll standardize the input and output Q-formats for a given model so they are known to the client, and this allows us to use a 100% fixed-point data path where it is the client's responsibility to do the input/output quantization.
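A rough sketch of pinning such a standardized input format during PT2E annotation; the exact fields of FixedQParamsQuantizationSpec and the scale convention shown (real = scale * (q - zero_point), so one Q0.7 step is 2**-7) are assumptions that may differ by release and backend:

import torch
from torch.ao.quantization.quantizer import FixedQParamsQuantizationSpec

FRACTIONAL_BITS = 7  # the standardized Q0.7 input format for this model

# Under real = scale * (q - zero_point), a format with N fractional bits
# uses scale = 2**-N and zero_point = 0.
input_qspec = FixedQParamsQuantizationSpec(
    dtype=torch.int8,
    scale=2.0 ** -FRACTIONAL_BITS,
    zero_point=0,
    quant_min=-128,
    quant_max=127,
    qscheme=torch.per_tensor_affine,
)
# A custom Quantizer would attach this spec when annotating the model's
# input-facing nodes, so the exported q/dq params are fixed and known.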