pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Add support for int8 input/output #4729

Open freddan80 opened 2 months ago

freddan80 commented 2 months ago

🚀 The feature, motivation and pitch

Background: Fp32 arithmetic is typically avoided in the embedded (microcontroller) domain due to tight cycle and memory constraints, and sensors usually produce integer data anyway. Therefore, the input/output of an int8-quantized NN should ideally be of an integer dtype (int8) in order to save cycles and memory.

Current behavior: Input/output is always fp32. Example:

       fp32   int8                      int8    fp32
input  --  q  --  accelerated subgraph  --  dq  --  output 

Notes:
• In this example, “accelerated subgraph” is a node (subgraph) delegated to e.g. an NPU such as Ethos-U.
• For the Arm TOSA delegate, we have implemented a workaround (https://github.com/pytorch/executorch/pull/3056) that tags the q/dq nodes directly connected to the input/output so that the delegate does not consume those nodes. Hence…
• …the q and dq nodes above are executed on the CPU, which costs memory and cycles. A rough illustration of what those CPU-side ops do follows below.
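For reference, here is a minimal sketch of the per-element work those CPU-side q/dq nodes do, expressed with the decomposed quant ops that convert_pt2e emits. The scale/zero-point values and tensor shape are illustrative only, not taken from a real model:

import torch
# Importing this (private) module registers the quantized_decomposed ops; in a
# full pt2e flow this happens as a side effect of the quantization imports.
from torch.ao.quantization.fx import _decomposed  # noqa: F401

# fp32 "sensor-style" input; scale/zero_point are made up for the example
x = torch.randn(1, 3, 224, 224)
scale, zero_point = 0.02, 0

# Roughly what the q node at the graph input does on the CPU today
x_q = torch.ops.quantized_decomposed.quantize_per_tensor(
    x, scale, zero_point, -128, 127, torch.int8
)

# ...and what the dq node at the graph output does
x_dq = torch.ops.quantized_decomposed.dequantize_per_tensor(
    x_q, scale, zero_point, -128, 127, torch.int8
)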

Desired behavior: Ideally, we’d like a mechanism to change the graph signature such that the int8-quantized-NN takes int8 input:

      int8                      int8  
input --  accelerated subgraph  --  output 

How, where and when to do that in a way that works well with the rest of the framework is unclear.

Alternatives

No response

Additional context

No response

RFC (Optional)

No response

freddan80 commented 2 months ago

Tagging @kimishpatel , @digantdesai, @robell, @oscarandersson8218

kimishpatel commented 2 months ago

cc @jerryzh168

digantdesai commented 2 months ago

@jerryzh168 this is what I wanted to sync on yesterday, I will set something up for this.

jerryzh168 commented 2 months ago

Since it requires a change of the graph signature, I feel we can have a separate API after convert_pt2e to do this.

model = convert_pt2e(model)

# name TBD
# The annotator will be similar to a quantizer: it can tag specific input
# (placeholder) nodes and output nodes (nodes in output.args) so that the
# quantize op for a placeholder and the dequantize op for an output are
# removed. The graph signature should then be updated accordingly as well.
model = align_io_dtype(model, annotator)
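
To make the proposal concrete, here is a minimal sketch of what such a pass could look like at the torch.fx level. This is not an ExecuTorch API: strip_io_qdq is a made-up name, it ignores the annotator, and it skips the graph-signature and dtype-metadata updates a real implementation would need. It assumes the graph has already gone through convert_pt2e and uses the decomposed quant ops.

import torch
from torch.fx import GraphModule
from torch.ao.quantization.fx import _decomposed  # noqa: F401 (registers quantized_decomposed ops)

Q = torch.ops.quantized_decomposed.quantize_per_tensor.default
DQ = torch.ops.quantized_decomposed.dequantize_per_tensor.default

def strip_io_qdq(gm: GraphModule) -> GraphModule:
    graph = gm.graph

    # Inputs: placeholder -> q -> ... becomes placeholder (now expected to be int8) -> ...
    for node in list(graph.nodes):
        if node.op != "placeholder":
            continue
        for user in list(node.users):
            if user.op == "call_function" and user.target == Q:
                user.replace_all_uses_with(node)
                graph.erase_node(user)

    # Outputs: ... -> dq -> output becomes ... -> output (now int8).
    # In an exported graph, output.args[0] is a tuple/list of output values.
    output_node = next(n for n in graph.nodes if n.op == "output")
    new_outputs = []
    for out in output_node.args[0]:
        if isinstance(out, torch.fx.Node) and out.op == "call_function" and out.target == DQ:
            new_outputs.append(out.args[0])
        else:
            new_outputs.append(out)
    output_node.args = (tuple(new_outputs),)

    graph.eliminate_dead_code()
    gm.recompile()
    return gm
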
zingo commented 2 weeks ago

As a note, after starting to measure things, it seems the initial quant step takes a lot of cycles. Here is my first breakdown of the time spent when running MobileNet V2 on a target.

CPU cycles for inference, excluding running/waiting on the NPU, were about 21,200,000. This was a first run, i.e. not preceded by a warm-up run to prime the caches; warm-up might be added in future measurements.

To break this down, some logging was added to box in the numbers. Of the extra ~21,200,000 cycles not spent on the NPU, the split looks roughly like this:

NOTE: this was just a quick-and-dirty check and could be somewhat (or totally) wrong. For example, the extra log prints skew the numbers a bit; I have tried not to count them, but this could be done more carefully if needed.

I'm trying to get the devtools (formerly known as the SDK) going right now to do a better kind of measurement. I just need to fight/figure out the flatcc part of the build :) and find a good way to transfer the buffer out of the target. The plan is to just base64-encode it and print it.
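
For what it's worth, the host side of that plan could be as simple as the following sketch. The log marker and output file name are assumptions for illustration, not an ExecuTorch convention: grab the base64 line from the captured console/UART log and decode it back into a binary buffer for the devtools.

import base64

# Hypothetical marker printed by the target around the base64-encoded buffer;
# adjust to whatever the firmware actually prints.
MARKER = "ETDUMP_B64:"

def extract_buffer(log_path: str, out_path: str) -> None:
    with open(log_path) as f:
        for line in f:
            if line.startswith(MARKER):
                payload = line[len(MARKER):].strip()
                with open(out_path, "wb") as out:
                    out.write(base64.b64decode(payload))
                return
    raise RuntimeError("no base64 buffer found in log")

extract_buffer("uart_capture.log", "model.etdump")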

digantdesai commented 1 week ago

Thanks. I got an ETA for open-sourcing the quant IO pass, which is EoY; I can see if we can accelerate that to alleviate these issues.

Also, I'm not sure how to get around the permute issues, unless we can permute on the NPU.

(1) Can we isolate cold-cache effects by doing a warm-up run? (2) Why is q (before the model) so much more expensive than dq at the end? Can the tensor sizes explain the delta?

freddan80 commented 5 days ago

Thx for looking into this!

(1) The CPU/NPU time ratio for our "comparison runtime" is ~0.001, while it's ~3 here*. So even with a warm cache, I think the difference would be large. But we can give it a go. (2) Yeah, the input tensor is 224x224x3 and the output is merely 1000 elements, I think (a lot smaller, anyway). See the quick arithmetic after the note below.

*Note: NPU time is in the same ballpark for both.
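
A quick back-of-the-envelope check of point (2), using the sizes quoted above (the 1000-element output is an assumption based on a standard MobileNet V2 classifier head):

# Element counts for the tensors the CPU-side q/dq nodes touch
input_elems = 224 * 224 * 3   # 150,528 fp32 values to quantize at the input
output_elems = 1000           # logits to dequantize at the output
print(input_elems / output_elems)  # ~150x more elements on the input side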