freddan80 opened 2 months ago
Tagging @kimishpatel, @digantdesai, @robell, @oscarandersson8218
cc @jerryzh168
@jerryzh168 this is what I wanted to sync on yesterday, I will set something up for this.
Since this requires a change of the graph signature, I feel we can have a separate API after convert_pt2e to do this:
model = convert_pt2e(model)
# name TBD
# the annotator will be similar to the quantizer: it can tag specific input
# (placeholder) and output nodes (nodes in output.args) so that the quantize op
# for tagged placeholders and the dequantize op for tagged outputs are removed;
# the graph signature should then be updated accordingly as well
model = align_io_dtype(model, annotator)
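A minimal sketch of the graph surgery such a pass could perform, assuming the pt2e quantized_decomposed ops and a torch.fx GraphModule; align_io_dtype, the annotator interface, and the graph-signature update are all still TBD above, so only the q/dq removal is shown and everything below is illustrative, not a real API:

import torch
from torch.fx import GraphModule, Node

def strip_io_qdq(gm: GraphModule) -> GraphModule:
    # Assumed op targets from the pt2e/quantized_decomposed workflow.
    q = torch.ops.quantized_decomposed.quantize_per_tensor.default
    dq = torch.ops.quantized_decomposed.dequantize_per_tensor.default
    for node in list(gm.graph.nodes):
        # placeholder -> quantize: drop the quantize and treat the placeholder
        # as already being int8.
        if (node.op == "call_function" and node.target is q
                and isinstance(node.args[0], Node)
                and node.args[0].op == "placeholder"):
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
        # dequantize -> output: return the int8 tensor directly (assumes the
        # usual (tuple,) output convention of exported graphs).
        elif node.op == "output":
            outs = tuple(
                a.args[0] if isinstance(a, Node) and a.target is dq else a
                for a in node.args[0]
            )
            node.args = (outs,)
    gm.graph.eliminate_dead_code()  # removes the now-unused dequantize nodes
    gm.recompile()
    return gm

As the comment above points out, the ExportedProgram's graph signature and input/output dtype metadata would also have to be updated; that part is omitted here.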
As a note, after starting to measure things it seems the initial quant step takes a lot of cycles. Here is my first breakdown of the time spent when running MobileNetV2 on a target.
CPU cycles for inference, excluding running/waiting on the NPU, were roughly 21,200,000. This is a first run, i.e. not pre-run to warm up the caches; warm-cache runs might be added in future measurements.
To break this down, some logs were added to box in the numbers, and we see the following:
I.e., of the extra ~21,200,000 cycles not spent in the NPU, it seems like:
NOTE: this is just a quick-and-dirty check and can be somewhat/totally wrong. For example, the extra log prints will skew the numbers a bit; I have tried not to count them, but this could be done better if needed.
I'm trying to get the devtools (formerly known as the SDK) going right now to do a better kind of measurement. I just need to figure out the flatcc build side of things :) and find a good way to transfer the buffer off the target. The plan is to just base64 encode it and print it.
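For the buffer transfer, a host-side sketch of that base64 idea (the marker strings and log format below are made up; the target would print the devtools buffer between them over UART/semihosting):

import base64
import re
import sys

def extract_buffer(log_path: str, out_path: str) -> None:
    # Pull everything between the (hypothetical) markers and strip the
    # whitespace/line breaks the console may have inserted.
    text = open(log_path, "r", errors="replace").read()
    match = re.search(r"---BUF-BEGIN---(.*?)---BUF-END---", text, re.S)
    if match is None:
        sys.exit("no base64-encoded buffer found in the log")
    payload = base64.b64decode("".join(match.group(1).split()))
    with open(out_path, "wb") as f:
        f.write(payload)

# extract_buffer("uart.log", "devtools_buffer.bin")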
Thanks. I got an ETA for getting the quant IO pass into OSS, which is EoY; I can see if we can accelerate that to alleviate these issues.
Also, I'm not sure how to get around the permute issues, unless we can permute on the NPU.
(1) Can we isolate cold-cache effects by doing a warmup run? (2) Why is the q (before the model) so expensive compared to the dq at the end? Can the tensor size explain the delta?
Thx for looking into this!
(1) The CPU/NPU time ratio for our "comparison runtime" is ~0.001, while it's ~3 here*. So even with a warm cache, I think the difference would be large. But we can give it a go. (2) Yeah, the input tensor size is 224x224x3 while the output is merely 1000 values, I think. (A lot smaller anyway.)
*Note: NPU time is in the same ballpark for both.
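Back-of-envelope check on (2), using the sizes mentioned above; the input quantize touches roughly 150x more elements than the output dequantize, which goes a long way towards explaining the delta:

input_elems = 224 * 224 * 3   # 150,528 values to quantize before the model
output_elems = 1000           # logits to dequantize after the model
print(input_elems / output_elems)  # ~150x more per-element work on the input side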
🚀 The feature, motivation and pitch
Background: fp32 arithmetic is typically avoided in the embedded (microcontroller) domain due to tight cycle and memory constraints; hence, sensors usually produce integer data. Therefore, the input/output of an int8-quantized-NN should ideally be of integer dtype (int8) in order to save cycles and memory.
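To make the memory part concrete (using the 224x224x3 MobileNetV2 input discussed in the comments above), the fp32 input buffer alone is 4x the size of its int8 counterpart:

fp32_input_bytes = 224 * 224 * 3 * 4   # 602,112 bytes (~588 KiB)
int8_input_bytes = 224 * 224 * 3 * 1   # 150,528 bytes (~147 KiB)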
Current behavior: Input/output is always fp32. Example:
Notes:
• In this example, “accelerated subgraph” is a node (subgraph) delegated to e.g. an NPU such as Ethos-U.
• For the Arm TOSA delegate, we have implemented a workaround (https://github.com/pytorch/executorch/pull/3056) that tags the q/dq nodes directly connected to the input/output so that the delegate does not consume those nodes (a rough illustration of the tagging idea is sketched below). Hence…
• …the q and dq nodes above are executed on the CPU, which costs memory and cycles.
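A rough illustration of that tagging idea (not the actual code in pytorch/executorch#3056; the op targets and the tag name are assumptions): mark q/dq nodes that touch a model input or output so the partitioner leaves them on the CPU instead of handing them to the delegate.

import torch
from torch.fx import GraphModule, Node

def tag_io_qdq(gm: GraphModule) -> None:
    q = torch.ops.quantized_decomposed.quantize_per_tensor.default
    dq = torch.ops.quantized_decomposed.dequantize_per_tensor.default
    output_node = next(n for n in gm.graph.nodes if n.op == "output")
    output_srcs = {a for a in output_node.args[0] if isinstance(a, Node)}
    for node in gm.graph.nodes:
        if node.op != "call_function" or node.target not in (q, dq):
            continue
        touches_input = any(isinstance(a, Node) and a.op == "placeholder"
                            for a in node.args)
        if touches_input or node in output_srcs:
            node.meta["do_not_delegate"] = True  # made-up tag, checked by the partitioner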
Desired behavior: Ideally, we’d like a mechanism to change the graph signature such that the int8-quantized-NN takes int8 input:
How, where and when to do that in a way that works well with the rest of the framework is unclear.
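For illustration, the caller-side difference (the module handle, call, and quantization parameters below are placeholders, not a real ExecuTorch API):

import torch

out_scale, out_zp = 0.05, 0  # assumed output quantization parameters
sensor_frame = torch.randint(-128, 128, (1, 3, 224, 224), dtype=torch.int8)

# Today: the graph signature is fp32, so the int8 sensor data is converted up
# front and a quantize node immediately turns it back into int8 on the CPU.
fp32_input = sensor_frame.to(torch.float32)
# logits = module(fp32_input)

# Desired: the graph signature is int8, the frame is fed as-is, and the small
# output is dequantized on the host only if fp32 values are actually needed.
# int8_logits = module(sensor_frame)
# logits = (int8_logits.to(torch.float32) - out_zp) * out_scale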