mit-han-lab / tinyengine

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory
https://mcunet.mit.edu
MIT License

TinyEngine convolutional layer has greater latency than ARM's CMSIS-NN #71

Open · ellial opened this issue 1 year ago

ellial commented 1 year ago

Hello,

I was measuring the latency of one of TinyEngine's convolutional kernels (convolve_s8_kernel3_stride1_pad1) against CMSIS-NN's fast convolution kernel (arm_convolve_HWC_q7_fast). The TinyEngine kernel took approximately 200,000 cycles, while the CMSIS-NN kernel took approximately 130,000 cycles. Why is the TinyEngine kernel slower here?

Thank you in advance.

meenchen commented 1 year ago

Hi @ellial,

convolve_s8_kernel3_stride1_pad1 is a deprecated kernel that is no longer actively used in TinyEngine. For 3x3 convolutions, we use https://github.com/mit-han-lab/tinyengine/blob/main/TinyEngine/src/kernels/int_forward_op/convolve_u8_kernel3_inputch3_stride2_pad1.c instead. Please also note that for MobileNet-like models, most of the computation goes into the pointwise and depthwise convolutions.