mit-han-lab / tinyengine

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory
https://mcunet.mit.edu
MIT License

TinyEngine convolutional layer has greater latency than ARM's CMSIS-NN #71

Open · ellial opened this issue 1 year ago

ellial commented 1 year ago

Hello,

I was measuring the latency of one of TinyEngine's convolutional kernels (convolve_s8_kernel3_stride1_pad1) against CMSIS-NN's fast convolution kernel (arm_convolve_HWC_q7_fast). The TinyEngine kernel took approximately 200,000 cycles, while the CMSIS-NN kernel took approximately 130,000 cycles. Why is the TinyEngine kernel slower here?

Thank you in advance.

meenchen commented 1 year ago

Hi @ellial,

convolve_s8_kernel3_stride1_pad1 is a deprecated kernel that is no longer actively used in TinyEngine. For 3x3 convolutions, we use https://github.com/mit-han-lab/tinyengine/blob/main/TinyEngine/src/kernels/int_forward_op/convolve_u8_kernel3_inputch3_stride2_pad1.c instead. Please also note that for MobileNet-like models, most of the computation goes into the pointwise and depthwise convolutions.