This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for microcontrollers. TinyEngine is part of MCUNet, which also includes TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. TinyEngine and TinyNAS are co-designed to fit tight memory budgets.
The MCUNet and TinyNAS repo is here.
If you are interested in getting updates, please sign up here to get notified!
Microcontrollers are low-cost, low-power hardware. They are widely deployed across many applications, but their tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.
MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. It consists of TinyNAS and TinyEngine. They are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.
Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing memory usage and accelerating the inference. It outperforms existing inference libraries such as TF-Lite Micro from Google, CMSIS-NN from Arm, and X-CUBE-AI from STMicroelectronics.
TinyEngine adopts the following optimization techniques to accelerate inference speed and minimize memory footprint.
By adopting the abovementioned optimization techniques, TinyEngine can not only enhance inference speed but also reduce peak memory, as shown in the figures below.
MAC/s improvement breakdown:
Peak memory reduction:
To sum up, our TinyEngine inference engine can serve as useful infrastructure for MCU-based AI applications. It significantly improves inference speed and reduces memory usage compared to existing libraries such as TF-Lite Micro, CMSIS-NN, and X-CUBE-AI. It improves the inference speed by 1.1-18.6x, and reduces the peak memory by 1.3-3.6x.
Save Memory with Patch-based Inference: We can dramatically reduce the inference peak memory by using patch-based inference for the memory-intensive stage of CNNs.
For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x.
With patch-based inference, TinyEngine achieves higher accuracy at the same memory budget.
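To make the memory argument concrete, here is a minimal accounting sketch. It is not TinyEngine's implementation. It assumes int8 activations (one byte per element), uses a hypothetical network whose tensor shapes are made up for illustration, counts only the current patch's input slice and output slice as live inside the patch-based stage, and ignores the halo overlap that real patches need at their borders. The names `tensor_bytes`, `per_layer_peak`, and `patch_based_peak` are illustrative only.

```python
def tensor_bytes(h, w, c):
    """Bytes needed for an int8 activation tensor of shape (h, w, c)."""
    return h * w * c


def per_layer_peak(shapes):
    """Peak memory when every layer holds its full input and full output."""
    return max(tensor_bytes(*a) + tensor_bytes(*b)
               for a, b in zip(shapes, shapes[1:]))


def patch_based_peak(shapes, n_patch_layers, patches_per_side):
    """Peak memory when the first n_patch_layers run one spatial patch at a time.

    Simplifying assumption: halo overlap between neighboring patches is ignored,
    so each patch holds exactly 1 / patches_per_side**2 of a tensor.
    """
    p2 = patches_per_side ** 2
    # The stage output is accumulated in full, patch by patch.
    stage_out = tensor_bytes(*shapes[n_patch_layers])
    stage_peak = 0
    for i in range(n_patch_layers):
        patch_in = tensor_bytes(*shapes[i]) // p2
        is_last = i == n_patch_layers - 1
        # The last layer of the stage writes straight into the stage output.
        patch_out = 0 if is_last else tensor_bytes(*shapes[i + 1]) // p2
        stage_peak = max(stage_peak, stage_out + patch_in + patch_out)
    # The remaining layers run in the usual per-layer fashion.
    rest = shapes[n_patch_layers:]
    rest_peak = per_layer_peak(rest) if len(rest) > 1 else 0
    return max(stage_peak, rest_peak)


# Hypothetical activation shapes (h, w, c): input followed by four layer outputs.
shapes = [(128, 128, 3), (64, 64, 16), (32, 32, 32), (16, 16, 64), (8, 8, 128)]

baseline = per_layer_peak(shapes)
patched = patch_based_peak(shapes, n_patch_layers=2, patches_per_side=4)
print(baseline, patched)
```

With these toy shapes the peak drops from 114688 to 49152 bytes, because the large early activations are only ever materialized one patch at a time. Real savings depend on the network and on the halo cost that this sketch leaves out.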
code_generator contains a Python library that is used to compile neural networks into low-level source code (C/C++).
TinyEngine contains a C/C++ library that implements operators and performs inference on microcontrollers.
examples contains the examples of transforming TFLite models into our TinyEngine models.
tutorial contains the demo tutorial (of inference and training) of deploying a visual wake words (VWW) model onto microcontrollers.
assets contains misc assets.
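The idea of compiling a network description into C source can be illustrated with a toy generator. This is a sketch of the general technique only: the function `emit_c`, the layer dictionary format, and the C function names `conv2d_int8` and `depthwise_int8` are all hypothetical and are not the API of code_generator or TinyEngine.

```python
def emit_c(layers):
    """Return C source text that calls one (hypothetical) kernel per layer."""
    lines = ["void invoke(void) {"]
    for i, layer in enumerate(layers):
        # Each layer reads from buf<i> and writes to buf<i+1>.
        args = ", ".join(str(layer[k]) for k in ("in_ch", "out_ch", "kernel"))
        lines.append(f"    {layer['op']}(buf{i}, buf{i + 1}, {args});")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical two-layer model description.
model = [
    {"op": "conv2d_int8", "in_ch": 3, "out_ch": 16, "kernel": 3},
    {"op": "depthwise_int8", "in_ch": 16, "out_ch": 16, "kernel": 3},
]

source = emit_c(model)
print(source)
```

Because the layer sequence is known when the source is generated, the emitted C contains straight-line calls with constant arguments and no model interpreter is needed at run time.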
First, clone this repository:
git clone --recursive https://github.com/mit-han-lab/tinyengine.git
(Optional) Using a virtual environment with conda is recommended.
conda create -n tinyengine python=3.6 pip
conda activate tinyengine
Install dependencies:
pip install -r requirements.txt
Install pre-commit hooks to automatically format changes in your code.
pre-commit install
Please see tutorial to learn how to deploy a visual wake words (VWW) model onto microcontrollers by using TinyEngine. The tutorial includes both an inference demo and a training demo.
-Ofast optimization level in STM32CubeIDE.
The latency results:
net_id | TF-Lite Micro @ 713b6ed | CMSIS-NN @ 011bf32 | X-CUBE-AI v7.3.0 | TinyEngine @ 0363956 |
---|---|---|---|---|
# mcunet models (VWW) | ||||
mcunet-vww0 | 587ms | 53ms | 32ms | 27ms |
mcunet-vww1 | 1120ms | 97ms | 57ms | 51ms |
mcunet-vww2 | 5310ms | 478ms | 269ms | 234ms |
# mcunet models (ImageNet) | ||||
mcunet-in0 | 586ms | 51ms | 35ms | 25ms |
mcunet-in1 | 1227ms | 103ms | 63ms | 56ms |
mcunet-in2 | 6463ms | 642ms | 351ms | 280ms |
mcunet-in3 | 7821ms | 770ms | 414ms | 336ms |
mcunet-in4 | OOM | OOM | 516ms | 463ms |
# baseline models | ||||
proxyless-w0.3-r64 | 512ms | 54ms | 35ms | 23ms |
proxyless-w0.3-r176 | 3801ms | 380ms | 205ms | 176ms |
mbv2-w0.3-r64 | 467ms | 43ms | 29ms | 23ms |
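The speedup implied by any row of the latency table can be recomputed with simple arithmetic. The snippet below is an illustrative calculation that uses the mcunet-vww0 row from the table above; the helper name `speedup` is made up for this example and is not part of any library.

```python
def speedup(baseline_ms, tinyengine_ms):
    """How many times faster TinyEngine is than a baseline library."""
    return baseline_ms / tinyengine_ms


# Latencies in milliseconds for mcunet-vww0, copied from the table.
vww0_baselines = {"TF-Lite Micro": 587, "CMSIS-NN": 53, "X-CUBE-AI": 32}
vww0_tinyengine = 27

ratios = {name: round(speedup(ms, vww0_tinyengine), 1)
          for name, ms in vww0_baselines.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The same calculation applies to the peak memory and Flash tables by substituting kB values for the latencies.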
The peak memory (SRAM) results:
net_id | TF-Lite Micro @ 713b6ed | CMSIS-NN @ 011bf32 | X-CUBE-AI v7.3.0 | TinyEngine @ 0363956 |
---|---|---|---|---|
# mcunet models (VWW) | ||||
mcunet-vww0 | 163kB | 163kB | 88kB | 59kB |
mcunet-vww1 | 220kB | 220kB | 113kB | 92kB |
mcunet-vww2 | 385kB | 390kB | 201kB | 174kB |
# mcunet models (ImageNet) | ||||
mcunet-in0 | 161kB | 161kB | 69kB | 49kB |
mcunet-in1 | 219kB | 219kB | 106kB | 96kB |
mcunet-in2 | 460kB | 469kB | 238kB | 215kB |
mcunet-in3 | 493kB | 493kB | 243kB | 260kB |
mcunet-in4 | OOM | OOM | 342kB | 416kB |
# baseline models | ||||
proxyless-w0.3-r64 | 128kB | 136kB | 97kB | 35kB |
proxyless-w0.3-r176 | 453kB | 453kB | 221kB | 259kB |
mbv2-w0.3-r64 | 173kB | 173kB | 88kB | 61kB |
The Flash memory usage results:
net_id | TF-Lite Micro @ 713b6ed | CMSIS-NN @ 011bf32 | X-CUBE-AI v7.3.0 | TinyEngine @ 0363956 |
---|---|---|---|---|
# mcunet models (VWW) | ||||
mcunet-vww0 | 627kB | 646kB | 463kB | 453kB |
mcunet-vww1 | 718kB | 736kB | 534kB | 521kB |
mcunet-vww2 | 1016kB | 1034kB | 774kB | 741kB |
# mcunet models (ImageNet) | ||||
mcunet-in0 | 1072kB | 1090kB | 856kB | 842kB |
mcunet-in1 | 937kB | 956kB | 737kB | 727kB |
mcunet-in2 | 1084kB | 1102kB | 849kB | 830kB |
mcunet-in3 | 1091kB | 1106kB | 867kB | 835kB |
mcunet-in4 | OOM | OOM | 1843kB | 1825kB |
# baseline models | ||||
proxyless-w0.3-r64 | 1065kB | 1084kB | 865kB | 777kB |
proxyless-w0.3-r176 | 1065kB | 1084kB | 865kB | 779kB |
mbv2-w0.3-r64 | 940kB | 959kB | 768kB | 690kB |
If you find the project helpful, please consider citing our paper:
@article{
lin2020mcunet,
title={MCUNet: Tiny Deep Learning on IoT Devices},
author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
@inproceedings{
lin2021mcunetv2,
title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
year={2021}
}
@inproceedings{
lin2022ondevice,
title={On-Device Training Under 256KB Memory},
author={Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
year={2022}
}
MCUNet: Tiny Deep Learning on IoT Devices (NeurIPS'20)
MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning (NeurIPS'21)
MCUNetV3: On-Device Training Under 256KB Memory (NeurIPS'22)