mit-han-lab / mcunet

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
https://mcunet.mit.edu
MIT License

MCUNet-v2 analytical PMU computation for MbV2 on ImageNet r224 #9

Closed: eliberis closed this issue 2 years ago

eliberis commented 2 years ago

Hi, thanks for publishing MCUNet-v2, it is very interesting work!

I had a question about how you computed the 172kB per-patch peak SRAM usage for the MbV2 model in Table 1 here: https://arxiv.org/pdf/2110.15352.pdf

If I understood correctly, you are considering 4 patches along each of the X and Y dimensions, i.e. a total of 4x4 = 16 patches of size 75 x 75 x 3, read from the 224 x 224 x 3 input. Per-patch execution then ends with a patch of size 7 x 7 x 32, which is written into the final 28 x 28 x 32 tensor.

I have made my own calculations (happy to share them in a follow-up comment), which give a higher peak memory usage. I suspect the mismatch is because the full input image is not included in the numbers reported in the paper. Could you confirm whether you include the full image tensor (224 x 224 x 3 = 147 kB @ int8) in your reported peak memory usage?
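For concreteness, here is a rough sketch of the tensor sizes involved, assuming int8 activations (1 byte per element); it only covers the tensors mentioned above, not every intermediate buffer:

```python
# Back-of-the-envelope tensor sizes for the MbV2 patch-based stage (int8, 1 B/elem).

def kb(n_bytes):
    return n_bytes / 1024

full_input   = 224 * 224 * 3   # full r224 RGB input                       -> ~147.0 kB
patch_input  = 75 * 75 * 3     # one of the 4x4 overlapping patches        -> ~16.5 kB
patch_output = 7 * 7 * 32      # per-patch output of the patch stage       -> ~1.5 kB
stage_output = 28 * 28 * 32    # full tensor the patch outputs are written into -> ~24.5 kB

for name, size in [("full input", full_input), ("patch input", patch_input),
                   ("patch output", patch_output), ("stage output", stage_output)]:
    print(f"{name:13s}: {kb(size):6.1f} kB")
```

The full input tensor alone is already most of the reported 172 kB, which is why I suspect it is not counted.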

You mention that the input doesn't have to be fully stored because it can be partially decoded from JPEG. If the reported numbers assume this, could you share the additional memory needed for the microcontroller to receive and store the JPEG-compressed input instead, and the extra latency/MACs required to decode it?

Thanks for your help!

tonylins commented 2 years ago

Hi Edgar,

Thanks for your interest in our work! We do not include the full JPEG size in the memory calculation, since it can be partially decoded. The JPEG file from the camera is usually small, around 3-4 kB. For the decoding cost, we may need to spend some time digging into the experiment logs to find it.

Best, Ji

eliberis commented 2 years ago

Thanks for the quick reply, Ji!

eliberis commented 2 years ago

Hi Ji,

Just wanted to follow up on a similar topic. Table 1 also shows the overheads in MACs, and I'm not sure the numbers are internally consistent:

[Screenshot of Table 1 from the paper showing the MACs and overhead numbers.]

From this, there are two ways to calculate, for example, the number of MACs required for the remaining (non-patch-based) layers. To my understanding, this number should be the same in normal (per-layer) and patch-based execution.

(1): MACs for the remaining network = 330M - 130M = 200M

(2a): The patch-based stage is 130M, which already includes the reported 42% overhead, so those layers were originally ~91.5M (91.5M × 1.42 ≈ 130M).

(2b): The entire network is 330M, which includes the reported 10% overhead, so the entire network was originally 300M.

(2c): From (2a) and (2b), MACs for the remaining network should be 300M - 91.5M = 208.5M.

So (1) and (2c) give different numbers. Would you mind checking if there's an issue in my calculation?
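To make the comparison concrete, here is a minimal sketch of the arithmetic (values in millions of MACs, taken from the table above):

```python
# Two ways of deriving the MACs of the non-patch-based layers from the reported numbers.

total_patch = 330.0   # whole network, patch-based execution (+10% overhead reported)
stage_patch = 130.0   # patch-based stage, patch-based execution (+42% overhead reported)

# (1) subtract the patch-based stage directly
rest_v1 = total_patch - stage_patch        # 200M

# (2) undo the reported overheads first, then subtract
stage_orig = stage_patch / 1.42            # ~91.5M
total_orig = total_patch / 1.10            # 300M
rest_v2 = total_orig - stage_orig          # ~208.5M

print(rest_v1, round(rest_v2, 1))          # 200.0 vs ~208.5 -> they disagree
```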

Thanks!

tonylins commented 2 years ago

Hi Edgar, thanks for pointing out the issue. Let me double-check and get back to you.

eliberis commented 2 years ago

Hi Ji,

Have you had a chance to look into the above?

Thanks! Edgar

tonylins commented 2 years ago

Hi @eliberis, sorry for the delay. I re-ran the profiling, and it seems I made some mistakes during the measurement, so the overhead for the patch-based part was not calculated correctly (I believe I used the 300M FLOPs figure for MbV2 from the paper, but it should actually be 307M). The profiling results should be:

|  | Patch-based | Rest of model | Overall |
|---|---|---|---|
| per-layer | 95.5M | 212.0M | 307.5M |
| patch-based | 125.2M | 212.0M | 337.2M |
| overhead | 31% | 0% | 10% |

|  | Patch-based | Rest of model | Overall |
|---|---|---|---|
| per-layer | 64.7M | 238.9M | 303.6M |
| patch-based | 73.6M | 238.9M | 312.5M |
| overhead | 14% | 0% | 3% |
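For reference, a quick sanity check of the overhead percentages implied by these numbers (a minimal sketch, rounded to whole percent):

```python
# Overhead = (patch-based MACs / per-layer MACs - 1), per stage and overall.

def overhead_pct(patch_based, per_layer):
    return round((patch_based / per_layer - 1) * 100)

# First table above
print(overhead_pct(125.2, 95.5), overhead_pct(337.2, 307.5))   # 31 10

# Second table above
print(overhead_pct(73.6, 64.7), overhead_pct(312.5, 303.6))    # 14 3
```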

We will update the numbers in the next arXiv update. Let me know if there is any confusion.

eliberis commented 2 years ago

That looks correct to me, thanks!