Hi Edgar,
Thanks for your interest in our work! We do not include the full JPEG size in the memory calculation since the image can be partially decoded. The JPEG file from the camera is usually small, around 3-4 kB. As for the decoding cost, we may need to spend some time digging into the experiment logs to find it.
Best, Ji
Thanks for the quick reply, Ji!
Hi Ji,
Just wanted to follow up on a similar topic. Table 1 also reports overheads in MACs, and I'm not sure the numbers are internally consistent:
From the table, there are two ways to calculate, for example, the number of MACs required for the remaining (not patch-executed) layers. To my understanding, this count should be the same under normal and patch-based execution.
(1) MACs for the remaining network = 330M - 130M = 200M.
(2a) The patch stage is 130M, which already includes the reported 42% overhead, so those layers originally cost about 91.5M (130M / 1.42 ≈ 91.5M).
(2b) The entire network is 330M, which includes the reported 10% overhead, so the entire network was originally 300M.
(2c) From (2a) and (2b), MACs for the remaining network = 300M - 91.5M = 208.5M.
So (1) and (2c) give different numbers. Would you mind checking if there's an issue in my calculation?
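For concreteness, here is that arithmetic as a quick script (the 330M, 130M, 42%, and 10% figures are the Table 1 values as I read them, so treat this as a sketch of my reading rather than the paper's own bookkeeping):

```python
# Two ways to back out the MACs of the non-patch layers from Table 1 (as I read it).
total_patched = 330e6   # whole network, patch-based execution
patch_stage   = 130e6   # patch-based stage, patch-based execution

# (1) subtract the patch stage directly
rest_1 = total_patched - patch_stage      # 200M

# (2) remove the reported overheads first, then subtract
patch_stage_orig = patch_stage / 1.42     # ~91.5M (42% patch-stage overhead)
total_orig       = total_patched / 1.10   # 300M   (10% overall overhead)
rest_2 = total_orig - patch_stage_orig    # ~208.5M

print(f"(1)  {rest_1 / 1e6:.1f}M")        # 200.0M
print(f"(2c) {rest_2 / 1e6:.1f}M")        # 208.5M, which does not match (1)
```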
Thanks!
Hi Edgar, thanks for pointing out the issue. Let me double-check and get back to you.
Hi Ji,
Have you had a chance to look into the above?
Thanks! Edgar
Hi @eliberis, sorry for the delay. I re-ran the profiling, and it seems I made some mistakes during the measurement, so the overhead for the patch-based part was not calculated correctly (I believe I used the 300M FLOPs figure for MbV2 from the paper, but it should actually be 307M). The profiling results should be:
|  | Patch-based | Rest of model | Overall |
|---|---|---|---|
| per-layer | 95.5M | 212.0M | 307.5M |
| patch-based | 125.2M | 212.0M | 337.2M |
| overhead | 31% | 0% | 10% |
|  | Patch-based | Rest of model | Overall |
|---|---|---|---|
| per-layer | 64.7M | 238.9M | 303.6M |
| patch-based | 73.6M | 238.9M | 312.5M |
| overhead | 14% | 0% | 3% |
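For reference, the overhead percentages follow directly from the MAC counts above; a quick sketch (rounding to the nearest percent):

```python
# Recompute the overhead rows of the two tables above from their MAC counts.
def overhead_pct(per_layer_macs: float, patch_based_macs: float) -> float:
    """Relative increase of patch-based over per-layer execution, in percent."""
    return (patch_based_macs / per_layer_macs - 1.0) * 100.0

rows = {
    "first table, patch stage":  (95.5, 125.2),   # -> ~31%
    "first table, overall":      (307.5, 337.2),  # -> ~10%
    "second table, patch stage": (64.7, 73.6),    # -> ~14%
    "second table, overall":     (303.6, 312.5),  # -> ~3%
}
for name, (per_layer, patch_based) in rows.items():
    print(f"{name}: {overhead_pct(per_layer, patch_based):.0f}%")
```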
We will correct the numbers in the next arXiv revision. Let me know if anything is unclear.
That looks correct to me, thanks!
Hi, thanks for publishing MCUNet-v2; it is very interesting work!
I had a question about how you computed the 172kB per-patch peak SRAM usage for the MbV2 model in Table 1 here: https://arxiv.org/pdf/2110.15352.pdf
If I understood correctly, you are considering 4 patches along each of the X and Y dimensions, i.e. a total of 4x4=16 patches of size 75 x 75 x 3, read from the 224 x 224 x 3 input. Per-patch execution then ends with a patch of size 7 x 7 x 32, which is written into the final 28 x 28 x 32 tensor.
I have done my own calculations (happy to share them in a follow-up comment), and they give a higher peak memory usage. I suspect the mismatch is because the full input image may not be included in the numbers reported in the paper. Could you confirm whether you include the full image tensor (224x224x3 = 147 kB @ int8) in your reported peak memory usage?
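For concreteness, here are the int8 tensor sizes for the shapes above (my own rough tally, not necessarily the exact bookkeeping used in the paper):

```python
# int8 tensor sizes for the shapes described above (1 byte per element).
# This is my own back-of-the-envelope accounting, not the paper's bookkeeping.
def size_kb(*shape: int) -> float:
    n = 1
    for dim in shape:
        n *= dim
    return n / 1024.0

print(f"full input   224x224x3 : {size_kb(224, 224, 3):.1f} kB")  # ~147.0 kB
print(f"input patch   75x75x3  : {size_kb(75, 75, 3):.1f} kB")    # ~16.5 kB
print(f"output patch   7x7x32  : {size_kb(7, 7, 32):.1f} kB")     # ~1.5 kB
print(f"output fmap  28x28x32  : {size_kb(28, 28, 32):.1f} kB")   # ~24.5 kB
```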
You mention that the input doesn't have to be fully stored because it can be partially decoded from JPEG. If the reported numbers assume this, could you share the additional memory usage from the microcontroller receiving and storing a JPEG-compressed input instead, as well as the extra latency/MACs required to decode it?
Thanks for your help!