shicai / MobileNet-Caffe

Caffe Implementation of Google's MobileNets (v1 and v2)
BSD 3-Clause "New" or "Revised" License

MobileNet v2 CPU inferencing performance #52

Open matt-ny opened 6 years ago

matt-ny commented 6 years ago

Comparing mobilenet v1 and v2 for inferencing on the CPU, I have observed some surprising numbers:

  1. For v1, my inference time was about 148 ms on average. For v2, the average was 185 ms, i.e. about 25% slower.

  2. The max_rss memory usage of the process increased by about 160 MB for each copy of MobileNet v1 loaded in Caffe, measured after initializing the Net and running one forward pass. For v2, the increase was about 300 MB per copy.

I am using BVLC Caffe with Intel MKL, doing both measurements on the same system (Intel Xeon CPU E5-2658 v2 @ 2.40GHz) at the same time, and discarding the first few timings of each run to "warm up" any caching.
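
Roughly, the measurement looks like the sketch below (pycaffe; the model paths, input, and iteration counts are placeholders rather than my exact harness):

```python
import time
import resource
import numpy as np
import caffe

caffe.set_mode_cpu()
# Placeholder file names -- substitute the v1 or v2 deploy prototxt / weights.
net = caffe.Net('mobilenet_v2_deploy.prototxt', 'mobilenet_v2.caffemodel', caffe.TEST)
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)

# Warm up, then time the steady state.
for _ in range(5):
    net.forward()
times = []
for _ in range(50):
    t0 = time.time()
    net.forward()
    times.append((time.time() - t0) * 1000.0)
print('mean forward: %.1f ms' % (sum(times) / len(times)))

# ru_maxrss is reported in KB on Linux.
print('max_rss: %.1f MB' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0))
```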

From the paper I expected v2's inference time and memory usage to be lower... am I missing something?

yangluoluo commented 6 years ago

Me too. I think this is not the official version.

yangluoluo commented 6 years ago

The group convolution needs to be optimized.

matt-ny commented 6 years ago

> The group convolution needs to be optimized.

I am comparing v1 and v2, both from this repo, and I see that v2 is worse on my CPU in terms of both speed and memory usage. So the extent to which convolutions are optimized by Caffe is constant across the comparison. I do see total MACC counts of 573M for v1 vs 438M for v2, so v2 is doing fewer conv ops.

Perhaps the size of certain blobs is causing many CPU cache misses? This processor has a 25MB cache size.

https://dgschwend.github.io/netscope/#/editor reports a total activation count (in number of floats, not bytes) of about 35M for MobileNet v2 and 20M for v1.

For MobileNet v2, the largest single-layer activation was 3.61M floats:

| # | layer | type | ch_in | dim_in | ch_out | dim_out | MACC | activation |
|---|-------|------|-------|--------|--------|---------|------|------------|
| 17 | conv2_2 | submodule(2) | 16 | 112x112 | 96 | 112x112 | 20.47M | 3.61M |

In v1, by comparison, the largest single-layer activation was 2.41M floats.
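
Back-of-the-envelope, assuming 4-byte float32 activations, those counts translate to working-set sizes like this (a rough sketch of the arithmetic, not a measured profile):

```python
BYTES_PER_FLOAT = 4          # assuming float32 activations
MB = 1024.0 ** 2

totals = {'v1': 20e6, 'v2': 35e6}        # total activation floats, from netscope
largest = {'v1': 2.41e6, 'v2': 3.61e6}   # largest single-layer activation floats

for name in ('v1', 'v2'):
    print('%s: total ~%.0f MB, largest layer ~%.1f MB'
          % (name, totals[name] * BYTES_PER_FLOAT / MB, largest[name] * BYTES_PER_FLOAT / MB))

# v1: total ~76 MB, largest layer ~9.2 MB
# v2: total ~134 MB, largest layer ~13.8 MB
# Both totals are far larger than the 25MB L3, but v2's working set is
# roughly 75% bigger, which would be consistent with more cache misses.
```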

@shicai any thoughts?