soumith opened this issue 9 years ago
For comparison, here's the log of Caffe + OpenBLAS numbers on the same machine (It's the Digits box ;-) ) https://github.com/soumith/convnet-benchmarks/blob/cpu/caffe/output_alexnet.log
More info is in the CPU branch: https://github.com/soumith/convnet-benchmarks/tree/cpu
The alexnet-owt protobuf, with the same architecture I use for the GPU versions is here: https://github.com/soumith/convnet-benchmarks/blob/cpu/caffe/imagenet_winners/alexnet.prototxt
The intel-adapted version is here: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/models/intel_alexnet/alexnet.prototxt
well, assuming i didn't mess up the analysis, and used the right inputs/etc, a runtime of 0.164s on the (non-intel) alexnet-owt prototxt you linked above, for a batch of 128 forward and backward, implies 3.77TF/s.
AFAIK, haswell can do at most 32 FLOPs/cycle/core. for your 6-core cpu @ 3.5 GHz, that would be 672 GF/s peak.
so, i guess that seems pretty fishy overall (i.e. perf ~6X peak). i might suspect benchmarking error, such as accidentally running in GPU mode with who-knows-what backend (i.e. BLAS, cudnn v?, i dunno). it's not clear that intel themselves were claiming perf anything like that in their blog post, but i didn't try to run the #s from their post.
then again, i have no idea what the intel code might be doing (got scared off by the license, so didn't dig into it), but if there are some algorithmic changes and/or anything that means they're not doing the same set of FLOPS, then all bets are off. but of course such improvement might port to GPUs as well. or not; i'd believe there are algorithms that are more suited to CPUs that trade uniformity/complexity for doing less raw FLOPS.
for ref, here's the #s i'm working from:
moskewcz@maaya:~/git_work/boda/run/tr1$ boda cnet_ana --in-model=alexnet_owl --print-ops=1 --in-sz=227 && python ../../pysrc/flops.py --per-layer=1 --backward 1 --num-imgs=128 --runtime=.164
conv1 FWD 18.7GF 182MB --- BACK_GRAD 18.7GF --- BACK_DIFF 18.7GF BACKWARD_BYTES 261MB
conv2/5x5_s1 FWD 61.7GF 104MB --- BACK_GRAD 61.7GF --- BACK_DIFF 61.7GF BACKWARD_BYTES 131MB
conv3/3x3_s1 FWD 33.3GF 60.5MB --- BACK_GRAD 33.3GF --- BACK_DIFF 33.3GF BACKWARD_BYTES 82.4MB
conv4/3x3_s1 FWD 44.4GF 67.8MB --- BACK_GRAD 44.4GF --- BACK_DIFF 44.4GF BACKWARD_BYTES 110MB
conv5/3x3_s1 FWD 29.6GF 53.7MB --- BACK_GRAD 29.6GF --- BACK_DIFF 29.6GF BACKWARD_BYTES 81.8MB
fc6 FWD 13.2GF 214MB --- BACK_GRAD 13.2GF --- BACK_DIFF 13.2GF BACKWARD_BYTES 426MB
fc7 FWD 4.29GF 71.3MB --- BACK_GRAD 4.29GF --- BACK_DIFF 4.29GF BACKWARD_BYTES 141MB
fc8 FWD 1.05GF 19.0MB --- BACK_GRAD 1.05GF --- BACK_DIFF 1.05GF BACKWARD_BYTES 37.5MB
total _inxp time: 0s
-- INPUT: NUM_IMGS=128 --
-- INPUT: RUNTIME=0.164s --
-- INPUT: POWER=200W --
--- FWD TOTALS ---
618GF 3.77TF/s
2.04GB 12.5GB/s AI=303F/B
32.8J 18.8GF/s/W
moskewcz@maaya:~/git_work/boda/run/tr1$
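To make the peak-vs-measured arithmetic above explicit, here is a small Python sketch (the 32 FLOP/cycle/core figure and the 618 GF / 0.164 s totals are taken from this comment and the log above; treat the peak number as an estimate):

```python
# Peak vs. implied throughput for the 6-core i7-5930K numbers above.
cores, ghz, flop_per_cycle = 6, 3.5, 32
peak_gf = cores * ghz * flop_per_cycle       # ~672 GF/s theoretical peak

total_gf, runtime_s = 618.0, 0.164           # FWD TOTALS from the boda log above
implied_gf = total_gf / runtime_s            # ~3770 GF/s

print(f"peak ~{peak_gf:.0f} GF/s, implied ~{implied_gf:.0f} GF/s "
      f"({implied_gf / peak_gf:.1f}x peak)")
```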
@moskewcz 3.77TF/s doesn't hold true if you switch to FFT or Winograd based convolutions.
References: https://en.wikipedia.org/wiki/Convolution_theorem http://arxiv.org/abs/1509.09308
"With these optimizations time to train AlexNet* network on full ILSVRC-2012 dataset to 80% top5 accuracy reduces from 58 days to about 5 days."
The benchmark used dual E5-2699 v3 CPUs, which have 18 cores at 2.3 GHz => 2 x 18 x 32 FLOPs/cycle x 2.3 GHz = 2.65 TFLOP/s
Sounds about right.
TitanX running Nervanagpu probably about 1 day?
I would guess Intel just implemented a more efficient direct convolution for many-core Intel CPUs. I do not see any indication they are using fast algorithms.
So anyway the numbers Intel reported sound plausible, but your numbers don't. :-)
again, if i got my #s right, if we assume 70M images (~65 epochs * 1.1M images/epoch, not sure if that's a good value or not) in 5 days to train alexnet_owl as per the blog post, that implies 783GF/s -- given the peak #s that andravin gave above, that would be ~35% efficiency, which is perhaps pretty impressive but believable. but it'd be good to know the actual # of epochs/images/etc to get a real value, i could easily be off by quite a bit on those guesses. corrections welcome.
mwm
moskewcz@maaya:~/git_work/boda/run/tr1$ boda cnet_ana --in-model=alexnet_owl --print-ops=1 --in-sz=227 && python ../../pysrc/flops.py --per-layer=1 --backward 1 --num-imgs=70000000 --runtime=432000
conv1 FWD 10.2PF 99.5TB --- BACK_GRAD 10.2PF --- BACK_DIFF 10.2PF BACKWARD_BYTES 143TB
conv2/5x5_s1 FWD 33.7PF 56.2TB --- BACK_GRAD 33.7PF --- BACK_DIFF 33.7PF BACKWARD_BYTES 70.2TB
conv3/3x3_s1 FWD 18.2PF 31.6TB --- BACK_GRAD 18.2PF --- BACK_DIFF 18.2PF BACKWARD_BYTES 42.1TB
conv4/3x3_s1 FWD 24.3PF 35.1TB --- BACK_GRAD 24.3PF --- BACK_DIFF 24.3PF BACKWARD_BYTES 56.2TB
conv5/3x3_s1 FWD 16.2PF 28.1TB --- BACK_GRAD 16.2PF --- BACK_DIFF 16.2PF BACKWARD_BYTES 42.1TB
fc6 FWD 7.19PF 4.66TB --- BACK_GRAD 7.19PF --- BACK_DIFF 7.19PF BACKWARD_BYTES 8.17TB
fc7 FWD 2.35PF 2.29TB --- BACK_GRAD 2.35PF --- BACK_DIFF 2.35PF BACKWARD_BYTES 3.44TB
fc8 FWD 573TF 1.43TB --- BACK_GRAD 573TF --- BACK_DIFF 573TF BACKWARD_BYTES 2.57TB
total _inxp time: 0s
-- INPUT: NUM_IMGS=70000000 --
-- INPUT: RUNTIME=432000.0s --
-- INPUT: POWER=200W --
--- FWD TOTALS ---
338PF 783GF/s
627TB 1.45GB/s AI=540F/B
86.4MJ 3.91GF/s/W
moskewcz@maaya:~/git_work/boda/run/tr1$
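Condensing the same estimate into a couple of lines (the epoch/image counts are guesses, as noted; the 338 PF total is from the log just above):

```python
# Implied sustained throughput if the 5-day training claim is taken at face value,
# assuming ~70M images processed (~65 epochs * ~1.1M images/epoch, a guess).
images       = 70e6
seconds      = 5 * 24 * 3600          # 432,000 s
gf_per_image = 338e6 / images         # ~4.83 GF/image fwd+bwd, from the log above

print(f"~{images * gf_per_image / seconds:.0f} GF/s sustained")   # ~780 GF/s
```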
.. and having looked a bit at Caffe's CPU implementation, im2col is single-threaded, and will be a pretty nasty bottleneck in a 36-core system.
@moskewcz your numbers sound plausible to me.. and so Intel's post really points to what a disaster out of the box Caffe performance must be on many-core CPUs.
@andravin @moskewcz thanks. I'm going to investigate a bit into why the numbers are so much fluffier on my machine. For a start, I'll probably kick off an end-to-end training run and see what happens....
sounds like a plan. make sure you fire up nvidia-smi while you're running it ... ;)
@moskewcz I've already verified that it's running on CPU and using intel code-paths, simply by collecting samples from the stack and looking at hotspots.
hmm, well, i was mostly joking and i mostly believe you. however, i'm not sure that what you say precludes the GPU being active. in fact, if, say, the new intel layers were running on the CPU, but all/some conv layers were on the GPU, you'd probably see perf similar to what you reported. and if you look at the CPU usage/stack, it'll be pegged at 100%, and it'll always be inside the intel code if you stop it ...
i'm really just suggesting that, given the fishiness of the #s, some form(s) of sanity checking are in order. in particular, for example, did you compile in CPU only mode? again, i don't really think that's the issue, but if (for example) intel ran/compiled on boxes without GPUs, then maybe something unexpected happens with their code/build on a box that has GPUs.
but i'm not really fixated on the maybe-running-on-GPU idea, there are plenty of other places for errors. batch size issues, shared library wackiness, straight-up user error, etc ...
on a side note, thanks for all your hard work running these benchmarks!
mwm
caffe is getting no access to the GPUs; I disabled them at the driver level. I just fixed the protobuf to force it to do the backward phase (it was conveniently deciding that it didn't need to do the backward pass). That brought the backward times up, and overall it stands at 268ms / mini-batch now. I'm working on training it fully with the imagenet lmdb. Let's see. https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365
The 2-column AlexNet Intel is benchmarking in the announcement (different from the 1-col "One weird trick" AlexNet in Soumith's benchmark) has 1449 MFLOP per image in the forward pass and 2x that in the backward pass, ignoring biases, LRN, activations, pooling, and loss. Taking numbers from Intel's announcement we have:
Forward pass: 1449 MFLOP * 731 images / 1 sec = 1.059 TFLOP/s
Forw+Backw pass: 3 * 1449 MFLOP * 271 images / 1 sec = 1.178 TFLOP/s
which is easily believable (exact max FLOPs on those Intel CPUs to be posted later).
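The same arithmetic, spelled out (the 731 and 271 images/s figures are from Intel's announcement as quoted here; backward is taken as 2x forward):

```python
# TFLOP/s implied by Intel's announced 2-col AlexNet throughput.
mflop_fwd    = 1449        # MFLOP per image, forward pass
img_s_fwd    = 731         # images/s, forward only
img_s_fwdbwd = 271         # images/s, forward + backward

print(f"forward: {mflop_fwd * img_s_fwd / 1e6:.3f} TFLOP/s")            # ~1.059
print(f"fwd+bwd: {3 * mflop_fwd * img_s_fwdbwd / 1e6:.3f} TFLOP/s")     # ~1.178
```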
@soumith>A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms. [...] I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365
The i7-5930K AVX2 clock is lower than the 3.50 GHz base clock. I don't recall the exact value, but it seems to be ~3.2 GHz. It can issue 2 AVX 256-bit (8-wide) SP MADs (= 2 FLOP each) per clock, for a total of 2 * 8 * 2 = 32 FLOP/clock.
32 FLOP/clock * 3.0 GHz * 6 cores = 576 GFLOP/s.
Your numbers at the URL above (output from Intel's Caffe) seem to be per image for conv and per minibatch for fc, and are comfortably below that (except for fc6 backward, which must be an artifact of Caffe timing), so they are totally believable. In fact, there is a lot of room for improvement; they are not that much better than your numbers for OpenBLAS (except for conv1).
layer | MFLOP/image | Intel ms (conv: per image, fc: per minibatch) | Intel GFLOP/s | OpenBLAS GFLOP/s |
---|---|---|---|---|
conv1 forward: | 141 | 0.726 | 194 | 59 |
conv1 backward | 282 | 0.672 | 420 | 67 |
conv2 forward: | 448 | 3.722 | 120 | 159 |
conv2 backward: | 896 | 6.22 | 144 | 167 |
conv3 forward: | 224 | 2.323 | 96 | 112 |
conv3 backward: | 448 | 3.604 | 124 | 148 |
conv4 forward: | 299 | 3.851 | 78 | 90 |
conv4 backward: | 598 | 6.344 | 94 | 116 |
conv5 forward: | 199 | 2.621 | 76 | 90 |
conv5 backward: | 398 | 4.375 | 91 | 119 |
fc6 forward: | 75 | 38.597/mb | 250 | 232 |
fc6 backward: | 151 | 32.152/mb | 601 | 243 |
fc7 forward: | 34 | 18.549/mb | 232 | 231 |
fc7 backward: | 67 | 15.504/mb | 554 | 293 |
fc8 forward: | 8 | 4.967/mb | 211 | 249 |
fc8 backward: | 16 | 3.932/mb | 533 | 278 |
Forward: | 1428 | 90.621 | 104 | 94 |
Backward: | 2856 | 72.961 | 132 | 123 |
Forward-Backward: | 4231 | 1684 | 121 | 112 |
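For reference, the MFLOP/image column can be reproduced from the layer shapes, and the GFLOP/s columns follow if the conv timings are per image and the fc timings per 128-image minibatch, as assumed above. A rough sketch (the layer dimensions are my reading of the 1-col "one weird trick" AlexNet at 227x227 input, so treat them as an assumption):

```python
# Per-layer forward MFLOP for 1-col AlexNet (OWT), counting a multiply-add as 2 FLOP.
# (out_channels, in_channels, kernel, output_height=width)
layers = {
    "conv1": (64,   3,            11, 55),
    "conv2": (192,  64,           5,  27),
    "conv3": (384,  192,          3,  13),
    "conv4": (256,  384,          3,  13),
    "conv5": (256,  256,          3,  13),
    "fc6":   (4096, 256 * 6 * 6,  1,  1),
    "fc7":   (4096, 4096,         1,  1),
    "fc8":   (1000, 4096,         1,  1),
}
for name, (co, ci, k, hw) in layers.items():
    mflop = 2 * co * ci * k * k * hw * hw / 1e6
    print(f"{name}: {mflop:.0f} MFLOP/image")   # 141, 448, 224, 299, 199, 75, 34, 8

# GFLOP/s column, per the per-image (conv) / per-minibatch (fc) reading of the log:
print(141 / 0.726)         # conv1 forward: ~194 GF/s
print(75 * 128 / 38.597)   # fc6 forward:   ~249 GF/s
```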
@ozabluda i think your analysis of the intel #s looks good and is believable. as per an above comment, we're guessing ~2.65TFLOPs peak for the dual-socket 36-core machine intel used for the announcement. so again it comes out to ~35% or so efficiency.
but, i think there are some issues with your per-layer analysis in your second comment. firstly, i don't think we can trust the per-layer #s from the caffe log too much; for example the pack+relu1 times are >> the conv1 time, so i'd assume there's some timing wonkiness there -- time and/or work being shifted among layers for example.
but, perhaps more importantly (and confusingly):
1) the 1684 ms is for 10 iterations/batches. this is the value that got corrected to ~2680ms, with a corresponding 268ms forward+backward per batch. confusingly, the other two #s for forward and backward (the ~73ms back / ~91ms fwd) are per single iteration/batch. the idea is that they are the 'min' batch times across iterations, and thus in theory more indicative of the steady-state per-batch performance (which does seem to be the case). so for your forward-backward line you probably want to add the times of the forward and backward lines and ignore the overall combined time. alternately you could divide it by the iteration count, which will yield a similar value.
2) the 268ms is for a 128 image batch, not a single image. i believe your flop #s are for a single image (i have ~6.1GF for the no-groups regular alexnet per image, so i'd guess that your 4.2GF / image is right for the 'original' 2-groups version), so you're off by a factor of 128 in flops.
PS: using 268ms / batch, and 4.2GF / image, that yields a still-implausible ~2TF/s for the 6-core digits box, and again it seems to disagree with the more-reasonable intel announced #s, so i'm still assuming benchmarking error.
There is no such thing as an AVX2 clock.
@moskewcz I also noticed that Intel's Caffe seems to report timings for conv layers per image and for fc per minibatch. I corrected the table above (I also realized Soumith's numbers are for 1-col AlexNet, while Intel's are for 2-col AlexNet). Please check if it makes sense to you now.
AVX2 (32 SP ops/clock) can't run at the base clock frequency, so it throttles down to a lower "AVX clock". Although, maybe it is only true for AVX-512, which none of the CPUs in question have.
@ozabluda hmm, i'm not sure what you changed, but i guess it looks more/differently wrong to me now, still as per my (1) and (2). AFAIK all the caffe timings are supposedly per batch/iteration, not per image (as per my comment section (2)). and in this case, they look like garbage, as per my comment section (1). FWIW it's been a while since i dug into the caffe timing code and it has changed over time but on the whole i've always found it hard to work with / understand; i'm mostly just looking at things here from the top level and using my own calculations, so i'm not the best one to comment on the details of the caffe reported #s.
@moskewcz Stock Caffe timings sure are per minibatch (like Soumith's OpenBLAS timings). Intel's port timings do look like garbage (say 0.726ms for conv1), unless they are per image (except for fc), in which case they totally make sense (and approximately equal to stock Caffe/OpenBLAS). See my table above.
@andravin> The benchmark used dual E5-2699 v3 CPUs, which have 18 cores at 2.3 GHz => 2 x 18 x 32 FLOPs/cycle x 2.3 GHz = 2.65 TFLOP/s
Actual AVX base clock is 1.9 GHz (see quote below).
2 CPUs * 18 cores * 32 FLOP/cycle * 1.9 GHz = 2.189 TFLOP/s
I am almost willing to bet that the scaling to the second CPU is extremely poor in this Intel's iteration. i.e. 2 CPUs are not that much faster than 1 CPU.
To cope with the huge difference between the power consumption of Integer and AVX code, Intel is introducing new base and Turbo Boost frequencies for all their SKUs; these are called AVX base/Turbo. For example, the E5-2693 v3 will start from a base frequency of 2.3GHz and turbo up to 3.3GHz when running non-AVX code. When it encounters AVX code however, it will not be able to boost its clock to more than 3GHz during a 1 ms window of time. If the CPU comes close to thermal and TDP limits, clock speed will drop down to 1.9GHz, the "AVX base clock". http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5
@ozabluda Ah, I did not know about this feature of Xeon processors, thanks. So it is Xeon only? soumith's Core(TM) i7-5930K will not have this? My i7-5775C seems to sustain AVX2 256-bit FMA instructions at regular turbo boost speed with liquid cooling.
I tracked down AVX base frequency specs for haswell e5 processors here: https://www.microway.com/knowledge-center-articles/detailed-specifications-intel-xeon-e5-2600v3-haswell-ep-processors/
Would be nice to find an official Intel source. I suspect this is only a feature of the big Xeon chips.
@soumith What command line did you use? README.txt says:
For timing
#> ./build/tools/caffe time \
-iterations <number of iterations> \
--model=models/intel_alexnet/train_val.prototxt
When I run that on my 4-core i7-5775C I get:
I1016 16:39:40.395843 15816 caffe.cpp:333] conv1 forward: 379.242 ms.
I1016 16:39:40.395848 15816 caffe.cpp:336] conv1 backward: 354.405 ms.
[...]
I1016 16:39:40.396093 15816 caffe.cpp:341] Min Forward pass: 2879.16 ms.
I1016 16:39:40.396098 15816 caffe.cpp:343] Min Backward pass: 5410.64 ms.
I1016 16:39:40.396102 15816 caffe.cpp:345] Min Forward-Backward: 83316 ms.
I1016 16:39:40.396107 15816 caffe.cpp:347] Total Time: 83316 ms.
[...]
Total FP jobs:8192 jpt:2048 residue:0
Total BP jobs:106496
Most telling are the Total FP/BP jobs numbers, which are exactly equal to 256X the values in your log file. 256 is the batch size specified in train_val.prototxt.
@soumith Oh I see now you are using your own prototxt file, not the one that was provided by Intel. Obviously there is something wrong that is causing your prototxt to use minibatch size 1.
Actually I get reasonable numbers using your alexnet.prototxt too. So I am not sure what is wrong with your setup.
@andravin:
Ah, I did not know about this feature of Xeon processors, thanks. So it is Xeon only? My i7-5775C seems to sustain AVX2 256-bit FMA instructions at regular turbo boost speed with liquid cooling.
I think all CPUs have it, if they overheat. Liquid cooling helps (I noticed that with my liquid-cooled Haswell as well). Can your CPU run AVX2 256-bit FMA instructions at regular turbo boost speed on all cores simultaneously, or just one?
I tracked down AVX base frequency specs for haswell e5 processors here: [microway]
This is awesome, thank you.
@moskewcz:
i think there are some issues with your per-layer analysis in your second comment. firstly, i don't think we can trust the per-layer #s from the caffe log too much; for example the pack+relu1 times are >> the conv1 time, so i'd assume there's some timing wonkiness there -- time and/or work being shifted among layers for example.
I think something caused conv layers to report time per image, while everything else is per minibatch.
but, perhaps more importantly (and confusingly): 1) the 1684 ms is for 10 iterations/batches. this is the value that got corrected to ~2680ms, with a corresponding 268ms forward+backward per batch. confusingly, the other two #s for forward and backward (the ~73ms back / ~91ms fwd) are per single iteration/batch.
My calculations are per-layer. Total Forward/Backward are also calculated from per-layer (reported numbers are all screwed up), exactly as you suggest.
[...] so for your forward-backward line you probably want to add the times of the forward and backwards lines and ignore the overall combined time. alternately you could divide it by the iteration count which will yield a similar value. 2) the 268ms is for a 128 image batch, not a single image.
I ignore 2680/268 number.
i believe your flop #s are for a single image
that's right.
(i have ~6.1GF for the no-groups regular alexnet per image, so i'd guess that your 4.2GF / image is right for the 'original' 2-groups version), so you're off by a factor of 128 in flops.
I have 4.231 GF/image for the 'original' 2-groups version and 4.285 GF/image for the "One weird trick" 1-col version, ignoring biases, LRN, activations, pooling, and loss. Your 6.1 GF/image is probably the 'original' 2-groups version without groups, but it's not what 1-col version is (the number of filtermaps is different).
PS: using 268ms / batch, and 4.2GF / image, that yields a still-implausible ~2TF/s for the 6-core digits box, and again it seems to disagree with the more-reasonable intel announced #s, so i'm still assuming benchmarking error.
My calculated "total time" conv*128+fc comes to 4524 ms/minibatch. I ignore the 268 number, because it doesn't correspond to anything in the per-layer numbers that I can think of. 90ms and 72ms correspond to the sums, but they are incorrect because conv is per image and everything else is per minibatch.
@andravin thanks for the log on your side. I suppose doing pure-benchmarking instead of having that lmdb data layer before might be having side-effects on the intel caffe. I'll follow-up on Monday.
@ozabluda Here are official Intel documents about avx and frequencies for Xeon E5 v3, does not mention other processors, which of course leaves us wondering: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf
Still haven't found anything authoritative for i7. Probably have to ask Intel.
Also I want to make clear that I think @soumith 's log file indicates that the batch size was just 1 image. Not sure why his alexnet.prototxt gives me batch size 128 and behaves differently for him.
@andravin>Here are official Intel documents about avx and frequencies for Xeon E5 v3, does not mention other processors, which of course leaves us wondering:
Thank you. These are good. I think all Intel CPUs are in practice limited only by TDP (which liquid cooling helps with), even though Intel also lists current and power limits. Overclocked Intel CPUs are known to draw 350W on Prime95 without damage, and 400W with possible long-term damage, and Intel CPUs don't prevent it if cooled.
The doc says that AVX will never go over "AVX Max All Core Turbo" (even though the doc implies that it should only be true for AVX2).
Also I want to make clear that I think @soumith 's log file indicates that the batch size was just 1 image.
I don't think so. There is:
input_dim: 128
Top shape: 128 3 227 227 (19787136)
https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L5 https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L183
and timings for all non-conv layers (fc, relu, pool) look like they are for minibatch size=128. It looks more like only conv layer timings are per image for whatever reason.
@soumith:
@moskewcz 3.77TF/s doesn't hold true if you switch to FFT or Winograd based convolutions. http://arxiv.org/abs/1509.09308
This is pretty awesome, @andravin
Thanks, @ozabluda I'm looking forward to the first truly efficient implementations of the fast Winograd convnet algorithms. The first draft of the paper was just a teaser. ;-)
For one 4x4 block, F(2x2, 3x3), standard direct convolution uses (3 * 3) * (2 * 2) = 36 multiplications and 6 * 4 = 24 additions, for a total of 36 + 24 = 60 FLOP.
Ignoring the amortized filter transform [1] and the amortized inverse transform [2], @andravin's implementation of Winograd convolution uses 4 * 4 = 16 multiplications and 80/3 amortized additions, for a total of 16 + 80/3 = 42 2/3 FLOP.
Utilization is 60/(16+80/3)= 140.625%, which is how he gets results in Table 6 (max efficiency 134.0% on conv4.2).
I tried counting the absolute minimum number of amortized additions, ignoring the filter [1] and inverse [2] transforms, assuming an infinite image and infinite CPU registers.
I counted 24 additions for the data transform per block.
This gives us 16+24=40 FLOP. Compared to standard direct 60 FLOP, we have 60/40= 150% max possible utilization.
[1] The paper says the filter transform uses 28 FLOP per input channel. For conv4.2 the image is 24x24, which makes the filter-transform FLOPs negligible.
[2] The paper says the inverse transform uses 24 additions, amortized over input channels, which is negligible for all layers except conv1.2, but even there it's a win: 60/(16+24+24/3) = 1.25. Not sure why it is not in Table 6.
C^T d C = | | |
---|---|---|---
d00−d20−d02+d22 | d20−d22+d10−d12 | d20−d22−d10+d12 | d10−d12−d30+d32
d01−d21+d02−d22 | d21+d22+d11+d12 | d21+d22−d11−d12 | d11+d12−d31−d32
−d01+d21+d02−d22 | −d21+d22−d11+d12 | −d21+d22+d11−d12 | −d11+d12+d31−d32
d01−d21−d03+d23 | d21−d23+d11−d13 | d21−d23−d11+d13 | d11−d13−d31+d33
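A minimal numeric check of F(2x2,3x3) may help here; the matrices below are the transforms from the Lavin & Gray paper linked above (the data transform is the 4x4 array tabulated just above, written C^T d C there; B^T, G, A^T follow the paper's notation):

```python
import numpy as np

# Winograd F(2x2,3x3): Y = A^T [ (G g G^T) .* (B^T d B) ] A
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))    # 4x4 input tile
g = rng.standard_normal((3, 3))    # 3x3 filter

U = G @ g @ G.T                    # filter transform (4x4)
V = Bt @ d @ Bt.T                  # data transform (4x4) -- the table above
Y = At @ (U * V) @ At.T            # 2x2 output tile: only 16 elementwise multiplies

# Direct 'valid' correlation (convnet convention, no filter flip) for comparison.
ref = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)] for i in range(2)])
print(np.allclose(Y, ref))         # True
```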
@ozabluda Thanks, one thing I think you are missing is that transformed data can be re-used for convolution with every filter. So the data transform FLOPs can be amortized over the number of filters.
Anyway I don't want to hijack this issue, so please continue the conversation of Winograd convnet algorithms at https://www.reddit.com/r/MachineLearning/comments/3nocg5/fast_algorithms_for_convolutional_neural_networks/
@andravin Aha! This is why you keep referring to only the number of multiplications in "arithmetic complexity reduction".
Ok, so today I finally finished building my caffe lmdb for imagenet, and I ran the intel benchmarks with the lmdb data layer etc. etc. (just like how they want it to be).
The numbers are not as impressive anymore (as expected).
References: Caffe + MKL: https://github.com/soumith/convnet-benchmarks/blob/cpu/caffe/output_alexnet_mkl.log#L320-L329
IntelCaffe + MKL: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L362-L371
In related news, I just finished the first winograd fprop/bprop fp32 kernel. It is fully fused and requires no additional memory. But the big news is that it runs fastest at a minibatch size of 8. And by fast I mean close to 10 virtual Tflops. It is full utilization and is primarily power limited. The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at 1.7 vTflops.
I have a fair amount of tuning I want to try with it to see if I can boost cache performance, then I'll move on to the grad weight update kernel which I already have sketched out.
Then after that I'll take a stab at the bigger and faster transforms. I'm hoping to hit about 3x the performance of direct conv, and also at N=8. But those will be much trickier to fit in a fully fused kernel.
Special thanks to @andravin for lots of fruitful discussion on bringing this about.
@scott-gray that sounds super exciting. Can't wait to bench it.
I second that :) Can't wait to try this out! With the pervasiveness of 3x3 convolutions nowadays, this could be a game changer.
Hi, I'm one of the developers who worked on this package. I've looked at the run.sh and the only suggestion I have is to enable OpenMP thread affinity by setting KMP_AFFINITY=compact,granularity=fine
(assuming that the CPU has HyperThreading enabled). This probably should be done for the baseline run as well.
Looking at the logs, I see that the speedups for the convolution layers are not as high as we'd expect, but we never ran on a 6-core machine, so maybe our expectations are wrong. The CPU convolution layers often call tall and skinny SGEMMs which have limited scalability for a 2x18-core machine. But on a 6-core machine the gap between the SGEMM-based convolution and the approach we used may be much more narrow.
Also, it's weird that the fc layers run slower in the new package because we did not modify that code.
Here's a link to a whitepaper explaining how the CPU changes frequency when executing AVX2 instructions.
Thanks, @rsdubtso We found that whitepaper but it only explicitly mentions Xeon E5 v3 processors. Are other processors (eg i7) affected by AVX2 frequencies, if so where can we find documentation of the AVX2 frequencies for those processors? Anyway I opened an Intel forum ticket for it here: https://communities.intel.com/thread/87851
Nice work Scott! Looking forward to playing with it
@scott-gray strikes again! Well done.
@soumith @rsdubtso
The numbers are not as impressive anymore (as expected).
Caffe + MKL: ~5100 ms
IntelCaffe + MKL: ~3052 ms
Speedup: 1.67x
Actually, there is tremendous improvement in the convolutional layers (still far from the 614 GFLOP/s peak), even bigger improvements in the pool and activation layers (which don't matter much), a huge regression in the fc layers - should be easy to fix, and a huge regression in the data layer (even easier to fix). Caffe/MKL is also much faster than the Caffe/OpenBLAS you benchmarked earlier. There is a timing bug in conv1 backward (implausible GFLOP/s). Also, this benchmark was run with minibatch=256, inconsistent with all the others, where minibatch=128.
layer (i7-5930K, 614 GF/s peak) | MF | OpenBLAS ms | OpenBLAS GF/s | Caffe+MKL ms | Caffe+MKL GF/s | IntelCaffe ms | IntelCaffe GF/s
---|---|---|---|---|---|---|---
conv1 forward: | 141 | 304.795 | 59.2 | 357.069 | 101.1 | 88.196 | 409.3 |
conv1 backward: | 282 | 536.807 | 67.2 | 330.767 | 218.3 | 93.893 | 768.9 |
conv1/relu forward: | | 21.8936 | | 47.1887 | | 7.79 | |
conv1/relu backward: | | 28.5025 | | 57.7991 | | 12.544 | |
pool1/3x3_s2 forward: | | 85.0495 | | 216.237 | | 8.542 | |
pool1/3x3_s2 backward: | | 45.7551 | | 92.3998 | | 18.194 | |
conv2/5x5_s1 forward: | 448 | 361.393 | 158.7 | 533.456 | 215.0 | 251.792 | 455.5 |
conv2/5x5_s1 backward: | 896 | 687.499 | 166.8 | 1007.32 | 227.7 | 684.775 | 335.0 |
conv2/relu forward: | | 15.8821 | | 32.5319 | | 5.629 | |
conv2/relu backward: | | 20.6075 | | 41.8427 | | 9.129 | |
pool2/3x3_s2 forward: | | 67.7179 | | 138.228 | | 6.104 | |
pool2/3x3_s2 backward: | | 35.3347 | | 71.5279 | | 13.084 | |
conv3/3x3_s1 forward: | 224 | 254.672 | 112.6 | 207.55 | 276.3 | 126.165 | 454.5 |
conv3/3x3_s1 backward: | 448 | 385.527 | 148.7 | 415.731 | 275.9 | 285.18 | 402.2 |
conv3/relu forward: | | 7.8402 | | 15.1693 | | 2.503 | |
conv3/relu backward: | | 9.814 | | 19.5894 | | 4.08 | |
conv4/3x3_s1 forward: | 299 | 424.084 | 90.2 | 321.758 | 237.9 | 169.798 | 450.8 |
conv4/3x3_s1 backward: | 598 | 660.748 | 115.8 | 658.584 | 232.5 | 382.091 | 400.7 |
conv4/relu forward: | | 5.3955 | | 10.159 | | 1.559 | |
conv4/relu backward: | | 6.7793 | | 13.0562 | | 2.705 | |
conv5/3x3_s1 forward: | 199 | 282.846 | 90.1 | 218.284 | 233.4 | 113.286 | 449.7 |
conv5/3x3_s1 backward: | 398 | 428.887 | 118.8 | 435.634 | 233.9 | 256.046 | 397.9 |
conv5/relu forward: | | 5.4022 | | 10.1855 | | 1.583 | |
conv5/relu backward: | | 6.4006 | | 12.9499 | | 2.86 | |
pool5/3x3_s2 forward: | | 34.1529 | | 53.2655 | | 1.958 | |
pool5/3x3_s2 backward: | | 15.0692 | | 31.1371 | | 4.043 | |
fc6 forward: | 75 | 41.5847 | 232.4 | 42.6512 | 453.1 | 72.547 | 266.4 |
fc6 backward: | 151 | 79.5084 | 243.1 | 77.2451 | 500.4 | 132.581 | 291.6 |
fc7 forward: | 34 | 18.6208 | 230.7 | 20.1991 | 425.3 | 33.675 | 255.1 |
fc7 backward: | 67 | 29.3293 | 292.9 | 37.3001 | 460.6 | 61.519 | 279.3 |
fc8 forward: | 8 | 4.2152 | 248.8 | 5.7591 | 364.1 | 8.904 | 235.5 |
fc8 backward: | 16 | 7.5515 | 277.7 | 9.0703 | 462.4 | 15.573 | 269.3 |
Average Forward | 1428 | 1935.58 | 94.4 | 2259.9 | 161.8 | 1026.8 | 356.1 |
Average Backward | 2856 | 2984.16 | 122.5 | 3312.14 | 220.8 | 1982.95 | 368.8 |
Average Forward-Backward: | 4285 | 4919.8 | 111.5 | 5572.1 | 196.9 | 3084.5 | 355.6 |
Total Time: | | | | 55721 | | 30845 | |
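If I read the table right, each GF/s entry is just MFLOP/image * minibatch / ms, with minibatch 128 for the OpenBLAS run and 256 for the two MKL-based runs (the inconsistency noted above). A quick spot check:

```python
# Spot-checking the conv1 forward row of the table above.
print(141 * 128 / 304.795)   # OpenBLAS:   ~59 GF/s
print(141 * 256 / 357.069)   # Caffe+MKL:  ~101 GF/s
print(141 * 256 / 88.196)    # IntelCaffe: ~409 GF/s
```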
@ozabluda just note that IntelCaffe uses "minimum time over all runs" for the per-layer numbers, whereas regular Caffe uses "average time over all runs". That's one reason why I didn't do a per-layer breakdown.
@soumith, does it really matter? Typically only the first iteration differs much from the others.
@scott-gray:
In related news, I just finished the first winograd fprop/bprop fp32 kernel. It is fully fused and requires no additional memory. But the big news is that it runs fastest at a minibatch size of 8. And by fast I mean close to 10 virtual Tflops. It is full utilization and is primarily power limited. The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops. I have a fair amount of tuning I want to try with it to see if I can boost cache performance,
Awesome. Fastest at a minibatch size of N=8 is awesome, but weird (cache performance?), because, in addition to less work for the GPU, you amortize filter transforms over a smaller N.
10 virtual Tflops / 6.144 actual Tflops = 163% "utilization" (using @andravin's terminology). Why so little? Gimme, gimme, gimme :-). Assuming you implemented F(2x2,3x3), max theoretical utilization [1] is (60+4)/(16+4) = 320% [2]. For N=8, we can't neglect the filter transform (28 FLOP): (60+4)/(16+4+28/8) = 272%
[1] by my calculation, different from paper, please correct me if I am wrong
[2] except for the first layer (conv1.1), where we can't neglect 24 FLOP in Inverse transform amortized over only C=3 input channels, and there are only 2 reductions per output per input channel, for the total max theoretical utilization: (60+2 * 4/3)/(16+(2 * 4+24)/3)=235%. For N=8 (60+2 * 4/3)/(16+(2 * 4+24)/3 + 28/8) = 208%. On GPU (but not CPU) conv1.1 is i/o bound anyway, so no utilization improvement is actually possible.
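For what it's worth, the ratios above reproduce directly (this just restates the FLOP counts given in this comment, not an independent derivation):

```python
# Max theoretical utilization ratios for F(2x2,3x3), per the counts above.
direct = 60 + 4                          # direct FLOP per 2x2 output block
print(direct / (16 + 4))                 # 3.20 -> 320%
print(direct / (16 + 4 + 28 / 8))        # 2.72 -> 272% with filter transform, N=8

conv11 = 60 + 2 * 4 / 3                  # first layer, C=3 input channels
print(conv11 / (16 + (2 * 4 + 24) / 3))             # 2.35 -> 235%
print(conv11 / (16 + (2 * 4 + 24) / 3 + 28 / 8))    # 2.08 -> 208%
```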
@rsdubtso your suggested flags didn't make much difference -- IntelCaffe went from 3052 ms to 3000 ms
Intel recently released a small blog post claiming crazy-talk speeds for ConvNets on their Haswell CPU line. I took their Caffe implementation, painfully installed the dependencies, and the numbers look almost too good to be true. Either someone refutes me, or these are very cool numbers.
Link to blog-post: https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors
A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms.
Just for comparison, the latest and greatest NVIDIA Titan-X does the same round-trip in 96 ms. An older generation GPU like Tesla K40 is slower, pegging at around 200+ ms.
I tried to get VGG working but ran into assertions about unimplemented code paths; regardless, if AlexNet is this fast, the others will probably be in the ballpark.
Can someone else try the Intel stuff? I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365