soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets
MIT License

[October 2015] Intel are CPU magicians. But there's no one weird trick.... #59

Open soumith opened 8 years ago

soumith commented 8 years ago

Intel released a small blog-post recently covering that they have crazy-talk speeds for ConvNets on their Haswell CPU line. I took their Caffe implementation, painfully installed the dependencies, and the numbers look almost too good to be true. Either someone refutes me, or these are very cool numbers.

Link to blog-post: https://software.intel.com/en-us/articles/single-node-caffe-scoring-and-training-on-intel-xeon-e5-series-processors

A full [forward + backward] on AlexNet on a Desktop 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz takes an average of 164ms EDIT: 268 ms.

Just for comparison, the latest and greatest NVIDIA Titan-X does the same round-trip in 96 ms. An older generation GPU like Tesla K40 is slower, pegging at around 200+ ms.

I tried to get VGG working but ran into assertions about unimplemented code pathways. Regardless, if AlexNet is this fast, the others will probably be in the same ballpark.

Can someone else try the Intel stuff? I need a couple more sanity checks before I can believe this result. Look at how little time they are spending in the convolution layers, even the biggest ones: https://github.com/soumith/convnet-benchmarks/blob/cpu/intel_optimized_technical_preview_for_multinode_caffe_1.0/output_alexnet.log#L329-L365

scott-gray commented 8 years ago

@ozabluda: Yes, this is F(2x2,3x3). This requires a batch of 16 gemms. I'm able to fit this all in one block for K=32 and 4 overlapping coordinates of x,y each with 8 units of minibatch. So really it's 16 32x32 tiles. The in-block overlap is key, as that's what gives you such a high L1 hit rate; otherwise you'd be bandwidth bound on L2. I use the standard 8 register gemm blocking so that means 64 FFMA's per outer product. But instead of having 1 big loop of 8 outer products, I split it in two loops of 4 (or 256 FFMAs each). 1 loop (128 threads) does the image transform inline and the other (128 threads) the filter transform (256 threads total). I can fit two blocks on an SM to cover bar.sync latencies.

Anyway, all the transform logic, pointer arithmetic, predicating and loop logic requires 138 clock consuming instructions interspersed with the FFMAs. This drops performance about 138/512=27%. This kernel is so dense with memory operations (even if they're mostly cache hits) that there's little opportunity for the boost clock to add much performance. I even have a bit of instruction cache thrashing going on because the total loop size is slightly over the size of the instruction cache.

With more shared memory and/or registers I'd have a lot more headroom to increase the tile size a bit and reduce bandwidth (as well as transform overhead). Perhaps Pascal will provide that.

Anyway, I'll have a much more detailed write up forthcoming (probably as an addition to Andrew's paper).

benanne commented 8 years ago

Awesome, looking forward to that!

ozabluda commented 8 years ago

@rsdubtso,

Hi, I'm one of the developers who worked on this package. [...] Looking at the logs, I see that the speedups for the convolution layers are not as high as we'd expect, but we never ran on a 6-core machine, so maybe our expectations are wrong. The CPU convolution layers often call tall and skinny SGEMMs which have limited scalability for a 2x18-core machine. But on a 6-core machine the gap between the SGEMM-based convolution and the approach we used may be much more narrow.

Thank you very much for your work. Intel's announcement was on 2xE5-2699v3 (18 core):

Forward pass: 1449 MFLOP * 731 images/1sec = 1.059 TFLOP/s
Forw+Backw pass: 3 * 1449 MFLOP * 271 images/1sec = 1.178 TFLOP/s

I estimate peak FLOP/s for E5-2699v3 like so: 18 cores * 32 FLOP/cycle * 1.9GHz (AVX base clock) =1.094 TFLOP/s

Performance numbers above look to me much more like ~100% utilization on 1 CPU with very poor scalability to 2 CPUs than 50% utilization on 1 CPU with excellent scalability to 2 CPU. It would be nice to have performance numbers for 1 CPU (especially more "normal" 16core E5-2698v3 and such).
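A minimal sketch of the arithmetic behind this estimate, using only the figures quoted in this comment. The function names are illustrative, and counting backward as twice the forward FLOPs is the same assumption the comment makes.

```python
def peak_tflops(cores: int, flop_per_cycle: int, ghz: float) -> float:
    """Theoretical peak of one socket in TFLOP/s."""
    return cores * flop_per_cycle * ghz / 1000.0


def achieved_tflops(mflop_per_image: float, images_per_sec: float) -> float:
    """Sustained throughput in TFLOP/s."""
    return mflop_per_image * images_per_sec / 1e6


# E5-2699v3: 18 cores, 32 FLOP/cycle, 1.9 GHz AVX base clock
peak_one_socket = peak_tflops(18, 32, 1.9)
forward = achieved_tflops(1449, 731)
forward_backward = achieved_tflops(3 * 1449, 271)

for name, value in [("forward", forward), ("forward+backward", forward_backward)]:
    print(
        f"{name}: {value:.3f} TFLOP/s, "
        f"{value / peak_one_socket:.0%} of one socket, "
        f"{value / (2 * peak_one_socket):.0%} of two sockets"
    )
```

Read this way, both passes land near the peak of a single socket, which is why the numbers alone cannot distinguish the two scaling scenarios.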

FWIW, on my dual 8-core E5-2640v3 2.60 GHz, scalability of Caffe/OpenBLAS to the second CPU is almost zero. On 1 CPU, scalability from 4 to 8 cores is so poor that my 4-core i5-4670K 3.50 GHz outperforms it by 1.5-2.2x in convolutional layers. I didn't try MKL yet.

andravin commented 8 years ago

@ozabluda F(2x2,3x3) has a maximum speedup of (2x2x3x3)/(4x4) = 2.25. In general the max speedup for F(mxn, rxs) is (m n r s) / ((m+r-1)(n+s-1))
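The formula above can be sketched as a small helper. The function name is illustrative; it counts multiplications only, exactly as the formula does.

```python
def winograd_max_speedup(m: int, n: int, r: int, s: int) -> float:
    """Max speedup of F(m x n, r x s) over direct convolution, multiplications only."""
    direct = m * n * r * s                   # one multiply per output per filter tap
    winograd = (m + r - 1) * (n + s - 1)     # one multiply per input tile element
    return direct / winograd


print(winograd_max_speedup(2, 2, 3, 3))  # F(2x2,3x3)
print(winograd_max_speedup(4, 4, 3, 3))  # F(4x4,3x3)
```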

ozabluda commented 8 years ago

@andravin

F(2x2,3x3) has a maximum speedup of (2x2x3x3)/(4x4) = 2.25

This is the number in the paper from below (9). But this is multiplications only. What about additions? Standard direct convolution also uses 6*4=24 additions, for the total 36+24=60 FLOP.

Both direct and Winograd convolutions also use 4 unamortized additions (1 reduction per output, asymptotically for K>>1), for the max possible utilization: (60+4)/(16+4)=3.2

It's quite possible that I got it wrong. Grateful for corrections.

andravin commented 8 years ago

Because multiplication, addition, and multiply accumulate all have the same throughput, I count them all equally. That not only makes the analysis simpler, but gives you a more accurate accounting of how many arithmetic instructions you will need to execute in order to implement the algorithm.

In any case, if you wanted to count FLOPs instead of floating point instructions (FLIPs?), you would have to count the additions used in the reductions across channels, which make up for the additions that are missing from your accounting of Winograd convolution FLOPs.

ozabluda commented 8 years ago

I see. It would totally make sense to introduce FLIP, AKA "arithmetic complexity" from your paper. Direct uses 36 FLIP and Winograd F(2x2,3x3) uses 16 FLIP. We can't use "FLOP", because existing terminology is too ingrained with 2 FLOP per 1 MAC/FMAC. For example, in this thread, this is how Titan-X has 6.144 TFLOP/s (3.072 TFLIP/s), Haswell can do 32 FLOP/clock (16 FLIP/clock), AlexNet(1-col) Forward pass w/direct convolutions has 1428 MFLOP (714 MFLIP) per image, etc.

FLIP (additions) may be slightly cheaper than FLOP, because they generate less heat, and CPU/GPU may clock higher, maybe for E5-2699v3 in the range 1.9GHz (AVX base clock) vs 2.6 GHz (AVX Max All Core Turbo Frequency) 2.6/1.9=1.4

@scott-gray, sorry I couldn't quite follow, do you count this in FLOP or FLIP:

138 clock consuming instructions interspersed with the FFMAs. This drops performance about 138/512=27%

FLIP: 2.25 * (1-138/512)=1.64 (i.e. you are already at the theoretical (practical) max)
FLOP: 3.2 * (1-138/512)=2.34 (i.e. ???)

scott-gray commented 8 years ago

2.25(1-138/512)=1.64 was how I was calculating it. Basically any instruction in the gemm loop that isn't dual issued dilutes the number of FFMA's that can be processed. In this case there are a lot (138) but it turns out to not be such a bad thing as this kernel is right on the edge of being bandwidth limited. I'm working on a scanning back and forth square wave block id remapping to see if that increases cache hits and drops power use a bit so the boost clock can kick in more. I'll also do the fp16 version too. I'm pretty sure that will have a lot more headroom. Though this kernel has very few remaining instruction slots to insert the F2F.F32.F16s (dual issued) but I'm pretty sure I can squeak them in.
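A minimal sketch of this dilution model, assuming (as the calculation above does) that every non-dual-issued instruction costs one FFMA issue slot. The function name is illustrative, and the 6.144 TFLOP/s Titan-X peak is the figure quoted earlier in this thread.

```python
TITAN_X_PEAK_TFLOPS = 6.144  # peak figure quoted earlier in the thread


def diluted_speedup(max_speedup: float, overhead_instructions: int,
                    ffma_instructions: int) -> float:
    """Scale the ideal speedup by the overhead ratio, as in 2.25 * (1 - 138/512)."""
    return max_speedup * (1 - overhead_instructions / ffma_instructions)


speedup = diluted_speedup(2.25, 138, 512)
virtual_tflops = speedup * TITAN_X_PEAK_TFLOPS
print(f"{speedup:.2f}x effective speedup, {virtual_tflops:.1f} vTFLOP/s at peak clock")
```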

rsdubtso commented 8 years ago

@andravin,

Are other processors (eg i7) affected by AVX2 frequencies, if so where can we find documentation of the AVX2 frequencies for those processors?

Probably the CPU support folks will have a better answer than I can find.

@soumith,

your suggested flags didn't make much difference -- IntelCaffe went from 3052 ms to 3000 ms

Thanks. That means that the OS already did a fine job scheduling the threads...

@ozabluda,

Performance numbers above look to me much more like ~100% utilization on 1 CPU with very poor scalability to 2 CPUs than 50% utilization on 1 CPU with excellent scalability to 2 CPU. It would be nice to have performance numbers for 1 CPU (especially more "normal" 16core E5-2698v3 and such)

Thanks, this is an interesting observation. If I interpret you correctly, you're saying that we're running at 50% efficiency of the whole machine. My anecdotal evidence from the times when we were tuning the benchmark, is that the conv layers do scale pretty well with the number of threads, but I do not have any numbers handy. I'll post new numbers from single socket runs next week.

I'll also try to find out if we see the regression in fc layers in our setup. This came as quite a surprise...

ozabluda commented 8 years ago

@scott-gray, I think your 10 vTFLOP/s, 163% utilization for F(2x2,3x3), K32xN8 is more impressive than you modestly describe. For N>>1, K>>1 max theoretical utilization is 36/16=2.25. But for N=8, you can't neglect the filter transform (28 FLIP from the paper), amortized over N. For K=32, you can't neglect the data transform (32 FLIP from the paper), amortized over K. 36/(16+28/8+32/32)=1.76 [1]. Those 28+32=60 FLIP are part of your 138 "apparent overhead" instructions, but only 138-60=78 instructions are "true overhead", i.e. 78/512=15%. Or am I counting this incorrectly: should I be assuming that the filter transform could have been reused across tiles (when N>8), ditto with the data transform (when K>32)?

[1] As @andravin describes in the paper, data transform over overlapped regions can be reused, and instead of 32, I counted min theoretical 24 additions (in his CPU implementation he has 80/3=26.7 additions), but it doesn't matter in this case: 36/(16+28/8+24/32)=1.78

The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops.

This must be a typo. For C=3, you can't neglect the inverse transform, amortized over C, which is 24 additions (from the paper): 36/(16+28/8+32/32+24/3)=1.26. 1.26 * 6.144 TF/s = 7.76 vTF/s. You probably meant 6.7 vTflops.

scott-gray commented 8 years ago

Sharing transform code from overlapped regions is much harder in practice than it may seem. Working around all the constraints means you need an additional pass through shared memory or perhaps warp shuffles.. which adds more overhead. What I have right now with each thread computing one transform works pretty well.

As far as performance goes, the bigger your tile size the fewer repeated transforms you have to make, but there's only so much I can fit in the limited shared memory available. I could do a non-fused kernel and only do the minimum number of transforms but that also adds a lot of overhead and I'm pretty sure it won't run as fast as a fused kernel that is computing extra transforms inline. On the filter side there's also the benefit of only needing 9 loads instead of 16 if you do the transform inline.

But counting the non-ffma clock consuming instructions is all you need to figure out max performance. And this bears out in my testing.

For C=3, that's just 1 pass through the gemm loop at 3/4 utilization. But all the gemm setup + output code also has overhead that's hugely amplified by such a small time spent in gemm.

ozabluda commented 8 years ago

@scott-gray on reddit:

Once I have these kernels well tuned I'll move onto the much more complicated F(4x4,3x3) transform where as much as a 4x speedup may be possible (though on the GPU there's no avoiding the small transform overhead or the inefficiencies in the awkward 6x6 dimensions).

As the paper says, for F(4x4,3x3), everything infinite, asymptotically, speedup would indeed be 144/36=4.

But for a K32xN8 tiling block (can it be that large?), taking floating point instruction counts from the paper, the theoretical max possible utilization is:

For C>>3: 144/(36+156/32+72/8)=2.9
For C==3: 144/(36+156/32+72/8+90/3)=1.8

scott-gray commented 8 years ago

I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works. The outer product dims of the batched gemm are K and Y/4*X/4*N. So I don't just have 8 points of N on the outer product, but 4 sets of x,y coordinates of 8 points of N arranged in a 2x2 superblock. With the 2 units of overlap in each direction, this hugely increases the utilization of the L1 cache and it's what makes it possible for this kernel to have such dense global loads (16 loads in ~256 cycles is a lot).

I'm actually working on a 2x1 superblock for fp16 (2xy points of 16n) so as to eliminate the half empty 32 byte transaction size.

ozabluda commented 8 years ago

I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works.

I think I kinda understood a little bit the main idea how you get high L1 utilization, removing L2 bottleneck, but I don't understand how this can help with max theoretical peak FLIP/s calculation I am making. You still can't amortize filter transform over "effective" N=32, only over real N=8. Or can you?

scott-gray commented 8 years ago

x and y also factor into the number of image transforms you need, not just n. So 32 is the unit you need to use when calculating redundant transforms.

ozabluda commented 8 years ago

Aha! I get it now.

For F(4x4,3x3) correct formula for K32xN8, X2xY2(=4) is

For C>>3: 144/(36+156/32+72/4/8)=3.3
For C==3: 144/(36+156/32+72/4/8+90/4/3)=2.8

For the overlapped data transform, the correct number of FLIP is actually smaller than 156.

The last convolutional layer of VGG has image dimension 6x6, preventing X2xY2 superblock tiling. For that layer:

For C>>3: 144/(36+156/32+72/8)=2.9

For F(2x2,3x3) correct formula for K32xN8, X2xY2(=4) is

For C>>3: 36/(16+32/32+28/4/8)=2.01 (1.63 actually achieved)
For C==3: 36/(16+32/32+28/4/8+24/4/3)=1.81

For the overlapped data transform, the correct number of FLIP is actually smaller than 32, but, since it's at least 24 (by my calculation), it doesn't matter for K=32:

For C>>3: 36/(16+24/32+28/4/8)=2.04
For C==3: 36/(16+24/32+28/4/8+24/4/3)=1.83
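The bounds in this comment all follow one pattern, which can be sketched as a single helper. The function and parameter names are illustrative. The FLIP counts are the ones quoted from the paper in this thread, and the amortization rule (data transform over K, filter transform over tiles*N, inverse transform over tiles*C) is simply read off the formulas above.

```python
from typing import Optional


def max_utilization(direct: float, multiply: float, data_flip: float,
                    filter_flip: float, inverse_flip: float,
                    k: int, n: int, tiles: int = 1,
                    c: Optional[int] = None) -> float:
    """Upper bound on speedup over direct convolution for one tiling.

    c=None models C>>3, where the inverse transform cost is negligible.
    """
    cost = multiply + data_flip / k + filter_flip / (tiles * n)
    if c is not None:
        cost += inverse_flip / (tiles * c)
    return direct / cost


F4X4 = dict(direct=144, multiply=36, data_flip=156, filter_flip=72, inverse_flip=90)
F2X2 = dict(direct=36, multiply=16, data_flip=32, filter_flip=28, inverse_flip=24)

for name, counts in [("F(4x4,3x3)", F4X4), ("F(2x2,3x3)", F2X2)]:
    large_c = max_utilization(**counts, k=32, n=8, tiles=4)
    small_c = max_utilization(**counts, k=32, n=8, tiles=4, c=3)
    print(f"{name} K32xN8 X2xY2: C>>3 {large_c:.2f}, C==3 {small_c:.2f}")
```

Setting `tiles=1` gives the case without the 2x2 superblock, as for the 6x6 layer mentioned above.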

ozabluda commented 8 years ago

@scott-gray

The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops.

Initially, I thought this was a typo (as Titan-X has 6.144 real Tflops). Now I think it may mean an awesome 1.7 utilization (C=3, theoretical max utilization is 1.8, see previous comment), although it's weird, because C=3 is i/o bound.

rsdubtso commented 8 years ago

@ozabluda, Here's some data from a 2xE5-2697v3 machine (sorry, I did not have a desktop machine with a proper OS handy).

My colleague timed IntelCaffe on 14 and 28 cores (1 and 2 sockets). Affinity setup: KMP_AFFINITY=granularity=fine,compact,1,0. MKL version was 11.3.0.

There are quite a few cases when the ratio is less than 2 and even some cases where it is less than 1, but the most time-consuming layers have scaled pretty well. The total ratio is 1.83.

layer direction 14-core 28-core ratio(14/28)
data forward: 79.75 69.87 1.14
data backward: 0.00 0.00 N/A
pack1 forward: 9.35 4.74 1.97
pack1 backward: 0.00 0.00 2.00
conv1 forward: 133.78 66.79 2.00
conv1 backward: 101.10 56.35 1.79
relu1 forward: 12.29 5.94 2.07
relu1 backward: 17.14 8.42 2.04
norm1 forward: 45.06 22.82 1.97
norm1 backward: 67.51 32.20 2.10
pool1 forward: 16.82 8.57 1.96
pool1 backward: 27.53 13.77 2.00
conv2 forward: 163.55 107.55 1.52
conv2 backward: 416.93 208.84 2.00
relu2 forward: 7.83 3.77 2.08
relu2 backward: 10.92 5.34 2.05
norm2 forward: 28.15 14.92 1.89
norm2 backward: 43.41 20.60 2.11
pool2 forward: 10.23 5.31 1.93
pool2 backward: 17.34 8.76 1.98
conv3 forward: 105.66 52.77 2.00
conv3 backward: 228.13 114.76 1.99
relu3 forward: 2.22 1.02 2.17
relu3 backward: 3.63 1.82 1.99
conv4 forward: 81.77 40.92 2.00
conv4 backward: 176.46 88.66 1.99
relu4 forward: 2.21 0.88 2.51
relu4 backward: 3.71 1.81 2.04
conv5 forward: 56.16 28.06 2.00
conv5 backward: 120.98 60.74 1.99
relu5 forward: 1.26 0.42 2.96
relu5 backward: 2.79 0.82 3.40
pool5 forward: 2.30 1.06 2.17
pool5 backward: 3.82 0.81 4.73
unpack6 forward: 0.35 0.17 2.02
unpack6 backward: 0.24 0.19 1.30
fc6 forward: 23.39 13.01 1.80
fc6 backward: 41.84 22.64 1.85
relu6 forward: 0.06 0.14 0.38
relu6 backward: 0.12 0.05 2.58
drop6 forward: 4.06 4.04 1.00
drop6 backward: 0.12 0.07 1.68
fc7 forward: 11.14 5.99 1.86
fc7 backward: 18.95 10.08 1.88
relu7 forward: 0.05 0.13 0.39
relu7 backward: 0.09 0.05 1.90
drop7 forward: 4.04 4.04 1.00
drop7 backward: 0.16 0.09 1.69
fc8 forward: 2.59 1.87 1.38
fc8 backward: 5.26 3.10 1.70
loss forward: 11.03 11.27 0.98
loss backward: 0.18 0.22 0.79
Min Forward 818.78 477.70 1.71
Min Backward 1309.69 661.51 1.98
Min Forward-Backward: 21607.00 11820.00 1.83

soumith commented 8 years ago

@rsdubtso Taking the minimum timing of each layer rather than the average is a bit misleading and is not a standard in benchmarking. I think you should consider changing that, even though the overall difference might be minor.

ozabluda commented 8 years ago

@rsdubtso, thank you, these are great. I see great scalability to 2 sockets, with ~50% utilization (either one or two sockets), exactly opposite of my earlier guesses. Next natural experiments would be to run it on 1,2,4,8 cores to see where utilization breaks down (are you using AVX2 MADD?)

E5-2697v3 has: 14 cores * 32FLOPs/cycle * 2.2 GHz (AVX Core Freq) = 986 GFLOP/s (AVX boost goes from 2.9-3.3 GHz, depending on the number of cores active)

Note that you ran 2-col AlexNet with minibatch=256, while @soumith ran 1-col AlexNet with minibatch=256
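A minimal sketch of how the utilization column in the table below can be reproduced, assuming the peak estimate above (14 cores * 32 FLOP/cycle * 2.2 GHz per socket) and minibatch=256. The function name is illustrative; the conv2 forward figures are taken from this thread.

```python
PEAK_GFLOPS_PER_SOCKET = 14 * 32 * 2.2  # E5-2697v3 at the AVX base clock


def utilization(mflop_per_image: float, ms_per_minibatch: float,
                sockets: int = 1, minibatch: int = 256) -> float:
    """Fraction of theoretical peak reached by one layer."""
    # MFLOP per millisecond is numerically equal to GFLOP per second.
    gflops = mflop_per_image * minibatch / ms_per_minibatch
    return gflops / (sockets * PEAK_GFLOPS_PER_SOCKET)


# conv2 forward: 448 MFLOP per image, 163.55 ms on 14 cores, 107.55 ms on 28 cores
print(f"14 cores: {utilization(448, 163.55):.0%}")
print(f"28 cores: {utilization(448, 107.55, sockets=2):.0%}")
```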

IntelCaffe, mb=256, E5-2697v3
layer direction MFLOP ms(14-core) utilization(14-core) ms(28-core) utilization(28-core)
conv1 forward: 211 133.78 41% 66.79 41%
conv1 backward: 422 101.1 108% 56.35 97%
conv2 forward: 448 163.55 71% 107.55 54%
conv2 backward: 896 416.93 56% 208.84 56%
conv3 forward: 299 105.66 73% 52.77 74%
conv3 backward: 598 228.13 68% 114.76 68%
conv4 forward: 224 81.77 71% 40.92 71%
conv4 backward: 448 176.46 66% 88.66 66%
conv5 forward: 150 56.16 69% 28.06 69%
conv5 backward: 300 120.98 64% 60.74 64%
fc6 forward: 75 23.39 84% 13.01 75%
fc6 backward: 75 41.84 47% 22.64 43%
fc7 forward: 34 11.14 79% 5.99 74%
fc7 backward: 34 18.95 47% 10.08 44%
fc8 forward: 8 2.59 80% 1.87 56%
fc8 backward: 8 5.26 39% 3.1 34%
Conv+fc Forward 1449 578.04 65% 316.96 59%
Conv+fc Backward 2781 1109.65 65% 565.17 64%
Conv+fc Forward-Backward: 4231 1687.69 51% 882.13 46%

gujunli commented 8 years ago

@rsdubtso The data transfer time seems very long and does not scale well. Could you offer more details on how it is designed? Thanks!

gujunli commented 8 years ago

@rsdubtso Also, the relu forward seems much slower on two sockets. Why is that? Drop6 and Drop7 seem to still use one socket even when you have two sockets: the scaling ratio is 1.

rsdubtso commented 8 years ago

@andravin

Are other processors (eg i7) affected by AVX2 frequencies, if so where can we find documentation of the AVX2 frequencies for those processors?

Probably the CPU support folks will have a better answer than I can find.

I asked around, and here's what I was told: AVX frequency is not SW visible. But even desktop processors have a fused 'AVX' frequency that they throttle down to when executing heavy instructions. I could not find the frequency fused for the i7 CPU mentioned above, but you can find it out using prime95 v27.9 or later, for example. However, current-related throttling may occur earlier than you hit TDP budget limit related to heavy instructions.

emfomenk commented 8 years ago

Hi all, I worked with @rsdubtso on the package too.

@soumith, you are right, we should've pointed out that we report timings for the fastest iteration. Though, if you use the same package for comparing 'intel_alexnet' and 'bvlc_alexnet', the comparison will be quite representative.

@gujunli relu1-5 scale well, relu6-7 seem to be too small for scaling across sockets. drop6 and drop7 use rng (not parallelized), which most likely takes most time. We didn't optimize drop layer, except for adding parallelization on the loop.

@ozabluda, @gujunli, I reran the package on the same machine @rsdubtso did. The only change here is that I put the database on /tmp (local hard drive). @rsdubtso reported timings when the DB was on Lustre FS (distributed cluster filesystem). That was the reason why the timings were poor for the data layer. We didn't change the data layer much, only added simple parallelization on preparation of the image minibatch.

Iterations: 10

layer direction omp omp omp omp omp cmp cmp cmp cmp
28 14 8 4 2 28.vs.14 14.vs.8 8.vs.4 4.vs.2
-------- -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
data forward: 18.51 23.93 24.57 24.94 29.82 0.64 0.58 0.50 0.59
data backward: 0 0 0 0.00 0 N/A N/A N/A 0
pack1 forward: 4.70 9.41 10.11 14.26 25.03 1.00 0.61 0.70 0.87
pack1 backward: 0.00 0.00 0.00 0.00 0.00 0.5 0.57 0.5 0.5
conv1 forward: 66.96 133.95 212.41 345.24 612.40 1.00 0.90 0.81 0.88
conv1 backward: 56.47 101.13 170.32 328.05 650.71 0.89 0.96 0.96 0.99
relu1 forward: 5.96 12.46 12.29 12.86 19.19 1.04 0.56 0.52 0.74
relu1 backward: 8.4 17.15 17.41 20.12 32.11 1.02 0.58 0.57 0.79
norm1 forward: 22.99 44.78 64.66 126.46 251.12 0.97 0.82 0.97 0.99
norm1 backward: 31.86 63.13 67.77 95.95 169.04 0.99 0.61 0.70 0.88
pool1 forward: 8.44 16.41 27.14 54.10 106.53 0.97 0.94 0.99 0.98
pool1 backward: 13.90 27.66 27.97 34.21 54.56 0.99 0.57 0.61 0.79
conv2 forward: 105.79 164.55 282.32 561.36 1120.98 0.77 0.98 0.99 0.99
conv2 backward: 208.96 416.54 712.46 1415.67 2826.54 0.99 0.97 0.99 0.99
relu2 forward: 3.79 7.81 7.76 8.32 12.37 1.02 0.56 0.53 0.74
relu2 backward: 5.35 10.90 11.19 12.94 20.60 1.01 0.58 0.57 0.79
norm2 forward: 14.71 28.47 41.42 81.56 162.21 0.96 0.83 0.98 0.99
norm2 backward: 20.93 40.58 43.59 60.95 108.32 0.96 0.61 0.69 0.88
pool2 forward: 5.25 10.19 16.94 33.66 66.53 0.97 0.95 0.99 0.98
pool2 backward: 8.76 17.97 17.89 21.02 33.87 1.02 0.56 0.58 0.80
conv3 forward: 52.78 105.76 182.96 363.68 725.27 1.00 0.98 0.99 0.99
conv3 backward: 115.63 228.32 396.63 784.72 1562.49 0.98 0.99 0.98 0.99
relu3 forward: 1.04 2.27 2.26 2.73 4.27 1.09 0.56 0.60 0.78
relu3 backward: 1.88 3.62 3.95 4.55 7.17 0.96 0.62 0.57 0.78
conv4 forward: 40.89 81.86 139.66 275.17 547.34 1.00 0.97 0.98 0.99
conv4 backward: 88.91 176.64 301.33 595.64 1183.89 0.99 0.97 0.98 0.99
relu4 forward: 0.89 2.21 2.24 2.68 4.55 1.23 0.57 0.59 0.84
relu4 backward: 1.82 3.76 3.91 4.60 7.18 1.02 0.59 0.58 0.77
conv5 forward: 28.07 56.34 94.83 185.45 368.19 1.00 0.96 0.97 0.99
conv5 backward: 60.71 120.89 204.81 401.62 797.17 0.99 0.96 0.98 0.99
relu5 forward: 0.42 1.20 1.33 1.65 3.00 1.43 0.63 0.61 0.90
relu5 backward: 0.81 2.78 2.80 3.11 4.79 1.71 0.57 0.55 0.76
pool5 forward: 1.06 2.25 3.72 7.35 14.68 1.06 0.94 0.98 0.99
pool5 backward: 0.81 3.85 3.79 4.57 7.23 2.37 0.56 0.60 0.79
unpack6 forward: 0.16 0.37 0.48 0.81 1.59 1.17 0.73 0.83 0.98
unpack6 backward: 0.18 0.24 0.38 0.74 1.46 0.67 0.88 0.97 0.98
fc6 forward: 13.01 23.54 36.53 72.21 137.74 0.90 0.88 0.98 0.95
fc6 backward: 22.70 41.92 67.19 133.06 264.57 0.92 0.91 0.99 0.99
relu6 forward: 0.14 0.06 0.03 0.06 0.13 0.21 0.33 0.94 0.96
relu6 backward: 0.04 0.11 0.09 0.19 0.36 1.25 0.46 1.06 0.94
drop6 forward: 4.01 4.09 4.11 4.28 4.61 0.50 0.57 0.52 0.53
drop6 backward: 0.07 0.12 0.18 0.37 0.72 0.86 0.85 1.00 0.95
fc7 forward: 6.05 11.21 16.81 33.46 65.28 0.92 0.85 0.99 0.97
fc7 backward: 10.07 18.99 30.49 60.42 118.79 0.94 0.91 0.99 0.98
relu7 forward: 0.12 0.05 0.03 0.06 0.13 0.20 0.39 0.94 0.96
relu7 backward: 0.05 0.09 0.12 0.17 0.32 0.88 0.72 0.71 0.94
drop7 forward: 4.01 4.08 4.08 4.28 4.60 0.50 0.57 0.52 0.53
drop7 backward: 0.08 0.17 0.20 0.38 0.73 1.00 0.67 0.91 0.96
fc8 forward: 1.85 2.59 4.47 8.55 15.92 0.69 0.98 0.95 0.93
fc8 backward: 3.09 5.28 7.53 14.66 28.83 0.85 0.81 0.97 0.98
loss forward: 11.19 11.00 10.82 10.91 10.73 0.49 0.56 0.50 0.49
loss backward: 0.22 0.17 0.17 0.17 0.20 0.39 0.56 0.50 0.57
all Forward 423.95 761.99 1206.21 2237.89 4318.62 0.89 0.90 0.92 0.96
all Backward 662.77 1303.28 2093.61 4000.3 7884.59 0.98 0.91 0.95 0.98
all Fwd-Bwd 11080 20916 33242 62807 122290 0.94 0.91 0.94 0.97

small comment on cmp columns: reported formula for X.vs.Y is: (time_Y/time_X)*(Y/X) -- i.e. parallelization efficiency
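The formula above can be sketched as follows. The function name is illustrative and the sample timings are taken from the table. Because the table's cmp values were presumably computed from unrounded timings, recomputing them from the printed timings can differ in the last digit.

```python
def parallel_efficiency(time_x: float, time_y: float,
                        threads_x: int, threads_y: int) -> float:
    """X.vs.Y efficiency: (time_Y / time_X) * (Y / X); 1.0 means perfect scaling."""
    return (time_y / time_x) * (threads_y / threads_x)


# conv1 forward: 66.96 at 28 threads, 133.95 at 14 threads, 212.41 at 8 threads
print(f"28.vs.14: {parallel_efficiency(66.96, 133.95, 28, 14):.2f}")
print(f"14.vs.8:  {parallel_efficiency(133.95, 212.41, 14, 8):.2f}")
```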

ozabluda commented 8 years ago

@rsdubtso:

I could not find the frequency fused for the i7 CPU mentioned above, but you can find it out using prime95 v27.9 or later, for example. However, current-related throttling may occur earlier than you hit TDP budget limit related to heavy instructions.

Thank you for checking. Even though Intel's documentation does list current and power limits, I think all(?) Intel CPUs are in practice limited only by TDP. For example, overclocked Intel CPUs are known to suck 400W on Prime95 likely with long-term damage, and Intel CPUs don't prevent it, if cooled:

From official Asus overclocking guide: “”” In our testing to date, the average overclocked frequency for 5960X processors is 4.5GHz. Very good processors will achieve 4.6GHz fully stable with less than 1.30Vcore. […] Users should avoid running Prime95 small FFTs on 5960X CPUs when overclocked. Over 4.4GHz, the Prime software pulls 400W of power through the CPU. “”” http://rog.asus.com/365052014/overclocking/rog-overclocking-guide-core-for-5960x-5930k-5820k/

ozabluda commented 8 years ago

@emfomenk, thanks for the excellent table. Looking only at conv and fc layers, I see excellent scalability 2=>4=>8=>14=>28 in conv layers (except 2=>4=>8=>14 in conv1 forward and 14=>28 in conv1 forward, conv2 backward), and some degradation in scalability 8=>14 and 14=>28 in fc layers. Updating my utilization table for 2 cores, we see that utilization improved conv+fc forward 65%=>73% (maybe some of it is due to AVX clock boost?), while conv+fc backward didn't improve much (65%=>68%). We can see that utilization does/doesn't improve for 2 cores. Now, the only thing missing is 1 core :-)

E5-2697v3 with 2 cores has: 2 cores * 32FLOPs/cycle * 2.2 GHz (AVX Core Freq) = 141 GFLOP/s (AVX boost goes from 2.9-3.3 GHz, depending on the number of cores active)

IntelCaffe, mb=256, E5-2697v3
layer direction MFLOP ms(14-core) util(14-core) ms(28-core) util(28-core) ms(2-core) util(2-core)
conv1 forward: 211 133.78 41% 66.79 41% 612.4 63%
conv1 backward: 422 101.1 108% 56.35 97% 650.71 118%
conv2 forward: 448 163.55 71% 107.55 54% 1120.98 73%
conv2 backward: 896 416.93 56% 208.84 56% 2826.54 58%
conv3 forward: 299 105.66 73% 52.77 74% 725.27 75%
conv3 backward: 598 228.13 68% 114.76 68% 1562.49 69%
conv4 forward: 224 81.77 71% 40.92 71% 547.34 74%
conv4 backward: 448 176.46 66% 88.66 66% 1183.89 69%
conv5 forward: 150 56.16 69% 28.06 69% 368.19 74%
conv5 backward: 300 120.98 64% 60.74 64% 797.17 68%
fc6 forward: 75 23.39 84% 13.01 75% 137.74 100%
fc6 backward: 75 41.84 47% 22.64 43% 264.57 52%
fc7 forward: 34 11.14 79% 5.99 74% 65.28 95%
fc7 backward: 34 18.95 47% 10.08 44% 118.79 52%
fc8 forward: 8 2.59 80% 1.87 56% 15.92 91%
fc8 backward: 8 5.26 39% 3.1 34% 28.83 50%
Conv+fc Forward 1449 578.04 65% 316.96 59% 3593.12 73%
Conv+fc Backward 2781 1109.65 65% 565.17 64% 7432.99 68%
Conv+fc F/B: 4231 1687.69 51% 882.13 46% 11026.11 63%

emfomenk commented 8 years ago

Just quick update. Recently we released technical preview of Multinode Caffe. The link: https://software.intel.com/en-us/articles/caffe-training-on-multi-node-distributed-memory-systems-based-on-intel-xeon-processor-e5

The results are shown for Alexnet. We use data parallelism (for the first half of the net: from data till pool5) as well as model parallelism (for the second half: from fc6 till the end). The behavior of Multinode Caffe almost duplicates the behavior of Singlenode Caffe. This puts some limitations on scalability. Though we were able to achieve 12.3x, 19.2x and 29.4x speed-up on 16, 32 and 64 nodes respectively.
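For reference, a small sketch that turns the quoted speedups into per-node scaling efficiency (speedup divided by node count). The speedups are the ones stated above; the variable names are illustrative.

```python
# Reported speedup over a single node, keyed by node count
speedups = {16: 12.3, 32: 19.2, 64: 29.4}

# Efficiency of 1.0 would mean perfectly linear scaling
efficiency = {nodes: speedup / nodes for nodes, speedup in speedups.items()}

for nodes, value in efficiency.items():
    print(f"{nodes} nodes: {speedups[nodes]}x speedup, {value:.0%} scaling efficiency")
```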

ozabluda commented 8 years ago

@emfomenk, thank you for the summary. Sorry, I don't understand what you mean by

The behavior of Multinode Caffe almost duplicates the behavior of Singlenode Caffe. This puts some limitations on scalability.

I also don't understand from the article what the effective minibatch is for, say, 64 nodes. Is it still 256, i.e. 4 per node? For multinode synchronous SGD, it's probably best to switch to the 1-col AlexNet from the "One weird Trick..." paper and follow the paper.

emfomenk commented 8 years ago

@ozabluda, I mean the algorithm in Multinode Caffe (underlying math) is the same as in Singlenode Caffe: Forward, Backward, SGD, the same parameters and so on. In particular it means that there are not many possibilities to parallelize the work.

The only difference in Multinode version (from math point of view) is slightly modified SGD solver, which allows to apply diff right after backward step for current layer (this was made to be able to benefit from MPI parallelization in current approach). It looks like this modification doesn't affect convergence -- at least we were able to train Alexnet in the same amount of iterations as in Singlenode case.

Regarding minibatch: for 16 nodes minibatch=256 was used, for 32 nodes minibatch=512, and for 64 nodes minibatch=1024. It means that each node (in 16 nodes case) took 256/16=16 images in its "local" minibatch.

Yes, you are right that there are much better ways to implement multinode training (though, the math would be slightly different...), but the original idea was just to show that it is possible to implement good parallelization even for this particular model.

ozabluda commented 8 years ago

I mean the algorithm in Multinode Caffe (underlying math) is the same as in Singlenode Caffe: Forward, Backward, SGD, the same parameters and so on. In particular it means that there are not many possibilities to parallelize the work.

I see. Does it mean it is approximately the same as single-node multi-GPU Caffe? What about the parameter update step? Is it centralized, or also distributed, just like single-node multi-GPU Caffe?

Article says """reached 80% top-5 accuracy in just over 5 hours on a 64-node""". Is that 90 epochs with minibatch=1024? AlexNet from the original paper reached 81.8% after 90 epochs with minibatch=128.

P.S. Graph incorrectly says "E5-2697 v3 18 cores"

ozabluda commented 8 years ago

Article says """reached 80% top-5 accuracy in just over 5 hours on a 64-node""". Is that 90 epochs with minibatch=1024? AlexNet from the original paper reached 81.8% after 90 epochs with minibatch=128.

Correction: 81.8% top-5 from the paper was with averaging predictions of 5 crops plus their horizontal reflections. Standard Caffe "test" does 1 random crop with no reflections, for which 80.2-80.4% top-5 is reached in 60-70 epochs, depending. How many epochs was it with minibatch=1024?

emfomenk commented 8 years ago

@ozabluda, sorry, but I am not familiar with multi-GPU Caffe. I need to look at the code.

In Multinode Caffe for the first half of the net the parameter updates are centralized (since parallelization happens on minibatch, all convolutions parameters are the same for all nodes). For the second half updates are distributed, since fully-connected layers' weights are distributed across the nodes.

Just to be aligned: one epoch == one full pass over the database. We always ran Caffe (singlenode and multinode versions) for 90 epochs (this number was just fixed). We saw that accuracy didn't improve much after epoch ~40-50, but I didn't save intermediate snapshots and can't say for sure right now what the accuracy was after 60 or 70 epochs. If you want I can rerun the training and report the top-5 accuracy for these epoch numbers.

ozabluda commented 8 years ago

We always ran Caffe (singlenode and multinode versions) for 90 epochs

Great. I think the web article should say that explicitly, especially since it is actually faster than what could be guessed from """reached 80% top-5 accuracy""", which can mean as little as 40, as you noticed:

We saw that accuracy didn't improve much after epoch ~40-50 [...] If you want, I can rerun the training and report the top-5 accuracy for these epoch numbers.

Thank you for the offer, knowing that it's 90 epochs is good enough for me.
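For reference, a back-of-the-envelope sketch of what 90 epochs implies. The training-set size below is the standard ILSVRC-2012 figure and is an assumption about what was actually used; the article only says "just over 5 hours", so plugging in exactly 5 hours gives an upper bound on throughput.

```python
import math

# Assumption: the full ILSVRC-2012 training set was used.
IMAGENET_TRAIN_IMAGES = 1_281_167

def iterations(epochs, minibatch, dataset=IMAGENET_TRAIN_IMAGES):
    """Number of SGD iterations needed to cover `epochs` passes over the data."""
    return math.ceil(epochs * dataset / minibatch)

def images_per_second(epochs, hours, dataset=IMAGENET_TRAIN_IMAGES):
    """Average training throughput over the whole run."""
    return epochs * dataset / (hours * 3600.0)

if __name__ == "__main__":
    print(iterations(90, 1024))                # iterations at minibatch=1024
    print(images_per_second(90, 5.0))          # whole cluster, upper bound
    print(images_per_second(90, 5.0) / 64)     # per node on 64 nodes
```

Under these assumptions that is roughly 6,400 images/s for the cluster, or about 100 images/s per node.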

Off-topic part:

I am actually more interested in a number more precise than 80% (precision like 80.xx% would be better) for minibatch=1024 [1], single model, single crop, top-5 and top-1 (Caffe can do both simultaneously). I am also interested in your ultimate accuracy for minibatch=256 and 512. As you noted, with a growing number of nodes you have to increase the minibatch size, which negatively affects accuracy.

[1] BTW, did you increase the learning rate 4x, compared to minibatch=256? If yes, how did that affect accuracy? How about increasing the learning rate sqrt(4)=2x?
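To make the two options in [1] concrete, here is a small sketch of linear versus square-root learning-rate scaling. `scaled_lr` is a hypothetical helper, and the base learning rate of 0.01 at minibatch=256 is my assumption about the default AlexNet solver settings, not something stated in this thread.

```python
import math

def scaled_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate when the minibatch size changes.

    rule="linear": multiply by the batch-size ratio (4x for 256 -> 1024).
    rule="sqrt":   multiply by the square root of the ratio (2x for 256 -> 1024).
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError("unknown rule: " + rule)

if __name__ == "__main__":
    base_lr = 0.01  # assumed default for minibatch=256
    for rule in ("linear", "sqrt"):
        print(rule, scaled_lr(base_lr, 256, 1024, rule))
```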

ozabluda commented 8 years ago

This somewhat explains how Intel's Multi-node Caffe works https://github.com/BVLC/caffe/pull/3252

emfomenk commented 8 years ago

Please take a look at https://communities.intel.com/community/itpeernetwork/datastack/blog/2015/11/12/myth-busted-general-purpose-cpus-can-t-tackle-deep-neural-network-training-part-2 for more technical details on the Intel Multinode Caffe tech-preview, which actually uses one weird trick... :) There is more on the technical side. Unfortunately, we didn't play with the learning rate, and it was always the same (the default one from bvlc_alexnet/solver.prototxt).

ducha-aiki commented 8 years ago

Speaking about accuracy, this could be used as baseline: https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val

bhack commented 8 years ago

There is now an official Intel OpenCL PR at https://github.com/BVLC/caffe/pull/3355. /cc @gongzg

ozabluda commented 8 years ago

@scott-gray>Yes this is F(2x2,3x3). [...] I'm able to fit this all in one block for K=32 and 4 overlapping coordinates of x,y each with 8 units of minibatch. [...] The in block overlap is key as that's what gives you such a high L1 hit rate, otherwise you'd be bandwidth bound on L2.
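For readers who have not seen the notation: F(2x2,3x3) computes a 2x2 output tile of a 3x3 correlation from a 4x4 input tile using 16 elementwise multiplies instead of 36. The sketch below is a plain-Python reference using the standard published transform matrices; it is not the kernel under discussion and makes no attempt at performance. The helper names are mine.

```python
# Reference Winograd F(2x2,3x3) on a single tile, checked against direct correlation.

BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]          # input transform  B^T (4x4)
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]         # filter transform G   (4x3)
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]         # output transform A^T (2x4)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter. Returns the 2x2 output tile."""
    u = matmul(matmul(G, g), transpose(G))      # transformed filter, 4x4
    v = matmul(matmul(BT, d), transpose(BT))    # transformed input, 4x4
    # 16 elementwise multiplies. With channels, filters and minibatch added,
    # each of these 16 positions becomes its own GEMM.
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, m), transpose(AT))

def direct_2x2_3x3(d, g):
    """Direct 3x3 correlation producing the same 2x2 tile (36 multiplies)."""
    return [[sum(d[y + i][x + j] * g[i][j] for i in range(3) for j in range(3))
             for x in range(2)] for y in range(2)]
```

The 16 elementwise positions are where the "batch of 16 gemms" comes from once the sum over input channels is included.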

With F(2x2,3x3) and a 2x2 (super)block, the tile size is 6x6. In the two other dimensions the tile size is K32xN8. The outer loop is over input channels (C). With 4-byte fp32, for each 6x6 (super)block (=tile) we have:

Filters: 32×3×3×4 = 1152 bytes
Input: 6×6×8×4 = 1152 bytes
Output: 4×4×32×8×4 = 16384 bytes
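The same arithmetic as a small sketch, so the numbers can be re-derived for other tile shapes. `superblock_bytes` and its parameter names are hypothetical; the sizes are per input channel and per (super)block, as in the figures above.

```python
FP32_BYTES = 4

def superblock_bytes(k=32, n=8, superblock=2, out_tile=2, filt=3):
    """Per-input-channel footprint of one (super)block, in bytes.

    superblock x superblock Winograd tiles, each producing out_tile x out_tile
    outputs, share one input region with (filt - 1) units of overlap.
    """
    out_edge = superblock * out_tile      # 4 output pixels per side
    in_edge = out_edge + filt - 1         # 6 input pixels per side
    return {
        "filters": k * filt * filt * FP32_BYTES,
        "input": in_edge * in_edge * n * FP32_BYTES,
        "output": out_edge * out_edge * k * n * FP32_BYTES,
    }

if __name__ == "__main__":
    print(superblock_bytes())
```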

1 loop (128 threads) does the image transform inline and the other (128 threads) the filter transform (256 threads total). I can fit two blocks on an SM to cover bar.sync latencies.

Do I understand correctly that Filters and Input go to L1 (24 KB per SM) and output is accumulated in the registers (64k 32-bit registers per SM)? Do you use Shared Memory (96 KB per SM) at all? What limits it to two blocks on an SM?

I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is how it actually works. The outer product dims of the batched gemm are K and Y/4 × X/4 × N. So I don't just have 8 points of N on the outer product, but 4 sets of x,y coordinates of 8 points of N arranged in a 2x2 superblock. With the 2 units of overlap in each direction, this hugely increases the utilization of the L1 cache and it's what makes it possible for this kernel to have such dense global loads (16 loads in ~256 cycles is a lot).

Do I understand correctly that the 4 thread blocks (256 threads each) that work on the same 2x2 superblock, really know nothing about each other, solely relying on L1 for transparent data reuse?

scott-gray commented 8 years ago

You can find the latest code for F(2x2,3x3) here:

https://github.com/NervanaSystems/neon/blob/master/neon/backends/kernels/sass/xconv_winograd_2x2_3x3_32x32.sass

This kernel uses 256 threads, 128 registers per thread and 32 KB of shared memory. This means the threads and registers are limiting the occupancy to 2 blocks per SM and 4 warps per scheduler.
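A rough occupancy sketch that reproduces those two numbers. The per-SM limits are assumptions for a Maxwell-class SM (64k registers and 96 KB shared memory as mentioned earlier in the thread, plus 2048 resident threads and 4 warp schedulers), and the model ignores allocation granularity, so treat it as a sanity check rather than a replacement for the real occupancy calculator.

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  max_threads=2048, max_regs=65536, max_smem=96 * 1024,
                  schedulers=4, warp_size=32):
    """Simplified occupancy model: the tightest per-SM resource wins."""
    limits = {
        "threads": max_threads // threads_per_block,
        "registers": max_regs // (threads_per_block * regs_per_thread),
        "shared_memory": max_smem // smem_per_block,
    }
    blocks = min(limits.values())
    warps_per_scheduler = blocks * threads_per_block // warp_size // schedulers
    return blocks, warps_per_scheduler, limits

if __name__ == "__main__":
    # 256 threads, 128 registers per thread, 32 KB shared memory per block
    print(blocks_per_sm(256, 128, 32 * 1024))
```

In this simplified model the register file is the binding limit: 256 threads × 128 registers is 32768 registers per block, so two blocks fill the 64k registers of an SM.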

The shared memory is mainly used for storing the computed transforms and facilitating the batched gemm. The gemm tile is 32x32 and we have 16 of them in the same block. This means we only have enough shared memory to store 4 outer product lines at a time, double buffered. So the gemm loops are unrolled 4 times. We use 2 separate loops to compute the image and filter transforms inline.
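One way to see how 4 double-buffered outer product lines could add up to the 32 KB figure. This is my reading, not something taken from the kernel source: I assume each outer product line of each 32x32 gemm needs one 32-element image vector and one 32-element filter vector in fp32. The function name is hypothetical.

```python
def gemm_smem_bytes(gemms=16, tile_m=32, tile_n=32, lines=4, buffers=2, elem_bytes=4):
    """Shared memory needed to stage `lines` outer-product lines for all gemms.

    Assumption: one line of one gemm = tile_m + tile_n operands
    (one vector per side of the outer product).
    """
    bytes_per_line = gemms * (tile_m + tile_n) * elem_bytes
    return bytes_per_line * lines * buffers

if __name__ == "__main__":
    print(gemm_smem_bytes())               # bytes with the defaults above
    print(gemm_smem_bytes() == 32 * 1024)  # matches the 32 KB quoted for the kernel
```

Under that assumption, staging 8 lines at once would need 64 KB per block, which would not leave room for two blocks per SM.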

When super blocking is in effect, you can get a lot of L1 cache hits, reducing the bandwidth from L2.

This implementation is currently significantly more efficient than the one found in cuDNN 5.0 and up.