soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets
MIT License

[August 2015] Rejigging the marks... #46

Closed: soumith closed this issue 8 years ago

soumith commented 8 years ago

With cuDNN R3 coming in, improvements to Nervana, faster Facebook kernels, and a new kid on the block called Chainer, I will be doing a minor re-run of the benchmarks to see how things have improved.

Target date: August 15th.

I am still thinking quite a lot about how to take the benchmarks forward: beyond ConvNets, beyond images (into NLP, video, and audio), and beyond single-GPU. If any domain experts have suggestions (especially for audio and NLP), please do write to me.

The only thing that stopped me from multi-GPU benchmarks was the lack of enough frameworks to benchmark. That seems to have changed, and a decent number of frameworks now support multi-GPU, so I will plan on that.

More fun to come soon.

Checklist:

hughperkins commented 8 years ago

@soumith thank you very much for providing these benchmark results. Very useful :-)

naibaf7 commented 8 years ago

@soumith Thanks - clearly there is work to do on the GreenTea convolution code. There is also a bottleneck in backward processing that needs to be solved. Performance is expected to get much faster within the next two months (batched GEMM/GEMV).

It is interesting to see how much slower an OpenCL implementation is on minibatches when using code identical to CUDA Caffe; with optimized OpenCL code this will change.

bhack commented 8 years ago

Kudos to @scott-gray. Currently, he has the fastest open-source implementation.

naibaf7 commented 8 years ago

@bhack Then it's probably a really good idea to replicate his kernels in GCN assembly ;)

On another note, this run used ViennaCL BLAS; clBLAS will probably give higher performance next time.

@lunochod Have you seen this? There is still a lot of work to do on the OpenCL (compute kernel) side. It seems that just duplicating CUDA kernels does not really give good performance (as we already know...).

scott-gray commented 8 years ago

Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation. Perhaps with a bit more optimization they can overtake the spatial domain on 3x3 filters? I wouldn't be surprised if we see much better fp16 numbers from them soon.

My GoogLeNet numbers may look good, but I still have a lot of optimizations to make for the smaller feature-map values in there. Right now I'm optimized for multiples of 64; I'll get that down to 32 this weekend. My CHWN tensor layout is also really helpful on those inception groupings.
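
For readers unfamiliar with CHWN: the batch dimension N is innermost (contiguous), so channel-wise splits and concatenations, which is exactly what inception groupings do, stay contiguous. A minimal NumPy sketch of the layout arithmetic only (illustrative names, not neon's kernels):

```python
import numpy as np

N, C, H, W = 128, 64, 28, 28
x_nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

# dimshuffle to CHWN: the batch dimension N becomes the fastest-varying axis.
x_chwn = np.ascontiguousarray(x_nchw.transpose(1, 2, 3, 0))

# Splitting (or concatenating) along C for an inception grouping is a cheap,
# contiguous slice in CHWN...
branch_a, branch_b = x_chwn[:32], x_chwn[32:]
assert branch_a.flags["C_CONTIGUOUS"] and branch_b.flags["C_CONTIGUOUS"]

# ...whereas the same channel split in NCHW is strided per sample.
assert not x_nchw[:, :32].flags["C_CONTIGUOUS"]
```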

A brand new version of neon is about to be released. You'll be able to run all of these networks out of the box (plus lots more). The new syntax is much improved and more Torch- or Keras-like (perhaps even better).

Anyway, here's a changelog of updates since the last version:

No more multiplying by zero to implement padding in fprop and bprop (I now slice both the input and the filter)
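
A rough NumPy illustration of that idea (assumed details, not the actual kernel code): rather than zero-padding the input and multiplying by the zeros, compute the valid overlap between the filter window and the image and slice both operands to it.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, H, W, R, S, pad = 3, 2, 6, 6, 3, 3, 1
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((K, C, R, S))
P, Q = H + 2 * pad - R + 1, W + 2 * pad - S + 1

out = np.zeros((K, P, Q))
for p in range(P):
    for q in range(Q):
        h0, w0 = p - pad, q - pad                 # window corner in input coords
        r0, r1 = max(0, -h0), min(R, H - h0)      # valid filter rows
        s0, s1 = max(0, -w0), min(S, W - w0)      # valid filter cols
        x_tile = x[:, h0 + r0:h0 + r1, w0 + s0:w0 + s1]   # slice the input
        w_tile = w[:, :, r0:r1, s0:s1]                    # slice the filter
        out[:, p, q] = np.tensordot(w_tile, x_tile, axes=([1, 2, 3], [0, 1, 2]))

# Reference: explicit zero-padding gives the same result.
x_pad = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
ref = np.zeros_like(out)
for p in range(P):
    for q in range(Q):
        ref[:, p, q] = np.tensordot(w, x_pad[:, p:p+R, q:q+S],
                                    axes=([1, 2, 3], [0, 1, 2]))
assert np.allclose(out, ref)
```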

Figured out a different way to do integer division for the now dynamically sized slice lookup table.
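
For context: an integer divide is expensive on the GPU, and the divisor for a dynamically sized lookup table is only known at launch time, so the usual approach is to precompute a "magic" multiplier and shift on the host and replace i // d with a multiply and a shift in the kernel. A small Python sketch of the standard multiply-shift construction (not necessarily the exact variant used in the neon kernels):

```python
def magic_divider(d, nmax):
    """Multiplier/shift pair (m, s) such that (i * m) >> s == i // d for 0 <= i <= nmax."""
    # Choose s large enough that the rounding error of m = ceil(2**s / d)
    # never reaches the next integer over the required input range.
    s = 0
    while (1 << s) <= nmax * max(d - 1, 1):
        s += 1
    m = ((1 << s) + d - 1) // d          # ceil(2**s / d)
    return m, s

# Brute-force check over the range the lookup-table index would take.
d, nmax = 7, 10000
m, s = magic_divider(d, nmax)
assert all((i * m) >> s == i // d for i in range(nmax + 1))
```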

No more atomic adds in bprop. I've cast bprop as fprop upside down and the kernels are nearly identical. It requires a dimshuffle on the filter but this just takes microseconds and a small amount of additional memory that can be shared with all conv ops. Bprop used to be bandwidth bound on those atomic adds.
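
The identity behind "bprop as fprop upside down" can be checked in a few lines of NumPy: for a stride-1, unpadded cross-correlation, the gradient with respect to the input is a forward cross-correlation of the zero-padded deltas with the filter dimshuffled (K and C swapped) and flipped spatially. This is the textbook identity sketched with illustrative names, not neon's implementation:

```python
import numpy as np

def corr2d(x, w):
    """Forward op: x is (C, H, W), w is (K, C, R, S) -> out is (K, P, Q)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1
    out = np.zeros((K, P, Q))
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                out[k, p, q] = np.sum(x[:, p:p+R, q:q+S] * w[k])
    return out

rng = np.random.default_rng(0)
C, K, H, W, R, S = 3, 4, 8, 8, 3, 3
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((K, C, R, S))
delta = rng.standard_normal((K, H - R + 1, W - S + 1))   # dL/d(out)

# Reference gradient w.r.t. x, computed by explicit accumulation.
dx_ref = np.zeros_like(x)
for k in range(K):
    for p in range(delta.shape[1]):
        for q in range(delta.shape[2]):
            dx_ref[:, p:p+R, q:q+S] += delta[k, p, q] * w[k]

# "Bprop as fprop": pad the deltas, dimshuffle the filter (K <-> C) and flip
# it spatially, then reuse the forward routine unchanged.
delta_pad = np.pad(delta, ((0, 0), (R - 1, R - 1), (S - 1, S - 1)))
w_bprop = np.flip(w.transpose(1, 0, 2, 3), axis=(2, 3))   # (C, K, R, S)
dx = corr2d(delta_pad, w_bprop)

assert np.allclose(dx, dx_ref)
```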

Tweaked the p,q block ordering to improve L2 cache performance. I'm using a zigzag pattern now for all operations.
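
A toy sketch of what a zigzag (boustrophedon) ordering of output-tile coordinates looks like; the point is that consecutive tiles stay spatially adjacent, so the input lines they share are more likely to still be resident in L2. The real kernels order CTAs in hardware, so this Python version is purely illustrative:

```python
def zigzag(P, Q):
    """Visit (p, q) tiles row by row, reversing direction on every other row."""
    order = []
    for p in range(P):
        cols = range(Q) if p % 2 == 0 else range(Q - 1, -1, -1)
        for q in cols:
            order.append((p, q))
    return order

print(zigzag(3, 4))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (1, 2), (1, 1), (1, 0), (2, 0), ...]
```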

Update already had a mode where you could stack all the gemm ops to eliminate atomic adds, but I've streamlined that stacking operation. Update also now fetches 32 rows deep. This comes at the cost of an instruction cache miss inside the main gemm loop, but it is easily covered by the occupancy. The reason for doing this is the same as for using a 32x33 shared memory block to implement a transpose: with N contiguous, the update op has expensive strided access patterns on both the input and the delta.
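
The stacking trade-off can be pictured like this: each block writes its partial weight-gradient gemm into its own slice of a stacked buffer, and one cheap reduction replaces many contended atomic adds. A trivial NumPy sketch with made-up shapes and names:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, n_blocks = 64, 32, 8
parts = rng.standard_normal((n_blocks, K, C))   # one partial dW per gemm block

# Atomic-add style: every block accumulates into a single shared buffer.
dW_atomic = np.zeros((K, C))
for p in parts:
    dW_atomic += p

# Stacked style: independent writes, then one reduction at the end.
dW_stacked = parts.sum(axis=0)

assert np.allclose(dW_atomic, dW_stacked)
```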

I also eliminate all shared memory bank conflicts when storing the global loads to shared memory, using some clever shifting.
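
The 32x33 trick mentioned above (and the equivalent shifting) comes down to bank arithmetic: shared memory has 32 four-byte banks, so walking down a column of a tile with a row stride of 32 words hits one bank 32 times, while a stride of 33 spreads the column across all 32 banks. A quick sketch of that arithmetic, assuming the usual 32-bank configuration:

```python
BANKS = 32

def banks_for_column(row_stride, col, rows=32):
    """Bank index touched by each row of one column in a 2D word-sized tile."""
    return [(r * row_stride + col) % BANKS for r in range(rows)]

print(len(set(banks_for_column(32, 0))))  # 1  -> 32-way bank conflict
print(len(set(banks_for_column(33, 0))))  # 32 -> conflict-free
```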

Added a beta param to bprop to allow delta accumulation for inception groupings.
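
The beta parameter works like the beta in GEMM: with beta = 1 the bprop kernel accumulates into the existing delta buffer, so the input-gradients of several inception branches can land in one tensor without a separate add pass. A tiny NumPy sketch of the semantics (illustrative names only):

```python
import numpy as np

rng = np.random.default_rng(0)
branch_grads = [rng.standard_normal((64, 28, 28)) for _ in range(3)]  # stand-ins for per-branch dL/dx

dx = np.zeros((64, 28, 28))
for i, g in enumerate(branch_grads):
    beta = 0.0 if i == 0 else 1.0    # first branch overwrites, later branches accumulate
    dx = beta * dx + g               # what conv_bprop(..., beta=beta) would do in place

assert np.allclose(dx, sum(branch_grads))
```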

soumith commented 8 years ago

Sorry, I closed the issue as it got side-tracked by lots of other discussions. Let's discuss more in https://github.com/soumith/convnet-benchmarks/issues/56

Scott, I'd appreciate it if you re-pasted your comment there so we can discuss further.