[August 2015] Rejigging the marks...

soumith commented 9 years ago

The benchmarks this time around are interesting, with some fairly clear trends emerging for the near future.

Looking Back

First, some appreciation for where things are,

9 months ago, we were ~3x slower on alexnet and ~4x slower on overfeat. Training that took 3 weeks, takes 1 week (on the same 1-GPU metric). That is a huge fundamental speedup results in several man-hours saved in waiting for experiments to finish.

Pushing these boundaries so fast, in such a short time-frame is quite something. There's two sets of teams who have made this happen:

NVIDIA, with their Maxwell cards that are fast as f**k
Nervana Systems (Scott Gray and team) who have pushed the CUDA kernels to the limits of the GPUs with efficiencies > 95%.
Now

The result of Nervana pushing the limits of compute means that others who were competing to be faster had to play smarter. Nervana has pushed limits so hard that the GPU cant run at boosted clock speeds for long and has to slow down a little bit.

Nervana had the flexibility to choose the ideal data layout for the task, and they used it to its maximum potential, combined with very low-level optimizations and hand-coded assembly.

The trend of the near-future

The CuDNN and Facebook teams did not have this kind of flexibility because they were working with a constraint of supporting existing frameworks such as Caffe and Torch which froze themselves to use the BDHW data layout, which is not the most ideal data layout for convolutions in the spatial domain.

Switching to FFT-based convolutions and optimizing the hell out of them was an obvious choice. However, there has been skepticism that FFT-based convolutions take too much extra memory. This was demonstrated by the Facebook convolutions (FBFFT, FBCuFFT) which were fairly fast, but took an unreasonable amount of extra memory.

However, FFT-based convolutions dont necessarily need a lot of extra memory, especially if one writes the full FFT pipeline from scratch. In fact, Nicolas Vasilache from Facebook demonstrated that FFT based convolutions dont need any extra memory with a single-threaded implementation, but he did not optimize them further to achieve competitive performance. He also showcased a tiling strategy for FFT based convolutions that speeds the convolutions up quite a bit, reducing the extra memory needed as well.

NVIDIA with their R3 release of CuDNN show that their FFT based convolutions can be very competitive in speed with Nervana kernels, and faster in some cases. (See imagenet_winners in README.md on the main page for more details)

One has to remember that FFT based convolutions take the same speed to compute regardless of the convolution size (except in a tiling FFT strategy). So if you have a 3x3 convolution layer or a 15x15, it takes the same speed.

NVIDIA fused lots of the CUDA kernels in their implementation to reduce the amount of memory needed by the FFT convolutions. This reduces the amount of extra memory needed, and it is a matter of time before which they will release completely-fused kernels which barely need any extra memory.

CUDNN (R3) Extra Memory to train Imagenet winners

Network	Extra Memory
AlexNet	324 MB
VGG-A	2.6 GB
Overfeat	2.59 GB
GoogleNet	202 MB

The overall trend I see is that:

Nervana has already pushed spatial domain convolutions and do not have any more optimizations to do to speed things up even more
FFT-based convolutions seem to have so much more room to optimize further
NVIDIA switched to focusing on optimizing FFT convolutions and have very competitive performance, which they will only improve over time.

p.s.: sorry for not finishing the Chainer benchmarks, I am having some issues running things. My chainer install seems to have some strange CUDA issues. I will update the README with those results in a week or two when I get time. Overall, my first feel of Chainer is that I am a bit annoyed at the 15/20/30 seconds it takes to compile the compute graph. If I read the documentation hard enough, I'll probably find a debug mode that starts running faster, haven't come that far yet!

scott-gray commented 9 years ago

Reposting this here:

Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation. Perhaps with a bit more optimization they can overtake spatial domain on 3x3 filters? I wouldn't be surprised if we see much better fp16 numbers from them soon.

My GoogLeNet numbers may look good but I still have a lot of optimizations to make for the smaller feature map values in there. Right now I'm optimized for multiples of 64. I'll get that down to 32 this weekend. My CHWN tensor layout is also really helpful on those inception groupings.

A brand new version of neon is about to be released. You'll be able to run all these networks out of the box (plus lots more). The new syntax is much improved and more torch or keras like (perhaps better even).

Anyway, here's a changelog of updates since the last version:

No more multiplying by zero to implement padding in fprop and bprop (I now slice both the input and the filter)

Figured out a different way to do integer division for the now dynamically sized slice lookup table.

No more atomic adds in bprop. I've cast bprop as fprop upside down and the kernels are nearly identical. It requires a dimshuffle on the filter but this just takes microseconds and a small amount of additional memory that can be shared with all conv ops. Bprop used to be bandwidth bound on those atomic adds.

Tweaked the p,q block ordering to improve L2 cache performance. I'm using a zigzag pattern now for all operations.

Update already had a mode where you could stack all the gemm ops to eliminate atomic adds, but I've streamlined that stacking operation. Update also now fetches 32 rows deep. This comes at the cost of an instruction cache miss inside the main gemm loop, but is easily covered by the occupancy. The reason for doing this is the same for using a 32x33 shared memory block to implement transpose. With N contiguous the update op has the expensive strided access patterns on both the input and delta.

I also eliminate all shared memory bank conflicts when storing the global loads to shared with some clever shifting.

Added a beta param to bprop to allow delta accumulation for inception groupings.

soumith commented 9 years ago

The CuDNN guys said that they have more slated optimizations as well, that will help GoogleNet, will updated the numbers after the final release happens.

I'm excited to see the new Neon.

In the OpenCL land, there's lots of catching up to do in terms of perf, but I am really happy that the libraries are getting feature-complete. Thanks to Hugh Perkins, Fabian Tschopp and other OpenCL torch-bearers (no pun intended).

Another interesting data point comes this week from Intel, who claim to use 64 Xeon Phi nodes to train Overfeat in 3-4 hours. That comes into the distributed territory, and it does not make me super-duper impressed, seeing internal distributed systems based on GPUs that are equally good or better. Distributed training is a separate optimization from optimizing single nodes. Intel slides at the HotChips Conference here behind conference paywall

scott-gray commented 9 years ago

Optimizing for Overfeat is like shooting fish in a barrel. And the kind of fish that no one particularly wants to eat. I'd be more impressed with Intel's results on SOTA networks of this year.

BlGene commented 9 years ago

Has anyone been able to, or interested in, compiling opencl convolutions to fpgas ( using the alters Sdk). There was the somewhat similar neuflow project but I haven't heard from them in a while.

soumith commented 9 years ago

@BlGene neuflow used to be open-source, but they closed it up and are building a startup around it called TeraDeep. Neuflow was also built for inference, not training, and it was based on fixed-point FPGAs. If there are good floating point FPGAs that are affordable, it might be a good idea to start a new community project in that direction.

hughperkins commented 9 years ago

@BlGene Do you have some kind of indicative figures on a suitable FPGA in 100-400usd price range, and its relative performance, on convolution, with GPUs in same price range? Could be theoretical analysis, and could be approximate, as long as such caveats are clearly stated.

BlGene commented 9 years ago

@soumith @hughperkins

I don't have anything concrete. I did a bit of research and came up with the following. I haven't read the papers in detail but from what I understand the sentiment seems to be a lot of work and not much faster yet ( is this correct? ). It might be possible to ask people if they are interested in comparable benchmarking... if they feel confident ;)

Quora: http://www.quora.com/Is-implementing-deep-learning-on-FPGAs-a-natural-next-step-after-the-success-with-GPUs

Research Papers:

Companies:

At Altera: http://www.slideshare.net/embeddedvision/a04-altera-singh
At Altera-Baidu: http://newsroom.altera.com/press-releases/altera-baidu-fpga-cloud-data-centers.htm
At Nervana Systems: https://gigaom.com/2014/08/21/nervana-systems-raises-3-3m-to-build-hardware-designed-for-deep-learning/
At Auviz Systems: http://auvizsystems.com/products/auvizdnn/

What are your thoughts? Maybe @scott-gray can say something for Nervana?

scott-gray commented 9 years ago

I cant really speak much of our own hardware efforts except to say it should be extremely competitive to GPUs. Generally speaking, any ASIC that's custom designed for a particular task is going to be faster than one with a more general purpose.

Though there is this recent bit on Microsoft's efforts with FPGAs: http://www.theplatform.net/2015/08/27/microsoft-extends-fpga-reach-from-bing-to-deep-learning/

Seems for them the real advantage is in power savings.

hughperkins commented 8 years ago

Looks like cuDNN is catching up. Those VGG numbers are really good for an FFT implementation.

Is cuDNN using FFT then? How/what do we know about how cuDNN works?

soumith commented 8 years ago

@hughperkins the CuDNN manual details the available algorithms they use. Also, the headers give hints as well. https://github.com/soumith/cudnn.torch/blob/R5/ffi.lua#L394-L402

hughperkins commented 8 years ago

the CuDNN manual details the available algorithms they use.

Ok. To what extent is the CuDNN manual publicly, and to what extent does one have to click through some agreement where one agrees not to reveal its contents? I guess NVIDIA has larger pockets than I do :-D

(I can find plenty of stuff about v1, eg http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-express-sharan-chetlur-cudnn-webinar.pdf and http://arxiv.org/pdf/1410.0759v2.pdf , but dont seem to be able to find any sources for v2++?)

hughperkins commented 8 years ago

(well... seems there is a paragaph in the Lavin paper https://arxiv.org/pdf/1509.09308.pdf which asserts it is using FFT:

"The FFT and convolution theorem have been used to reduce the arithmetic complexity of convnet layers, first by Mathieu et al. [10], then refined by Visalache et al. [12], and then implemented in the NVIDIA cuDNN library[1]"

... so maybe I can just cite that??? (Although figure 1 seems to imply that:

there are two CUDNN implementations: one is FFT, one is not FFT, and CUDNN switches between them using probably a heuristic, eg from the text "cuDNN appears to erroneously select its FFT algorithm for intermediate values of N despite the fact that it performs very poorly, under 2 TFLOPS."
for the lowest-level layers, which as far as I know basically dominate the time? non-FFT is being used, for all batch sizes

... so seems that stating "CUDNN is proprietary, so we cannot reason well on how it works" is not an entirely unreasonable position?)

Edit: seems I should cite http://arxiv.org/abs/1412.7580 , but I remember these used to be in convnet-benchmarks, and maybe was even the reason convnet-benchmarks were originally created :-P , but have vanished since around the time of CUDNNv2-v3, presumably because fbfft is no longer competitive with CUDNNv2-v4?

Edit2: oh wait, fbfft is still there :-)

~~Edit3: hmmm, I guess fbfft is not dependent on a blas implementation or similar? just uses its own native code? therefore easily portable to OpenCL? And very excellent performance~~ Edit4: noticed fbfft does depend on blas, so removed edit3 :-)

cliffwoolley commented 8 years ago

You can easily query which algorithm cuDNN's heuristic has selected for your problem size and memory availability. You can also just pick an algorithm and force cuDNN to use it. Or you can ask cuDNN to try every algorithm it can and report how long they took, so that you can for example pick the definitely fastest available one even if the heuristically chosen one wouldn't have been the optimal choice.

cuDNN's EULA (while yes you do have to agree to it to use the library) does not have any non-disclosure clause. Plenty of people have published papers with the results of experimenting with cuDNN already...

Hope this helps, -Cliff On May 22, 2016 3:03 AM, "Hugh Perkins" notifications@github.com wrote:

(well... seems there is a paragaph in the Lavin paper which asserts it is using FFT:

"The FFT and convolution theorem have been used to reduce the arithmetic complexity of convnet layers, first by Mathieu et al. [10], then refined by Visalache et al. [12], and then implemented in the NVIDIA cuDNN library[1]"

... so maybe I can just cite that??? (Although figure 1 seems to imply that:

there are two CUDNN implementations: one is FFT, one is not FFT, and CUDNN switches between them using probably a heuristic, eg from the text "cuDNN appears to erroneously select its FFT algorithm for intermediate values of N despite the fact that it performs very poorly, under 2 TFLOPS."

for the lowest-level layers, which as far as I know basically dominate the time? non-FFT is being used, for all batch sizes

... so seems that stating "CUDNN is proprietary, so we cannot reason well on how it works" is not an entirely unreasonable position?)

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/soumith/convnet-benchmarks/issues/56#issuecomment-220824143

hughperkins commented 8 years ago

@cliffwoolley Thanks! Bang goes my excuse for not learning about CUDNN :-D

Edit: ok, seems your reading seems plausible. The API itself might be protected by copyright plausibly, but presumably 'fair use' applies, as far as stating what is in the api, describing it and so on?

andravin commented 8 years ago

@hughperkins as @cliffwoolley suggested, call cudnnFindConvolutionForwardAlgorithm() to find the fastest cuDNN convolution algorithm for a given layer configuration.

In my paper I used cudnnGetConvolutionForwardAlgorithm() which cannot be relied on to select the fastest algorithm. As of cuDNN v.3. at least, it would select FFT for moderate batch sizes where direct convolution would have been faster. I wanted to compare FFT and Winograd directly at those sizes, so I left it that way, but perhaps it is a bit confusing.

Note that you explicitly select the algorithm to use in any cuDNN operation. For fprop convolution, the algorithms are enumerated in cudnnConvolutionFwdAlgo_t. As of cuDNN v.5., one of the choices is CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD. You could select algorithms manually and see what happens if you are curious about how they compare.

hughperkins commented 8 years ago

As of cuDNN v.5., one of the choices is CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD.

Heh! Nice :-) By the way, my apologies, I didnt realize until about 17 hours ago who you are. But I know now :-) Or, at least, I am becoming aware of your contribution to http://arxiv.org/abs/1509.09308 , which obviously blows the previous approaches to GPU convolution out of the water :-)

soumith / convnet-benchmarks