soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets
MIT License

Nervana's Neon and Winograd #93

Open soumith opened 8 years ago

soumith commented 8 years ago

After serious perf improvements across the board from NVIDIA's cuDNN R4, I suppose Nervana weren't too happy to be left behind. They've just released (as part of Neon) their Winograd-based kernels, which bring a non-trivial improvement in performance. Their blog post can be found here, where Scott will go into full detail about the technical implementation and its challenges, along with data points showing no side effects of these kernels on convergence. The implementation seems very sophisticated, and quite a challenge.

I've benchmarked them independently, and here are the numbers:

FP-16

| Network Type | Nervana Neon (ms) | CuDNN R4 (Torch) (ms) | Speedup |
| --- | --- | --- | --- |
| AlexNet | 78 | 71 | 0.91x |
| Overfeat | 176 | 242 | 1.37x |
| VGG-A | 254 | 471 | 1.85x |
| Googlenet-1 | 230 | 462 | 2.00x |

FP-32

| Network Type | Nervana Neon (ms) | CuDNN R4 (Torch) (ms) | Speedup |
| --- | --- | --- | --- |
| AlexNet | 87 | 81 | 0.93x |
| Overfeat | 211 | 268 | 1.27x |
| VGG-A | 320 | 529 | 1.65x |
| Googlenet-1 | 270 | 470 | 1.74x |
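
(For reference on how the Speedup column is derived: the entries are times, lower is better, so the speedup is the cuDNN time divided by the Neon time. A minimal sketch under that assumption:)

```python
# Assumed interpretation: table entries are times in ms (lower is better),
# so speedup = cuDNN time / Neon time. Reproduces the FP-16 table above.
fp16 = {
    "AlexNet":     (78, 71),
    "Overfeat":    (176, 242),
    "VGG-A":       (254, 471),
    "Googlenet-1": (230, 462),
}
for net, (neon_ms, cudnn_ms) in fp16.items():
    print(f"{net}: {cudnn_ms / neon_ms:.2f}x")
```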

It's really cool that they're still squeezing performance out of this generation of hardware. They seem to have real wins when the network uses 3x3 convolutions.

At this point, I expect that this is the last round of software optimizations for small convolutions, considering that they're hitting peak limits of the GPU, but happy to be surprised :)

Full logs are checked in to the nervana/ folder.

scott-gray commented 8 years ago

Sorry, the full blog post isn't quite done yet, so we put up a quick performance overview post instead. But do feel free to download the new neon and try out the kernels (and even browse the source if you're curious). I'm almost done with a new Winograd kernel that should speed things up quite a bit more for smaller 3x3 layers (like in googlenet).

andravin commented 8 years ago

@soumith Why have the cuDNN R4 Googlenet-1 numbers changed?

soumith commented 8 years ago

@andravin my copy-paste screwup. fixed.

andravin commented 8 years ago

OK, thanks, I wasn't expecting the Neon Googlenet-1 speedup to decrease. ;-) So these numbers make more sense.

scott-gray commented 8 years ago

I should point out that Andrew here not only worked through all the math for Winograd; our long discussions were also pretty integral to the successful design of these kernels.

Oh, and while we're giving credit, we were just discussing that these two probably deserve as much credit as Shmuel Winograd for working out the original math:

https://en.wikipedia.org/wiki/Andrei_Toom https://en.wikipedia.org/wiki/Stephen_Cook

I'll go into a bit more detail on this in the full blog.
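
(As a concrete reference for the Toom-Cook/Winograd connection, here's a minimal numpy sketch of the 1D F(2,3) minimal-filtering algorithm, using the transform matrices published in the Lavin & Gray paper. The kernels discussed in this thread implement the larger 2D F(4x4,3x3) variant, which nests the same idea.)

```python
import numpy as np

# F(2,3): 2 outputs of a 3-tap correlation using 4 multiplies instead of 6.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

d = np.random.randn(4)                      # input tile
g = np.random.randn(3)                      # filter
y = AT @ ((G @ g) * (BT @ d))               # 4 elementwise multiplies
assert np.allclose(y, np.correlate(d, g, mode='valid'))

# 2D F(2x2,3x3) nests the same transforms:
#   Y = AT @ ((G @ g2d @ G.T) * (BT @ d2d @ BT.T)) @ AT.T
```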

jdemouth commented 8 years ago

Thanks Scott and Andrew. We also see good gains for 3x3 with Winograd.

andravin commented 8 years ago

That's great, Julien. Can't wait to see Winograd/Cook/Toom in cuDNN. :-)

It has been almost a year since I discovered this new approach to convnet acceleration, and it is great to see these ideas having a real impact on performance now. Everybody should check out Scott's F(4x4,3x3) implementation, it is extremely clever.

jdemouth commented 8 years ago

It is indeed amazing that you were able to get F(4x4, 3x3) to work. I'm really impressed because I know for a fact that F(2x2, 3x3) is already super hard :). I am really looking forward to making it work in cuDNN.

jdemouth commented 8 years ago

@scott-gray Awesome work! The way you specialize warps in the F(4x4,3x3) kernel is just brilliant! I'm super excited and it's going to be fun to implement such a scheme for cuDNN :) and bring that speedup to the different frameworks.

andravin commented 8 years ago

Now if you guys could put your heads together and figure out a way to end the NCHW vs CHWN vs NHWC wars. There must be some way to equip these kernels with pluggable front-ends and back-ends that understand the tensor order for load / store, and leaves the computation pipeline unchanged.
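
(As a rough illustration of that pluggable front-end/back-end idea, a toy numpy sketch; the adapter and its names are hypothetical, not an existing API. The compute pipeline stays written against a single layout and only the load/store ends permute axes.)

```python
import numpy as np

# Axis permutation mapping each storage layout to NCHW (hypothetical helper).
_TO_NCHW = {
    "NCHW": (0, 1, 2, 3),
    "NHWC": (0, 3, 1, 2),
    "CHWN": (3, 0, 1, 2),
}

def run_in_layout(kernel_nchw, x, layout="NCHW"):
    """Run an NCHW-native kernel on a tensor stored in any supported layout."""
    perm = _TO_NCHW[layout]
    y = kernel_nchw(np.transpose(x, perm))        # front-end: present data as NCHW
    return np.transpose(y, np.argsort(perm))      # back-end: store in original layout
```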

scott-gray commented 8 years ago

I already have a fully fused version of that kernel that I should finish debugging this weekend. I'm hoping it will bring fprop/bprop performance for fp32 much closer to the fp16 level, as well as perform much more consistently with the size of the C/K dimensions. On the weight update side, fusion probably isn't possible due to the extremely strided memory access pattern required and no shared memory left for mitigating that. But 2 out of 3 fast operations isn't bad. In NHWC it would be the update operation that is fast and the other two slower.

I guess there's a chance in NCHW that the overlaps in the super-tiling might make full fusion possible in update, but on the downside you're slower on fprop/bprop for smallish HW because your effective tile size needs to be much bigger and you end up with a lot of zero overlap. At very small N NCHW and CHWN are pretty equivalent. But, I'm confident that CHWN is fastest overall for Winograd.

For direct conv, I'm starting to think that NHWC might be best for good performance across all minibatch sizes. There's plenty of shared memory around to efficiently transpose in place at no cost and having C as the inner dimension means that you minimize the slicing logic for all values of N and not just larger ones. But CHWN is just as good for N bigger than about 8 or so.

Also, having HWN contiguous means that you can do 1x1 conv super efficiently in a basic gemm kernel.

If I had to pick one, I'd stick with what I have: CHWN. But longer term it probably makes sense to have them all implemented to best suit the needs of the task.

Speaking of longer term, it would be nice if the community migrated to a fully open sourced implementation for all of this. This stuff is just too important to the progress of the field for it to be locked away in proprietary implementations. The more people working together on this the better for everyone. There's plenty of room to compete on the hardware implementation side.

benanne commented 8 years ago

Also, having HWN contiguous means that you can do 1x1 conv super efficiently in a basic gemm kernel.

The same goes for NHWC though, right? If I'm not mistaken, the order of these dimensions doesn't matter as long as they are contiguous. This is the TensorFlow default, I don't know if any other frameworks use it though. I think CHWN might not get adopted very easily, because everyone is used to having the leading dimension be the batch dimension nowadays (the only established framework I know of that deviates from this is cuda-convnet, which isn't used much anymore).
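
(To make the 1x1-as-gemm point concrete, a small numpy sketch for NHWC; the CHWN case is analogous, flattening the contiguous HWN axes instead. Shapes here are made up.)

```python
import numpy as np

N, H, W, C, K = 32, 14, 14, 256, 512
x = np.random.randn(N, H, W, C).astype(np.float32)   # NHWC activations
w = np.random.randn(C, K).astype(np.float32)          # 1x1 filters, C -> K channels

# With the channel dim contiguous, a 1x1 conv is a single plain matmul:
y = (x.reshape(-1, C) @ w).reshape(N, H, W, K)        # back to NHWC
```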

jdemouth commented 8 years ago

@benanne: You're right, NHWC works fine with 1x1 (as does CHWN). The issue we're having with cuDNN with 1x1 is that NCHW has the C "in the middle". Today, our direct convolution treats 1x1 much like 3x3 or 5x5, and we have complex logic that we could simplify for 1x1. Scott's CHWN is "easier" to deal with in many cases. We also suffer from the fact that our filters are KCRS when CRSK (used by Scott) would be better.

On paper, NHWC has advantages over NCHW thanks to the fact that the data is partly contiguous in memory. I'm only worried that NHWC could have a bad impact on the behavior of the TEX cache, as fetching 8xFP16 (8 is the unrolling factor of the main loop, except for Scott's new F(4x4,3x3)) is only 16B, which is not so great with respect to the cache line size.

@andravin, @scott-gray: Indeed, I think we should sit together and find a way to get the awesome performance of Scott's implementations for CHWN available to popular frameworks. We'll all be at GTC, for example. Making it open-source is a long discussion ;)

scott-gray commented 8 years ago

Right, I just meant grouped. So that's a plus for both NHWC and CHWN. You're right in that there is a lot of CUDA code written for the cuDNN layout, and migrating away from that will likely be painful. But for some, writing fresh code against a different layout might be a good option if they know they'll get a bit more speed. As I said, ideally you have the option for any layout.

Anyway, if you guys are happy sticking with NCHW, then we'll be happy to continue topping you on the benchmarks :) The neon framework isn't burdened by any legacy code and everything is being built from the ground up for speed. And with the new graph backend we're working on hopefully we can substantially improve on the ease of use as well (not that it's too bad right now).

scott-gray commented 8 years ago

@jdemouth You're not thinking creatively enough with leveraging shared memory to read deeper than 8 lines. You can cast any gemm or conv operation as a batched gemm and sum the results prior to writing out. The batch dimension in this case is just alternating groups of 8 rows.
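
(A toy numpy version of that idea: cast one gemm as a batched gemm over groups of 8 rows of K and sum the partial products before the write-out. This shows only the arithmetic, none of the shared-memory mechanics, and the sizes are made up.)

```python
import numpy as np

M, K, N, depth = 64, 64, 32, 8                 # depth = rows read per batch slice
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

# Batch dimension = alternating groups of `depth` rows of K.
A_b = A.reshape(M, K // depth, depth).transpose(1, 0, 2)   # (K/depth, M, depth)
B_b = B.reshape(K // depth, depth, N)                       # (K/depth, depth, N)
C = np.matmul(A_b, B_b).sum(axis=0)                         # sum partials before the store

assert np.allclose(C, A @ B, atol=1e-3)
```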

Happy to chat at GTC. I was looking forward to attending your talk. And I guess I should advertise my own talk here. I was scheduled for an hour but that was mysteriously shortened to just 25 minutes. So I won't be able to go into as much depth as I'd like. But on the other hand it makes preparing for it a lot easier, which means more time for writing kernels.

jdemouth commented 8 years ago

@scott-gray: Funny that you just mentioned that because I was thinking along those lines when you posted your comment :). I already have batched GEMM code for some scenarios. Not to mention my Winograd implementation for NCHW.

The DL track is pretty packed, that's the reason why your slot was shortened from 50 to 25 minutes. Like all the other talks. My talk was even cancelled.

scott-gray commented 8 years ago

Yah, I thought of the technique while developing the first Winograd kernel. It's what I meant above about being able to leverage shared memory for in-place transposes. I actually already have some fp16 32x32 gemm tiles that use this and get over 5 Tflops with a minibatch of 32. The TN tile (col major) can even outperform the 128x128 tile because it halves the overall number of strided accesses to DDR at any one time.

I haven't had a chance to release these yet since I still need to finish the complete set. Hopefully I'll get to that in the next week or so. I have new direct conv kernels I want to build first. The cuDNN advantage on small minibatch (for non 3x3s1) will soon be going away :) The goal is end to end training of convnets at very small minibatches at full utilization.

benanne commented 8 years ago

The issue we're having with cuDNN with 1x1 is that NCHW has the C "in the middle". Today, our direct convolution treats 1x1 much like 3x3 or 5x5, and we have complex logic that we could simplify for 1x1. Scott's CHWN is "easier" to deal with in many cases.

Right -- what I was saying is that, if NHWC is almost as good as CHWN, the former might be adopted much more quickly. Because TensorFlow already uses it, and because many people would find it "more natural" to have the batch size as the leading dimension.

scott-gray commented 8 years ago

I actually really like NHWC a lot, but it means I can't use my fancy new fully fused fprop/bprop F(4x4,3x3) with it. And I have a feeling the performance with it will be too good to throw away. The current partially fused kernels are trivial to convert to any layout. Just modify the external transform cuda-c code and then tweak a few lines in the batched gemm assembly for setting up the output pointers.

jdemouth commented 8 years ago

@benanne: I agree with you... We want a layout which can be adopted by the community and which is good for performance. NCHW is widely adopted (and we are making it faster at each new release of cuDNN). CHWN is easier to deal with in many cases and Scott is pushing its performance to awesome levels but it is a somewhat weird layout. Maybe NHWC brings the best of both worlds together :).

jdemouth commented 8 years ago

@scott-gray: What prevents you from using your new fused kernel with NHWC (except for the time to write it)? Is it a fetch issue due to your need for LDG.128?

Btw, were the numbers quoted in the benchmark all obtained using F(4x4,3x3) or do you use F(2x2,3x3) for some of the layers?

benanne commented 8 years ago

Here's another approach for tackling this "optimal layout" issue in frameworks using the computational graph paradigm (such as Theano and TensorFlow): stick with the canonical NCHW or NHWC layout on the surface, but have optimizations that insert alternative implementations using more efficient layouts, as well as the necessary reshape operations at the input and output side. Since many convolution and pooling operations usually follow each other (and elementwise nonlinearities are not affected by the layout), spurious reshapes can then be eliminated quite easily.
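
(A toy sketch of that graph rewrite, using a hypothetical node representation rather than Theano's or TensorFlow's actual IR: layout-specific ops get dimshuffles inserted around them, and adjacent dimshuffles whose permutations invert each other are then removed.)

```python
import numpy as np

# A "graph" here is just a list of ("dimshuffle", perm) and ("op", name) nodes.
def cancel_dimshuffles(graph):
    out = []
    for node in graph:
        if (out and node[0] == "dimshuffle" and out[-1][0] == "dimshuffle"
                and list(np.argsort(out[-1][1])) == list(node[1])):
            out.pop()            # the two permutations invert each other: drop both
        else:
            out.append(node)
    return out

g = [("dimshuffle", (0, 3, 1, 2)),   # NHWC -> NCHW
     ("dimshuffle", (0, 2, 3, 1)),   # NCHW -> NHWC, cancels the one above
     ("op", "relu")]
print(cancel_dimshuffles(g))         # [('op', 'relu')]
```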

andravin commented 8 years ago

An API is also an important requirement for adoption. That is the real reason cuDNN has been so successful: it defined a low-level C API for deep learning primitives, and nobody else did. cuDNN is both the standard API and its only implementation.

If Neon kernels were wrapped in the cuDNN API then it would be trivial to support them in your favorite framework (provided they are sane about allowing different tensor formats).

Maybe the cuDNN API is not ideal, maybe we could do better. But coding to a standard API is key to providing fast kernels that framework maintainers can actually use.

scott-gray commented 8 years ago

@jdemouth: I pick the best of both. Most of the time it's the 4x4 numbers. But for small HWN the 2x2 can be faster. Or I guess for small C/K too when the external transform isn't well amortized. But the fully fused kernel will solve that.

The fused kernel has no available shared memory to do the in-place transpose required of NHWC in fprop/bprop. For update, the data is laid out fine, and that kernel could be fused instead. But that's just 1 of 3 ops instead of 2 of 3. Plus you want fprop to be the fastest for use in inference. Also, fusing update is much more problematic because there are a lot of predicates that need to be recomputed any time you change x or y.

@benanne That's basically what I recommended to the TF guys. But you definitely want to avoid dimshuffles between every op. But I guess your point is that the graph optimizer should be smart enough to eliminate dimshuffles that cancel each other out.

@andravin: I've wanted to put together an API but all of my time is devoted to writing kernels and just when I start to think things have stabilized enough to do this, someone comes along and asks you to implement some new fancy algorithm :)

scott-gray commented 8 years ago

To elaborate on NHWC in fprop/bprop, the 2 image load warps are making 32*36 loads to distinct addresses, only 2 channels deep. That's way more than can fit in L1 so you end up fetching the same transaction 4 times and saturating both L2 and DDR traffic.

andravin commented 8 years ago

A standard API for deep learning primitives would also mean that frameworks would be able to support any GPU or hardware platform that implements the API. The fact that none of us are even thinking about that is another symptom of our dangerous monoculture.

scott-gray commented 8 years ago

An API has definitely been on my mind... I just wanted to finish a complete set of kernels first. The only problem is that I keep changing the definition of complete. Anyway, I need to get some sleep. It's been nice chatting with you guys.

jdemouth commented 8 years ago

Indeed, it was great chatting with all of you. Thanks.

scott-gray commented 8 years ago

Oh, and another interesting constraint is batch norm. Reducing HWN is rather straightforward and fast with CHWN. It's just a reshape(C,-1).sum(axis=1). NCHW isn't too bad (but probably annoying). NHWC is a bit trickier to optimize, as axis=0 reductions lead to expensive strided memory access patterns if done naively.
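
(A small numpy comparison of that per-channel reduction in each layout; the CHWN line is the reshape(C,-1).sum(axis=1) mentioned above, and the NHWC line is the naive strided reduction.)

```python
import numpy as np

C, H, W, N = 64, 28, 28, 32
x_chwn = np.random.randn(C, H, W, N).astype(np.float32)
x_nchw = x_chwn.transpose(3, 0, 1, 2)
x_nhwc = x_chwn.transpose(3, 1, 2, 0)

# Per-channel sums for the batch norm statistics:
s_chwn = x_chwn.reshape(C, -1).sum(axis=1)   # contiguous HWN block per channel
s_nchw = x_nchw.sum(axis=(0, 2, 3))          # C sits "in the middle"
s_nhwc = x_nhwc.reshape(-1, C).sum(axis=0)   # strided per-channel access

assert np.allclose(s_chwn, s_nchw, atol=1e-2) and np.allclose(s_chwn, s_nhwc, atol=1e-2)
```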

Another interesting point on the Nervana kernels is that they all have the "mean" component of batch norm optionally compounded directly inside of the conv kernel at no cost. Currently this is done with atomics, but I have a deterministic approach I want to switch to that should be just as fast. Many other common operations can be compounded inside of gemm/conv kernels. Incidentally, all the kernels can now be run in full deterministic mode with virtually no change in performance.

Anyway, I'm looking forward to Soumith's new set of benchmarks. There's a ton of optimizations that we've made in neon that the current set just doesn't expose.

scott-gray commented 8 years ago

Oh, and another thought. I recently wrote a very fast generalized dimshuffle (src) routine for neon that implements the full numpy.transpose spec. So if there is some custom kernel you want to write that is more natural in one format over another, it's now easy to get that. And so long as you're not doing it on every layer, there would be negligible impact on speed. For example, ROI pooling for RCNN networks is far easier to implement with NHWC. But even if you are using it a lot, it's about as fast as an fprop_relu op.

ozabluda commented 8 years ago

@scott-gray: Awesome, as always! What do you consider a "fully fused kernel", compared to a "partially fused kernel", and what are its real advantages? A guarantee that the data is available in L1, or something else?

scott-gray commented 8 years ago

Right now the input transforms are handled with external cuda kernels. Then the batched gemm kernel is able to fit all 36 tiles and do the output transform in place (fused). I also now have an fprop/bprop kernel that does all transforms internally to one kernel. This one performs well when HW is large (high external transform costs), but there are warp scheduling issues that I'm not sure I'm going to be able to work around to make it faster overall. If the fused kernel weren't having IPC issues it would have a huge advantage in L1/L2 utilization. The transforms expand image and delta by 9/4 and filters by 4. Plus you lose the overlap in image tiles after the transform.
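
(The 9/4 and 4x expansion factors follow directly from the F(4x4,3x3) tile sizes; a quick back-of-the-envelope check:)

```python
# F(4x4,3x3): each 4x4 output tile needs a 6x6 input tile (4 + 3 - 1 = 6),
# and the 3x3 filter is also transformed to 6x6.
out_tile, filt = 4, 3
in_tile = out_tile + filt - 1           # 6
print(in_tile**2 / out_tile**2)         # image/delta expansion: 36/16 = 2.25 = 9/4
print(in_tile**2 / filt**2)             # filter expansion: 36/9 = 4.0
```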

The main advantage to the external transform is the ability to combine it with a transpose. It's not possible to efficiently do transposes in place with a batched gemm kernel that consumes all your shared memory. So in CHWN, the update operation will probably always have to use external transforms.

ozabluda commented 8 years ago

Is "input/output transform" what the paper calls "data/inverse transform" or something else? For F(4x4, 3x3) those transforms do expand image and delta by 9/4 and filters by 4.

BTW, what is the breakdown of utilization for fprop + delta + update? From your comments in this thread and earlier, my best guess is (~300%+~200%+~100%)/3~=200%.

scott-gray commented 8 years ago

For input/output I mean the transforms that need to be applied to the input/output of the batched gemm, respectively. Utilization is high for all 3 operations. For fp32 and N=32 in VGG I get these speedups: fprop: 1.94x, bprop: 1.92x, update: 1.71x

I'll cover all this in more detail in the blog update.

ozabluda commented 8 years ago

Thank you. Eagerly waiting for the blog update. I got confused by your earlier prediction of 3x on fprop with fp32 F(4x4, 3x3) and by the following quote, which I had misinterpreted. Slow due to impossible fusion doesn't mean 1x, but "only" 1.71x:

On the weight update side, fusion probably isn't possible due to the extremely strided memory access pattern required and no shared memory left for mitigating that. But 2 out of 3 fast operations isn't bad.

jdemouth commented 8 years ago

Those speedups are already awesome. As soon as I'm done with a few other things, I'll work on the integration in cuDNN. ;)

scott-gray commented 8 years ago

I'm guessing these fprop/bprop kernels won't be as fast in practice with the NCHW layout. Frequently your HW dim is only moderate or small in size. To get good contiguous memory access on the external transform you'll need to include multiple points of H and/or W in the 32 points of the outer product tile. This will give you a larger effective tile size and hence the potential for more zero overlap. The amount of zero overlap is one of the biggest factors in determining the speedup.

In CHWN or NHWC you can load a single point of HW with maximal DDR efficiency, thereby minimizing your effective tile size.

With small N you are forced to use multiple points of HW just to fill the gemm tile and I think all formats are about the same, perhaps with CHWN being slightly faster due to having potentially less overfetch and better cache utilization.

In the update operation you can skip over bad points of HW while reducing them, and hence it performs more consistently with small N. On the transform they're all probably just as fast... but again, NCHW has overfetch potential.

It's basically just a bad idea to have your inner contiguous dimension possibly be an odd (or non-power-of-2) number. This is going to make fp16x2 rather difficult to implement on Pascal, and I predict cuDNN will be forced to switch at that point. Both CHWN and NHWC are good options, with CHWN being a bit faster for Winograd and NHWC being slightly better for direct conv. CHWN will be faster for inference work.

jdemouth commented 8 years ago

To say the least NCHW has some disadvantages ;)

bhack commented 8 years ago

@scott-gray @andravin A vendor-neutral API would really be a game changer. In the Vulkan era, do you think it will be enough to target SPIR-V? /cc @naibaf7 @keryell

jdemouth commented 8 years ago

@bhack Both Scott and we target our (NVIDIA) GPU architecture using assembly for those very specific tasks (the rest of the code is written in high-level, easy-to-use CUDA). So far, compiler-generated code has not reached the level of performance achieved by ninja programmers writing assembly directly. The compiler has a hard time doing perfect register allocation and instruction scheduling.

bhack commented 8 years ago

@jdemouth Yes, I know that everyone is working at the assembly level, a level that in the AMD world we could call GCN. But I had hoped that a common target like SPIR-V would change something on the compiler-optimization side. What about targeting LLVM? AMD has started to release an interesting GCN 3.0 LLVM backend.

jdemouth commented 8 years ago

Most of the optimization work is done at the level covered by the compiler backend (register allocation and instruction scheduling). It comes after IR generation, so most (all?) of the problem remains identical. The techniques we use seem very hard (impossible?) to implement in a compiler backend.

bhack commented 8 years ago

@LunarG has experimented heavily with a two-step IR, though for shaders. They can probably give us some feedback on backend limits based on the LunarGlass experience.

jdemouth commented 8 years ago

Don't get me wrong. I'm not saying compilers do a bad job in general. Most of the time they do an awesome job. However, fast convolutions are extremely complicated, and some of the techniques are even hard to express in a high-level language.

bhack commented 8 years ago

@jdemouth Yes, I'm only wondering whether a vendor-neutral "lowest common denominator" exists, and what it is. It could still leave hardware vendors a margin to compete while also giving developers interested in a vendor-neutral solution some chance at a unified API.

jdemouth commented 8 years ago

I get your point. So far, I do not see the solution, but a higher-level solution, if interesting for vendor neutrality, would surely help with development time and innovation.

bhack commented 8 years ago

OpenVX was a good example of interface collaboration that involved many stakeholders (Nvidia included), but it completely missed the opportunity to cover deep learning needs in the current release. In the meantime, Google is trying to push an LLVM subgroup for StreamExecutor with "deep learning" canned operations. See the last messages in https://github.com/tensorflow/tensorflow/issues/22 and @henline's bootstrap doc at https://github.com/henline/streamexecutordoc

scott-gray commented 8 years ago

It is simply not possible to develop efficient dense linear algebra kernels with the current intermediate representations available (like PTX). That's not to say that an IR couldn't be developed that would make it possible. Pascal will be largely binary compatible with the Maxwell ISA, but when Volta rolls around I may adapt my assembler to let you target both architectures with one language. Though maybe it's not worth the effort, since new hardware always frees up more resources, making kernel design decisions very different.

I guess the real key would be to have an IR that can target both Nvidia and AMD. I've read through the GCN spec and there's a lot of overlap, but again the differences are big enough to have a large impact on how you would design kernels. But still, having a common language would make development for both targets much easier and perhaps allow some code sharing.

bhack commented 8 years ago

The common IR that targets both NVIDIA and AMD is SPIR-V (and it was co-designed). But it seems that it is not enough to achieve this level of optimization. So if the biggest multi-stakeholder effort on a common IR is not enough, I think it is better to extend standard APIs at a higher level, for example by pushing support for a unified tensor-operations API that fits deep learning needs into the next version of OpenVX.

jdemouth commented 8 years ago

Yes, the high level API approach looks more promising.