soumith / cudnn.torch

Torch-7 FFI bindings for NVIDIA CuDNN
BSD 2-Clause "Simplified" License

Inconsistencies with nn #98

Open szagoruyko opened 8 years ago

szagoruyko commented 8 years ago

Let's track them here:

szagoruyko commented 8 years ago
apaszke commented 8 years ago

@szagoruyko nn.SpatialConvolution now supports :noBias()
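
For reference, a minimal sketch of that call (toy layer sizes, assuming a reasonably recent nn):

```lua
require 'nn'

-- Toy example: a convolution without a bias term. :noBias() removes the
-- bias and gradBias tensors, matching what cudnn.SpatialConvolution offers.
local conv = nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1):noBias()
assert(conv.bias == nil)
```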

szagoruyko commented 8 years ago

@apaszke thanks, updated the comments

adroit91 commented 8 years ago

We have observed that converting with cudnn.convert doesn't work for all modules; for example, cudnn.ClippedReLU doesn't get translated into nn, despite the mention of API compatibility.
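
A minimal sketch of the kind of round trip being described (assuming the usual cudnn.convert(net, dst) call and that cudnn.ClippedReLU takes its ceiling as the first argument; layer sizes are arbitrary):

```lua
require 'cudnn'

-- Build a small cudnn-backed net, then try to convert it back to nn.
local net = nn.Sequential()
net:add(cudnn.SpatialConvolution(3, 16, 3, 3))
net:add(cudnn.ClippedReLU(6))      -- ReLU clipped at 6
cudnn.convert(net, nn)
print(net)  -- the convolution is converted, but cudnn.ClippedReLU remains,
            -- since nn has no module of the same name to map it to
```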

szagoruyko commented 8 years ago

@adroit91 we could convert ClippedReLU to HardTanh. @ibmua it should be easy to implement groups with THNN, just a simple for loop I think? @fmassa
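
A hedged sketch of that mapping (assumes nn.Module:replace is available and that cudnn.ClippedReLU keeps its threshold in a ceiling field; nn.HardTanh(0, ceiling) clamps to the same [0, ceiling] range):

```lua
require 'cudnn'

-- Hypothetical helper: swap every cudnn.ClippedReLU for an equivalent
-- nn.HardTanh(0, ceiling), which produces the same clamped output.
local function clippedReLUToHardTanh(net)
   return net:replace(function(m)
      if torch.typename(m) == 'cudnn.ClippedReLU' then
         return nn.HardTanh(0, m.ceiling)  -- assumes the threshold is stored as m.ceiling
      end
      return m
   end)
end
```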

soumith commented 8 years ago

Hi @ibmua. You misunderstand the purpose of nn.Parallel. It is not parallel compute; it is a container pattern that executes parallel branches. It won't be faster, and it won't use CPU threads...
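
For anyone following along, a minimal sketch of what nn.Parallel does (toy sizes):

```lua
require 'nn'

-- nn.Parallel(inputDim, outputDim): slice the input along inputDim, feed
-- slice i to branch i, and join the branch outputs along outputDim.
-- The branches still run one after another on the same device.
local p = nn.Parallel(1, 1)
p:add(nn.Linear(10, 3))
p:add(nn.Linear(10, 3))
print(p:forward(torch.randn(2, 10)))   -- two 10-dim slices -> one 6-dim output
```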

soumith commented 8 years ago

@ibmua wrt the performance variance as you change the number of groups: CPU/GPU performance is not always linear in the amount of compute. If there is very little work, the GPU is not fully utilized either, which is for example what I suspect is happening in the groups=2 vs groups=4 case :)

ibmua commented 8 years ago

So I've been trying to research grouped convolutions, but I just found out today that these guys have already gone deep into this with a lot of hardware: https://arxiv.org/pdf/1605.06489v1.pdf It proves the high importance of groups, especially on a CPU. I'm sure they'd get a comparable actual speedup on GPUs; my guess is that it's not that large only because cuDNN's implementation of them is poor. I've made a plain non-Winograd kernel for a fully-grouped forward pass that was ~20x faster than cuDNN v5.1, at least on many of the tests I've tried. cuDNN is just not optimized for groups, especially large ones. My bet is that their actual CPU speedup is also modest compared to what's possible.

I wanted to write the kernel and the surrounding code for Torch, but the data structure is almost completely undocumented, and from the other source code I can't figure out what's being done. The code is a define on top of a define, and where anything is defined is completely unclear; the whole thing is a mess that's impossible to comprehend.

Edit: Oh, so I've looked at the code https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua and I see you're actually simulating grouped convolutions by launching kernels consecutively; they're not actually part of cuDNN. That explains it.
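
Roughly, that simulation amounts to the following (a conceptual sketch, not the actual cudnn.torch code; plain nn modules, single image without a batch dimension):

```lua
require 'nn'

-- Grouped convolution as a loop over ordinary convolutions: each group sees
-- nInput/groups input channels, produces nOutput/groups output channels, and
-- the per-group results are joined back along the channel dimension. A
-- THNN/cuDNN for-loop implementation boils down to the same slicing.
local function groupedConv(nInput, nOutput, kW, kH, groups)
   local concat = nn.Concat(1)        -- channel dim is 1 without a batch dim
   for g = 1, groups do
      local branch = nn.Sequential()
      branch:add(nn.Narrow(1, (g - 1) * nInput / groups + 1, nInput / groups))
      branch:add(nn.SpatialConvolution(nInput / groups, nOutput / groups, kW, kH))
      concat:add(branch)
   end
   return concat
end

-- e.g. groupedConv(64, 64, 3, 3, 32):forward(torch.randn(64, 56, 56))
```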

I wonder what sense it makes for NVIDIA to close-source cuDNN. I can't see any sanity in that.

Edit2: Interesting to note that grouped convs are also a volumetric local pooling.

ibmua commented 7 years ago

Okay, now that ResNeXt is out https://arxiv.org/pdf/1611.05431.pdf , I hope I'm not the only one who understands the importance of native grouped convolutions here? Groups are exactly the only thing added compared to the older ResNet. And it's not groups=2 or groups=4, it's groups=32 and the like. The current codebase is totally unsuitable for that.
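
For concreteness, a sketch of the kind of block ResNeXt builds on, written with the groups argument of cudnn.SpatialConvolution (which, per the above, currently expands into a per-group loop). Channel sizes follow the paper's 32x4d block; batch norm, ReLU, and the shortcut are omitted:

```lua
require 'cudnn'

-- ResNeXt-style bottleneck body: 1x1 reduce, 3x3 grouped conv with
-- cardinality 32, 1x1 expand.
local function resnextBody(nIn)
   local body = nn.Sequential()
   body:add(cudnn.SpatialConvolution(nIn, 128, 1, 1))
   body:add(cudnn.SpatialConvolution(128, 128, 3, 3, 1, 1, 1, 1, 32))  -- groups = 32
   body:add(cudnn.SpatialConvolution(128, 256, 1, 1))
   return body
end
```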

ibmua commented 7 years ago

Quoting https://arxiv.org/pdf/1611.05431.pdf :

> Performance. For simplicity we use Torch's built-in grouped convolution implementation, without special optimization. We note that this implementation was brute-force and not parallelization-friendly. On 8 GPUs of NVIDIA M40, training 32×4d ResNeXt-101 in Table 3 takes 0.95s per mini-batch, vs. 0.70s of ResNet-101 baseline that has similar FLOPs. We argue that this is a reasonable overhead. We expect carefully engineered lower-level implementation (e.g., in CUDA) will reduce this overhead. We also expect that the inference time on CPUs will present less overhead. Training the 2×complexity model (64×4d ResNeXt-101) takes 1.7s per mini-batch and 10 days total on 8 GPUs.

A definite knock on your door.

It really flatters me that I've been researching the very same concept as Kaiming & co. throughout Aug-Sept. I couldn't come up with an optimal structure though, while I tried plenty, partly because I don't have such a ton of hardware, and partly because there's no framework with an adequate implementation of grouped convs that would let me try out different things on my much more limited hardware - 2 GPUs in total (I tried CIFAR only, of course; no way I could run ImageNet). =( I wonder how many failed attempts with slightly different structures they had along the way. =)

And I really wonder why they didn't hire someone like Scott Gray to implement grouped convs, which would probably have cost less than the additional processing power did. I wanted to implement the thing myself at the CUDA level, without the Winograd optimization, and even learned CUDA for that very purpose, but all of the existing frameworks turned out to be too opaque for me to integrate any code into. Also, as I recall, some were probably not very CUDA-friendly in terms of how the data was laid out; I think Torch was one of those. Scott can probably overcome that problem - I recall he wanted to write some fast kernels for very small batches, which might share a common solution with these problems of fetching data from GPU RAM.

ibmua commented 7 years ago

Actually, taking a closer look, Kaiming's paper doesn't have a lot of novelty vs https://arxiv.org/pdf/1605.06489v1.pdf which I've already linked to; basically it's a follow-up to that study, more of a confirmation on the subject. I'm guessing there's quite some room for improvement; I'm very unsure whether his 1->3->1 blocks are actually optimal, since 1x1 convs are extremely hungry for GPU RAM throughput, and having more channels also consumes a lot more memory. While I was researching this very same thing I considered those implications and was quite discouraged for exactly those reasons. Kaiming, on the other hand, largely ignores that issue in his paper, as well as the fact that what he's comparing is in fact a wider ResNet with groups against a narrower one without. Not his best paper, IMHO. But it still proves the point that fast grouped convs are completely necessary.

ibmua commented 7 years ago

NVidia said they're planning to release some implementation of groups in their next CuDNN.

ibmua commented 7 years ago

Grouped convs are now available in cuDNN v7: https://developer.nvidia.com/cudnn