szagoruyko opened 8 years ago

`nn.SpatialBatchNormalization` has `running_var`, while `cudnn.SpatialBatchNormalization` has `running_std`.
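For reference, a small sketch of how the two buffers could be translated into one another. This assumes (not confirmed in this thread) that `running_std` stores the *inverted* standard deviation `1/sqrt(var + eps)`, as older torch `nn` code did; the helper names are hypothetical:

```python
import math

# Assumption: running_std = 1/sqrt(running_var + eps).
# If that convention holds, the two buffers are interconvertible.

def running_std_to_var(running_std, eps=1e-5):
    # invert running_std = 1/sqrt(var + eps)  =>  var = 1/std^2 - eps
    return 1.0 / (running_std ** 2) - eps

def running_var_to_std(running_var, eps=1e-5):
    return 1.0 / math.sqrt(running_var + eps)
```

Under that assumption a converter would only need to rename the buffer and apply one of these functions elementwise.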
@szagoruyko `nn.SpatialConvolution` now supports `:noBias()`
@apaszke thanks, updated the comments
We have observed that `cudnn.convert` doesn't work for all modules; for example, `cudnn.ClippedReLU` doesn't get translated into `nn`, despite the mention of API compatibility.
@adroit91 we could convert `ClippedReLU` to `HardTanh`.

@ibmua it should be easy to implement groups with THNN, a simple for loop I think? @fmassa
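The "simple for loop" over groups can be sketched in a framework-agnostic way (NumPy here, naive loops, no padding or stride; the function names are mine): both the input channels and the filters are split into `groups` slices, and an ordinary convolution runs independently on each slice.

```python
import numpy as np

def conv2d(x, w):
    # naive valid cross-correlation
    # x: (Cin, H, W), w: (Cout, Cin, kh, kw) -> (Cout, H-kh+1, W-kw+1)
    Cout, Cin, kh, kw = w.shape
    _, H, W = x.shape
    out = np.zeros((Cout, H - kh + 1, W - kw + 1))
    for o in range(Cout):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[o])
    return out

def grouped_conv2d(x, w, groups):
    # split channels and filters into `groups` slices and convolve each
    # slice separately -- the "simple for loop" over groups
    # w: (Cout, Cin // groups, kh, kw)
    xg = np.split(x, groups, axis=0)
    wg = np.split(w, groups, axis=0)
    return np.concatenate([conv2d(xi, wi) for xi, wi in zip(xg, wg)], axis=0)
```

With `groups=1` this degenerates to an ordinary convolution; with `groups` equal to the channel count it becomes a depthwise (fully-grouped) convolution. A THNN-backed version would call the existing convolution primitive once per slice in the same way.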
Hi @ibmua. You misunderstand the purpose of `nn.Parallel`. It is not parallel compute; it is a container pattern that executes parallel branches. It won't be faster, and it doesn't use CPU threads...
@ibmua wrt the performance variance as you change the number of groups: CPU/GPU performance is not always linear in the amount of compute. If there is very little work, the GPU compute is not fully utilized, which is for example what I suspect is happening in the groups=2 vs groups=4 case :)
So I've been trying to research the grouped-convolution theme, but I just found out today that these guys have already gone deep into it with a lot of hardware: https://arxiv.org/pdf/1605.06489v1.pdf. It demonstrates the high importance of groups, especially on CPU. I'm sure they'd get a comparable speedup on GPUs; my guess is that it isn't larger only because cuDNN's implementation of groups is poor. I've written a plain non-Winograd kernel for a fully-grouped forward pass that was ~20x faster than cuDNN v5.1, at least on many tests I've tried. cuDNN is just not optimized for groups, especially large ones. My bet is that their CPU speedup is also modest compared to what's possible.
I wanted to write the kernel and supporting code for Torch, but the data structure is almost completely undocumented, and from the source code I can't figure out what's being done. The code is a define upon a define, and where anything is defined is itself undefined; the whole thing is a mess that's impossible to comprehend.
Edit: OK, I've now looked at the code (https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua), and I see you're actually simulating grouped convolutions by launching kernels consecutively; they're not actually part of cuDNN. That explains it.
I wonder what sense it makes for NVIDIA to close-source cuDNN. I can't see any sanity in that.
Edit 2: Interesting to note that grouped convs are also a form of volumetric local pooling.
Okay, now that ResNeXt is out (https://arxiv.org/pdf/1611.05431.pdf), I hope I'm not the only one who understands the importance of native grouped convolutions here. Groups are exactly the only thing added versus the older ResNet, and it's not groups=2 or groups=4, it's groups=32 and the like. The current codebase is totally unsuitable for this.
From https://arxiv.org/pdf/1611.05431.pdf:

> **Performance.** For simplicity we use Torch's built-in grouped convolution implementation, without special optimization. We note that this implementation was brute-force and not parallelization-friendly. On 8 GPUs of NVIDIA M40, training 32×4d ResNeXt-101 in Table 3 takes 0.95s per mini-batch, vs. 0.70s of ResNet-101 baseline that has similar FLOPs. We argue that this is a reasonable overhead. We expect carefully engineered lower-level implementation (e.g., in CUDA) will reduce this overhead. We also expect that the inference time on CPUs will present less overhead. Training the 2×complexity model (64×4d ResNeXt-101) takes 1.7s per mini-batch and 10 days total on 8 GPUs.
A definite knock on your door.
It really flatters me that I was researching the very same concept as Kaiming & co. throughout Aug-Sept. I couldn't come up with the optimal structure though, despite trying plenty of them, partly because I don't have that kind of hardware, and partly because there is no framework with an adequate implementation of grouped convs that would let me try different variants on my comparatively limited hardware: 2 GPUs in total (CIFAR only, of course; there's no way I could run ImageNet). =( I wonder how many failed attempts with slightly different structures they had along the way. =) And I totally wonder why they didn't hire someone like Scott Gray to implement grouped convs, which would probably have cost less than the extra processing power did. I wanted to implement the thing myself at the CUDA level, without Winograd optimization, and even learned CUDA for that very purpose, but all of the existing frameworks turned out too opaque for me to integrate any code into. Also, as I recall, some were not very CUDA-friendly in terms of how their data is laid out; I think Torch was one of those. Scott could probably overcome that problem: I recall he wanted to write fast kernels for very small batches, which might share a common solution with these problems of fetching data from the GPU's RAM.
Actually, taking a closer look, Kaiming's paper doesn't have a lot of novelty vs https://arxiv.org/pdf/1605.06489v1.pdf, which I've already linked to; it's basically a follow-up on that study, more of a confirmation of the subject. I'm guessing there's still quite some room for improvement. I'm very unsure the 1->3->1 blocks are actually optimal, since 1x1 convs are extremely hungry for GPU RAM throughput, and having more channels also consumes a lot more memory. While I was researching this very same thing I considered those implications and was quite discouraged by them myself. Kaiming, on the other hand, largely ignores that issue in the paper, as well as the fact that what he's comparing is a wider ResNet with groups against a narrower one without. Not his best paper, IMHO. Still, it proves the point that fast grouped convs are completely necessary.
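For what it's worth, the "wider with groups vs narrower without" comparison is parameter-matched: plugging the block widths from the ResNeXt paper's Table 1 into a quick count (bias terms ignored; a sketch, not from the paper's code) shows the two bottlenecks cost roughly the same:

```python
# Parameter counts for the two bottleneck blocks compared in the ResNeXt
# paper (Table 1): a ResNet block (256 -> 64 -> 64 -> 256) vs a 32x4d
# ResNeXt block (256 -> 128 -> 128 grouped by 32 -> 256).

def conv_params(c_in, c_out, k, groups=1):
    # a grouped convolution only connects channels within a group,
    # dividing the parameter count by `groups`
    return c_out * (c_in // groups) * k * k

resnet = (conv_params(256, 64, 1)
          + conv_params(64, 64, 3)
          + conv_params(64, 256, 1))    # 69632

resnext = (conv_params(256, 128, 1)
           + conv_params(128, 128, 3, groups=32)
           + conv_params(128, 256, 1))  # 70144
```

The grouped 3x3 is 8x cheaper than a dense one at the same width, which is exactly what pays for doubling the width; memory traffic for the wider activations, as noted above, is not captured by this count.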
NVIDIA said they're planning to release an implementation of groups in the next cuDNN.
Grouped convs are now available in cuDNN v7: https://developer.nvidia.com/cudnn
Let's track them here:

- `nn.SpatialLogSoftMax`, should be addressed by https://github.com/torch/nn/pull/560
- `nn.SpatialCrossEntropyCriterion` (and the cudnn test is broken)
- `nn.TemporalConvolution` does not have `padH` support, and the current implementation of `cudnn.TemporalConvolution` needs modifications to support `cudnn.convert` in R4
- `nn.SpatialBatchNormalization` does not support 5D inputs in R4
- `nn.SpatialConvolution` and `cudnn.SpatialConvolution` in R3 do not support `noBias()` (will cause an error on conversion)
- `nn.SpatialConvolution` does not support groups (will cause an error on `cudnn.convert` cudnn -> nn)