soumith / cudnn.torch

Torch-7 FFI bindings for NVIDIA CuDNN
BSD 2-Clause "Simplified" License

Inconsistencies with nn #98

Open szagoruyko opened 8 years ago

szagoruyko commented 8 years ago

Let's track them here:

szagoruyko commented 8 years ago
apaszke commented 8 years ago

@szagoruyko nn.SpatialConvolution now supports :noBias()
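
For reference, a minimal sketch of that call (toy layer sizes, assuming a reasonably recent nn):

```lua
require 'nn'

-- Toy example: a convolution without a bias term. :noBias() removes the
-- bias and gradBias tensors, matching what cudnn.SpatialConvolution offers.
local conv = nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1):noBias()
assert(conv.bias == nil)
```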

szagoruyko commented 8 years ago

@apaszke thanks, updated the comments

adroit91 commented 8 years ago

We have observed that converting with cudnn.convert doesn't work for all modules; for example, cudnn.ClippedReLU doesn't get translated into nn, despite the mention of API compatibility.
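
A minimal sketch of the kind of round trip being described (assuming the usual cudnn.convert(net, dst) call and that cudnn.ClippedReLU takes its ceiling as the first argument; layer sizes are arbitrary):

```lua
require 'cudnn'

-- Build a small cudnn-backed net, then try to convert it back to nn.
local net = nn.Sequential()
net:add(cudnn.SpatialConvolution(3, 16, 3, 3))
net:add(cudnn.ClippedReLU(6))      -- ReLU clipped at 6
cudnn.convert(net, nn)
print(net)  -- the convolution is converted, but cudnn.ClippedReLU remains,
            -- since nn has no module of the same name to map it to
```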

szagoruyko commented 8 years ago

@adroit91 we could convert ClippedReLU to HardTanh. @ibmua it should be easy to implement groups with THNN, just a simple for loop I think? @fmassa
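
A hedged sketch of that mapping (assumes nn.Module:replace is available and that cudnn.ClippedReLU keeps its threshold in a ceiling field; nn.HardTanh(0, ceiling) clamps to the same [0, ceiling] range):

```lua
require 'cudnn'

-- Hypothetical helper: swap every cudnn.ClippedReLU for an equivalent
-- nn.HardTanh(0, ceiling), which produces the same clamped output.
local function clippedReLUToHardTanh(net)
   return net:replace(function(m)
      if torch.typename(m) == 'cudnn.ClippedReLU' then
         return nn.HardTanh(0, m.ceiling)  -- assumes the threshold is stored as m.ceiling
      end
      return m
   end)
end
```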

soumith commented 8 years ago

Hi @ibmua. You misunderstand the purpose of nn.Parallel. It is not parallel compute; it is a container pattern that executes parallel branches. It won't be faster, and it won't use CPU threads...
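
For anyone following along, a minimal sketch of what nn.Parallel does (toy sizes):

```lua
require 'nn'

-- nn.Parallel(inputDim, outputDim): slice the input along inputDim, feed
-- slice i to branch i, and join the branch outputs along outputDim.
-- The branches still run one after another on the same device.
local p = nn.Parallel(1, 1)
p:add(nn.Linear(10, 3))
p:add(nn.Linear(10, 3))
print(p:forward(torch.randn(2, 10)))   -- two 10-dim slices -> one 6-dim output
```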

soumith commented 8 years ago

@ibmua wrt the performance variance as you change the number of groups: CPU/GPU performance is not always linear in the amount of compute. If there is very little work, the GPU is not fully utilized either, which is for example what I suspect is happening in the groups=2 vs groups=4 case :)

ibmua commented 8 years ago

So I've been trying to research grouped convolutions, but I just found out today that these guys have already gone deep into this with a lot of hardware: https://arxiv.org/pdf/1605.06489v1.pdf It proves the high importance of groups, especially on a CPU. I'm sure they'd get a comparable actual speedup on GPUs; my guess is that it's not that large only because cuDNN's implementation of them is poor. I've made a plain non-Winograd kernel for a fully-grouped forward pass that was ~20x faster than cuDNN v5.1, at least on many of the tests I've tried. cuDNN is just not optimized for groups, especially large ones. My bet is that their actual CPU speedup is also modest compared to what's possible.

I wanted to write the kernel and the surrounding code for Torch, but the data structure is almost completely undocumented, and from the other source code I can't figure out what's being done. The code is a define on top of a define, and where anything is defined is completely unclear; the whole thing is a mess that's impossible to comprehend.

Edit: Oh, so I've looked at the code https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua and I see you're actually simulating grouped convolutions by launching kernels consecutively; they're not actually part of cuDNN. That explains it.
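
Roughly, that simulation amounts to the following (a conceptual sketch, not the actual cudnn.torch code; plain nn modules, single image without a batch dimension):

```lua
require 'nn'

-- Grouped convolution as a loop over ordinary convolutions: each group sees
-- nInput/groups input channels, produces nOutput/groups output channels, and
-- the per-group results are joined back along the channel dimension. A
-- THNN/cuDNN for-loop implementation boils down to the same slicing.
local function groupedConv(nInput, nOutput, kW, kH, groups)
   local concat = nn.Concat(1)        -- channel dim is 1 without a batch dim
   for g = 1, groups do
      local branch = nn.Sequential()
      branch:add(nn.Narrow(1, (g - 1) * nInput / groups + 1, nInput / groups))
      branch:add(nn.SpatialConvolution(nInput / groups, nOutput / groups, kW, kH))
      concat:add(branch)
   end
   return concat
end

-- e.g. groupedConv(64, 64, 3, 3, 32):forward(torch.randn(64, 56, 56))
```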

I wonder what sense it makes for NVIDIA to close-source cuDNN. I can't see any sanity in that.

Edit2: Interesting to note that grouped convs are also a volumetric local pooling.

ibmua commented 7 years ago

Okay, now that ResNeXt is out https://arxiv.org/pdf/1611.05431.pdf , I hope I'm not the only one who understands the importance of native grouped convolutions here? Groups are exactly the only thing added compared to the older ResNet. And it's not groups=2 or groups=4, it's groups=32 and the like. The current codebase is totally unsuitable for that.
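
For concreteness, a sketch of the kind of block ResNeXt builds on, written with the groups argument of cudnn.SpatialConvolution (which, per the above, currently expands into a per-group loop). Channel sizes follow the paper's 32x4d block; batch norm, ReLU, and the shortcut are omitted:

```lua
require 'cudnn'

-- ResNeXt-style bottleneck body: 1x1 reduce, 3x3 grouped conv with
-- cardinality 32, 1x1 expand.
local function resnextBody(nIn)
   local body = nn.Sequential()
   body:add(cudnn.SpatialConvolution(nIn, 128, 1, 1))
   body:add(cudnn.SpatialConvolution(128, 128, 3, 3, 1, 1, 1, 1, 32))  -- groups = 32
   body:add(cudnn.SpatialConvolution(128, 256, 1, 1))
   return body
end
```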

ibmua commented 7 years ago

Quoting https://arxiv.org/pdf/1611.05431.pdf :

> Performance. For simplicity we use Torch's built-in grouped convolution implementation, without special optimization. We note that this implementation was brute-force and not parallelization-friendly. On 8 GPUs of NVIDIA M40, training 32×4d ResNeXt-101 in Table 3 takes 0.95s per mini-batch, vs. 0.70s of ResNet-101 baseline that has similar FLOPs. We argue that this is a reasonable overhead. We expect carefully engineered lower-level implementation (e.g., in CUDA) will reduce this overhead. We also expect that the inference time on CPUs will present less overhead. Training the 2×complexity model (64×4d ResNeXt-101) takes 1.7s per mini-batch and 10 days total on 8 GPUs.

A definite knock on your door.

It really flatters me that I've been researching the very same concept as Kaiming & co. throughout Aug-Sept. I couldn't come up with an optimal structure though, while I tried plenty, partly because I don't have such a ton of hardware, and partly because there's no framework with an adequate implementation of grouped convs that would let me try out different things on my much more limited hardware - 2 GPUs in total (I tried CIFAR only, of course; no way I could run ImageNet). =( I wonder how many failed attempts with slightly different structures they had along the way. =)

And I really wonder why they didn't hire someone like Scott Gray to implement grouped convs, which would probably have cost less than the additional processing power did. I wanted to implement the thing myself at the CUDA level, without the Winograd optimization, and even learned CUDA for that very purpose, but all of the existing frameworks turned out to be too opaque for me to integrate any code into. Also, as I recall, some were probably not very CUDA-friendly in terms of how the data was laid out; I think Torch was one of those. Scott can probably overcome that problem - I recall he wanted to write some fast kernels for very small batches, which might share a common solution with these problems of fetching data from GPU RAM.

ibmua commented 7 years ago

Actually, taking a closer look, Kaiming's paper doesn't have a lot of novelty vs https://arxiv.org/pdf/1605.06489v1.pdf which I've already linked to; basically it's a follow-up to that study, more of a confirmation on the subject. I'm guessing there's quite some room for improvement; I'm very unsure whether his 1->3->1 blocks are actually optimal, since 1x1 convs are extremely hungry for GPU RAM throughput, and having more channels also consumes a lot more memory. While I was researching this very same thing I considered those implications and was quite discouraged for exactly those reasons. Kaiming, on the other hand, largely ignores that issue in his paper, as well as the fact that what he's comparing is in fact a wider ResNet with groups against a narrower one without. Not his best paper, IMHO. But it still proves the point that fast grouped convs are completely necessary.

ibmua commented 7 years ago

NVidia said they're planning to release some implementation of groups in their next CuDNN.

ibmua commented 7 years ago

Grouped convs are now available in cuDNN v7: https://developer.nvidia.com/cudnn