Open shrubb opened 7 years ago
@shrubb can you post the script you used for benchmarking?
@killeent here it is. I hope it's just to check whether my benchmarking is sane. The script's dirty; I only use it for occasional checks.
I don't claim to bring in some super-optimal implementation; there's actually little new in this PR, just a better stub.
For anyone working on this, my previous attempts to employ "smarter" cublas usage are in this branch.
cc: @ajtulloch @ngimel
there's also https://github.com/szagoruyko/pyinn/blob/master/pyinn/conv2d_depthwise.py which is based on the Caffe code. It's a specialized conv. I presume you already benchmarked something along these lines.
One way to improve it is to add template specializations for the most common kH/kW/stride/dilation combinations (e.g. IMO it's worth adding a template specialization for 3x3s1 and re-benchmarking MobileNet/ShuffleNet).
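For illustration, the kind of specialization being suggested might look roughly like the sketch below. This is a minimal sketch under my own assumptions (NCHW float tensors, depth multiplier 1, hypothetical names), not code from this PR; the point is that compile-time kernel geometry lets the compiler fully unroll the tap loops.

```cuda
// Sketch only: depthwise forward kernel with compile-time geometry.
// One thread per output element; KH/KW/SH/SW are template parameters,
// so a 3x3 stride-1 instantiation gets fully unrolled inner loops.
template <int KH, int KW, int SH, int SW>
__global__ void depthwiseForward(const float* input, const float* weight,
                                 float* output,
                                 int N, int C, int inH, int inW,
                                 int outH, int outW, int padH, int padW) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= N * C * outH * outW) return;

  // Decompose the flat index into (n, c, oh, ow) for NCHW output.
  int ow = idx % outW;
  int oh = (idx / outW) % outH;
  int c  = (idx / (outW * outH)) % C;
  int n  = idx / (outW * outH * C);

  const float* w  = weight + c * KH * KW;            // weight: C x 1 x KH x KW
  const float* in = input + (n * C + c) * inH * inW; // this channel's plane

  float sum = 0.f;
  #pragma unroll
  for (int kh = 0; kh < KH; ++kh) {
    #pragma unroll
    for (int kw = 0; kw < KW; ++kw) {
      int ih = oh * SH - padH + kh;
      int iw = ow * SW - padW + kw;
      if (ih >= 0 && ih < inH && iw >= 0 && iw < inW)
        sum += in[ih * inW + iw] * w[kh * KW + kw];
    }
  }
  output[idx] = sum;
}
```

On the host side one would dispatch `depthwiseForward<3, 3, 1, 1>` when the runtime parameters match and fall back to a generic kernel otherwise.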
@soumith the pyinn implementation is about the same as what Egor did: simple for loops and a sum for the grad w.r.t. weight.
@ajtulloch already tried this; it gives just a negligible improvement:
Again, I believe this is NOT practical either. The code for weight gradients is also super simple and could easily be accelerated as well. Benchmarking it for fun.
@shrubb how much did you specialize in that benchmark? I.e. ideally you'd have a specialization for statically-known kH, kW, stride, dilation, plus separate paths in the kernel for definitely-inbounds/possibly-out-of-bounds?
@ajtulloch I hardcoded (kH, kW, padH, padW, strideH, strideW) = (3, 3, 1, 1, 1, 1). Dilation is also fixed at (1, 1) in both kernels. I don't see why tracking inbound/outbound pixels would matter; memory reads are mostly contiguous anyway. To me, all of this seems insignificant compared to the load of the main arithmetic routines.
Anyway, once again, I don't understand what all this benchmarking and fighting for another 0.001 seconds is for. Nobody is going to use this. Everyone's on PyTorch, and NVIDIA is likely to release this kind of conv in cuDNN sooner or later.
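For reference, the inbounds/outbounds split @ajtulloch is referring to would look roughly like the device helper below. This is my own sketch under assumed names, a 3x3 window, and dilation 1; the open question in this thread is whether skipping the per-tap bounds check in the interior actually buys anything.

```cuda
// Sketch only: accumulate one 3x3 depthwise window, with a fast path
// when the whole window is known to lie inside the image.
__device__ float accumulate3x3(const float* in, const float* w,
                               int inH, int inW, int ih0, int iw0) {
  float sum = 0.f;
  bool inbounds = (ih0 >= 0 && iw0 >= 0 && ih0 + 3 <= inH && iw0 + 3 <= inW);
  if (inbounds) {
    // Interior: no per-tap bounds checks, pure multiply-accumulate.
    #pragma unroll
    for (int kh = 0; kh < 3; ++kh) {
      #pragma unroll
      for (int kw = 0; kw < 3; ++kw)
        sum += in[(ih0 + kh) * inW + (iw0 + kw)] * w[kh * 3 + kw];
    }
  } else {
    // Border: check every tap against the image bounds.
    for (int kh = 0; kh < 3; ++kh)
      for (int kw = 0; kw < 3; ++kw) {
        int ih = ih0 + kh, iw = iw0 + kw;
        if (ih >= 0 && ih < inH && iw >= 0 && iw < inW)
          sum += in[ih * inW + iw] * w[kh * 3 + kw];
      }
  }
  return sum;
}
```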
PyTorch shares a backend with Torch, so this could be exposed in PyTorch. I don't know how it compares with pyinn in terms of performance, though. NVIDIA won't release it tomorrow, and people have been wanting to use depthwise separable convolutions for years, so anything helps.
@ngimel it looks like cuDNN 7 supports grouped convolutions; would it be slower than such an implementation?
@szagoruyko it seems that your pyinn kernels for depthwise convolutions are better than cuDNN 7 with grouped convolutions. But I'll let @ngimel comment more on that.
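For context, mapping a depthwise convolution onto cuDNN 7's grouped convolution amounts to setting the group count equal to the channel count, so the filter descriptor's per-group input-channel dimension becomes 1. A rough sketch of the descriptor setup, assuming a depth multiplier of 1, NCHW float tensors, and with error checking and the rest of the cuDNN plumbing omitted:

```cuda
#include <cudnn.h>

// Sketch only: configure cuDNN 7 descriptors so a grouped convolution
// behaves as a depthwise convolution (one group per input channel).
void setupDepthwiseAsGroupedConv(cudnnConvolutionDescriptor_t convDesc,
                                 cudnnFilterDescriptor_t filterDesc,
                                 int nInputPlane, int kH, int kW,
                                 int padH, int padW, int dH, int dW) {
  cudnnSetConvolution2dDescriptor(convDesc, padH, padW, dH, dW,
                                  /*dilationH=*/1, /*dilationW=*/1,
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
  // groups == nInputPlane is what makes the grouped conv depthwise.
  cudnnSetConvolutionGroupCount(convDesc, nInputPlane);
  // Filter layout is (out, in / groups, kH, kW) = (nInputPlane, 1, kH, kW).
  cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                             nInputPlane, 1, kH, kW);
}
```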
The current stub implementation is totally impractical; the fastest GPU depthwise conv for Torch was cuDNN's grouped conv with `self.groups == self.nInputPlane`. Still, even with cuDNN, Google's MobileNets, for example, would run 2 times slower than ResNet-34.

I've tried lots and lots of methods to efficiently reduce this stuff to cublas routines: using `gemmBatched`, grouping channels for heavier `gemm` loads, etc. Unfortunately, I've only managed to roughly reach cuDNN's performance in the backward pass and get a 1.5x speedup in the MobileNet forward pass.

Surprisingly, the fastest option by far turned out to be... the super dumb `for` loop. Here it is (with a bit smarter option for `accGradParams`, though). The forward/backward passes are now at least 45x/8x faster than the original implementation, respectively. Default MobileNet's inference enjoys a 3.57x speedup over the cuDNN case on Maxwell and 5.18x on Pascal.

Tested all the outputs and gradients with a large batch size & `nInputPlane` and various `nOutputPlane`.

Although the `weight` shape is `(nOutputPlane) x (nInputPlane) x (kH) x (kW)`, which perfectly corresponds to the cuDNN bindings, I didn't like it much since the weight tensor needs to be transposed back and forth whenever you need almost any kind of matmul/matvec. I don't know if it's critical, but I left it as is just to be safe.
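As a rough illustration of what a "smarter" `accGradParams` path could look like, here is my own sketch (not the PR's code), assuming NCHW float tensors, a depth multiplier of 1, and a launch with one 256-thread block per weight element:

```cuda
// Sketch only: weight-gradient kernel for depthwise convolution.
// Grid: C * kH * kW blocks (one per weight element); block: 256 threads.
// Each thread strides over (n, oh, ow) positions, then the block reduces
// its partial sums in shared memory and writes a single value.
__global__ void depthwiseAccGradWeight(const float* input,
                                       const float* gradOutput,
                                       float* gradWeight,
                                       int N, int C, int inH, int inW,
                                       int outH, int outW,
                                       int kH, int kW, int padH, int padW,
                                       int dH, int dW) {
  int kw = blockIdx.x % kW;
  int kh = (blockIdx.x / kW) % kH;
  int c  = blockIdx.x / (kW * kH);

  float partial = 0.f;
  for (int i = threadIdx.x; i < N * outH * outW; i += blockDim.x) {
    int ow = i % outW;
    int oh = (i / outW) % outH;
    int n  = i / (outW * outH);
    int ih = oh * dH - padH + kh;
    int iw = ow * dW - padW + kw;
    if (ih >= 0 && ih < inH && iw >= 0 && iw < inW)
      partial += input[((n * C + c) * inH + ih) * inW + iw] *
                 gradOutput[((n * C + c) * outH + oh) * outW + ow];
  }

  // Block-wide tree reduction in shared memory (assumes blockDim.x == 256).
  __shared__ float buf[256];
  buf[threadIdx.x] = partial;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) gradWeight[blockIdx.x] = buf[0];
}
```

Whether a block-level reduction like this actually beats the naive per-element loop (or plain atomics) would of course need benchmarking on the same MobileNet setup.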