torch / cunn


Faster SpatialDepthWiseConvolution #481

Open · shrubb opened this 7 years ago

shrubb commented 7 years ago

The current stub implementation is totally impractical; the fastest GPU depthwise convolution for Torch has been cuDNN's grouped convolution with self.groups == self.nInputPlane. Still, even with cuDNN, Google's MobileNets, for example, run about 2x slower than ResNet-34.

I've tried lots and lots of methods to efficiently reduce this to cuBLAS routines: using gemmBatched, grouping channels into heavier GEMMs, etc. Unfortunately, with those I only managed to roughly match cuDNN's performance in the backward pass and get a 1.5x speedup in the MobileNet forward pass.

Surprisingly, the fastest option by far turned out to be... the super dumb for loop. Here it is (with a slightly smarter option for accGradParameters, though). The forward/backward passes are now at least 45x/8x faster than the original implementation, respectively. Default MobileNet inference gets a 3.57x speedup over the cuDNN case on Maxwell and 5.18x on Pascal.
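For illustration, here is a minimal sketch of this kind of naive per-output-element loop (not the actual kernel from this PR — see the diff for that; the N x C x H x W layout, the C x kH x kW depthwise weight, and all names are assumptions):

```cuda
// Minimal depthwise forward sketch: one thread per output element,
// plain loops over the kernel taps. Illustrative only.
__global__ void depthwiseForwardNaive(
    const float* input, const float* weight, float* output,
    int N, int C, int H, int W,
    int kH, int kW, int outH, int outW,
    int padH, int padW, int strideH, int strideW)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= N * C * outH * outW) return;

  // Decompose the flat index into (n, c, oh, ow).
  int ow = idx % outW;
  int oh = (idx / outW) % outH;
  int c  = (idx / (outW * outH)) % C;
  int n  = idx / (outW * outH * C);

  float sum = 0.f;
  for (int i = 0; i < kH; ++i) {
    for (int j = 0; j < kW; ++j) {
      int ih = oh * strideH - padH + i;
      int iw = ow * strideW - padW + j;
      if (ih >= 0 && ih < H && iw >= 0 && iw < W) {
        sum += input[((n * C + c) * H + ih) * W + iw]
             * weight[(c * kH + i) * kW + j];  // one filter per channel
      }
    }
  }
  output[idx] = sum;
}
```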

I tested the outputs and all gradients with large batch size & nInputPlane, and with various nOutputPlane values.

Although the weight shape is nOutputPlane x nInputPlane x kH x kW, which corresponds exactly to the cuDNN bindings, I don't like it much, since the weight tensor has to be transposed back and forth whenever you need almost any kind of matmul/matvec. I don't know if it's critical, but I left it as is just to be safe.

killeent commented 7 years ago

@shrubb can you post the script you used for benchmarking?

Also cc https://github.com/pytorch/pytorch/issues/1708

shrubb commented 7 years ago

@killeent here it is. I hope it's just to sanity-check my benchmarking. The script is dirty; I only use it for occasional checks.
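The linked script is Torch/Lua and not reproduced here; as a rough illustration, this is the usual shape of such a timing harness on the CUDA side (warm-up iterations plus CUDA events — `launch` is a hypothetical stand-in for whatever kernel invocation is being measured):

```cuda
#include <cuda_runtime.h>

// Average per-launch time in milliseconds, measured with CUDA events.
float timeKernelMs(void (*launch)(), int warmup = 10, int iters = 100) {
  for (int i = 0; i < warmup; ++i) launch();   // warm up before timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) launch();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);                  // wait for all launches to finish
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / iters;
}
```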

I don't pretend to bring in some super-optimal implementation; there's actually little new in this PR. It's just a better stub.

For anyone working on this, my previous attempts to employ "smarter" cuBLAS usage are in this branch.

soumith commented 7 years ago

cc: @ajtulloch @ngimel

soumith commented 7 years ago

there's also https://github.com/szagoruyko/pyinn/blob/master/pyinn/conv2d_depthwise.py, which is based on the Caffe code. It's a specialized conv. I presume you've already benchmarked something along these lines.

ajtulloch commented 7 years ago

One way to improve it is to add template specializations for the most common kH/kW/stride/dilation combinations (e.g., IMO it's worth adding a specialization for 3x3, stride 1 and re-benchmarking MobileNet/ShuffleNet).
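A sketch of what that specialization might look like (illustrative, reusing the assumed layout and names from the naive kernel above): the shape parameters become template arguments so the compiler can fully unroll the tap loops, and the host dispatches to the 3x3/stride-1 instantiation when it applies.

```cuda
template <int KH, int KW, int SH, int SW>
__global__ void depthwiseForwardFixed(
    const float* input, const float* weight, float* output,
    int N, int C, int H, int W, int outH, int outW, int padH, int padW)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= N * C * outH * outW) return;
  int ow = idx % outW, oh = (idx / outW) % outH;
  int c = (idx / (outW * outH)) % C, n = idx / (outW * outH * C);

  float sum = 0.f;
  #pragma unroll
  for (int i = 0; i < KH; ++i)
    #pragma unroll
    for (int j = 0; j < KW; ++j) {
      int ih = oh * SH - padH + i, iw = ow * SW - padW + j;
      if (ih >= 0 && ih < H && iw >= 0 && iw < W)
        sum += input[((n * C + c) * H + ih) * W + iw]
             * weight[(c * KH + i) * KW + j];
    }
  output[idx] = sum;
}

// Host-side dispatch; fall back to the generic kernel otherwise:
// if (kH == 3 && kW == 3 && strideH == 1 && strideW == 1)
//   depthwiseForwardFixed<3, 3, 1, 1><<<blocks, threads>>>(...);
```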

szagoruyko commented 7 years ago

@soumith the pyinn implementation is about the same as what Egor did: simple for loops, and a sum for the gradient w.r.t. the weight

shrubb commented 7 years ago

@ajtulloch I already tried this; it gives only a negligible improvement: [benchmark screenshot]

shrubb commented 7 years ago

Again, I believe this is NOT practical either. The code for the weight gradients is also super simple and could easily be accelerated as well. Benchmarking it for fun: [benchmark screenshot]
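For reference, a sketch of how simple such a gradient-w.r.t.-weight kernel can be (not the PR's code, and deliberately unoptimized; layout and names follow the assumptions above): one thread per (channel, tap) pair, reducing over batch and output positions.

```cuda
__global__ void depthwiseAccGradWeightNaive(
    const float* input, const float* gradOutput, float* gradWeight,
    int N, int C, int H, int W, int kH, int kW, int outH, int outW,
    int padH, int padW, int strideH, int strideW, float scale)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= C * kH * kW) return;
  int j = idx % kW, i = (idx / kW) % kH, c = idx / (kW * kH);

  // dW[c,i,j] = sum over n, oh, ow of input * gradOutput at matching positions.
  float sum = 0.f;
  for (int n = 0; n < N; ++n)
    for (int oh = 0; oh < outH; ++oh)
      for (int ow = 0; ow < outW; ++ow) {
        int ih = oh * strideH - padH + i;
        int iw = ow * strideW - padW + j;
        if (ih >= 0 && ih < H && iw >= 0 && iw < W)
          sum += input[((n * C + c) * H + ih) * W + iw]
               * gradOutput[((n * C + c) * outH + oh) * outW + ow];
      }
  gradWeight[idx] += scale * sum;  // accumulate, as in accGradParameters
}
```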

ajtulloch commented 7 years ago

@shrubb how much did you specialize in that benchmark? I.e., ideally you'd have a specialization for statically-known kH, kW, stride, dilation, plus separate paths in the kernel for definitely-in-bounds / possibly-out-of-bounds pixels?

shrubb commented 7 years ago

@ajtulloch I hardcoded (kH, kW, padH, padW, strideH, strideW) = (3, 3, 1, 1, 1, 1). Dilation is also fixed at (1, 1) in both kernels. I don't see why tracking in-bounds/out-of-bounds pixels would matter; memory reads are mostly contiguous anyway. To me, all this seems insignificant compared to the load of the main arithmetic routines.
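For concreteness, the split being discussed would look roughly like this (a sketch under the same 3x3, stride 1, pad 1 assumptions; whether it pays off is exactly what's being debated here): interior threads skip the per-tap bounds checks entirely, border threads keep them.

```cuda
__global__ void depthwiseForward3x3Split(
    const float* input, const float* weight, float* output,
    int N, int C, int H, int W, int outH, int outW)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= N * C * outH * outW) return;
  int ow = idx % outW, oh = (idx / outW) % outH;
  int c = (idx / (outW * outH)) % C, n = idx / (outW * outH * C);
  const float* in = input + (n * C + c) * H * W;
  const float* w  = weight + c * 9;  // 3x3 filter for this channel

  float sum = 0.f;
  // Receptive field rows oh-1..oh+1 and cols ow-1..ow+1 fully inside?
  bool interior = (oh >= 1 && oh + 1 < H && ow >= 1 && ow + 1 < W);
  if (interior) {            // definitely in bounds: unchecked loads
    #pragma unroll
    for (int i = 0; i < 3; ++i)
      #pragma unroll
      for (int j = 0; j < 3; ++j)
        sum += in[(oh - 1 + i) * W + (ow - 1 + j)] * w[i * 3 + j];
  } else {                   // border: checked loads
    for (int i = 0; i < 3; ++i)
      for (int j = 0; j < 3; ++j) {
        int ih = oh - 1 + i, iw = ow - 1 + j;
        if (ih >= 0 && ih < H && iw >= 0 && iw < W)
          sum += in[ih * W + iw] * w[i * 3 + j];
      }
  }
  output[idx] = sum;
}
```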

Anyway, once again, I don't understand what all this benchmarking and fighting for another 0.001 seconds is for. Nobody is going to use this. Everyone's on PyTorch, and NVIDIA is likely to release this kind of convolution in cuDNN sooner or later.

ngimel commented 7 years ago

PyTorch shares its backend with Torch, so this could be exposed in PyTorch. I don't know how it compares with pyinn in terms of performance, though. NVIDIA won't release it tomorrow, and people have been wanting to use depthwise separable convolutions for years, so anything helps.

szagoruyko commented 7 years ago

@ngimel it looks like cuDNN 7 supports grouped convolutions; would it be slower than an implementation like this?
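For context, the cuDNN 7 path is just a group count on the convolution descriptor — depthwise falls out when the group count equals the number of channels. A host-side sketch (error handling omitted; the wrapper function is illustrative):

```cuda
#include <cudnn.h>

void makeDepthwiseConvDesc(cudnnConvolutionDescriptor_t* convDesc,
                           int nInputPlane, int padH, int padW,
                           int strideH, int strideW)
{
  cudnnCreateConvolutionDescriptor(convDesc);
  cudnnSetConvolution2dDescriptor(*convDesc,
      padH, padW, strideH, strideW, /*dilationH=*/1, /*dilationW=*/1,
      CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
  // One group per input channel turns the grouped conv into a depthwise conv.
  cudnnSetConvolutionGroupCount(*convDesc, nInputPlane);
}
```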

fmassa commented 7 years ago

@szagoruyko it seems that your pyinn kernels for depthwise convolutions are better than cuDNN 7 with grouped convolutions. But I'll let @ngimel comment more on that.