Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.
Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64
I pick some popular ImageNet models and clock the time for a full forward + backward pass, averaged over 10 runs. Dropout and softmax layers are ignored.
Input is described as {batch_size}x{num_channels}x{image_width}x{image_height}, where batch_size is the number of images in a minibatch, num_channels is the number of channels per image, and image_width and image_height are the spatial dimensions of each image.
The CuDNN benchmarks are done using the Torch bindings; one could equally use the Caffe bindings or those of any other library. This note is here to clarify that Caffe (native) and Torch (native) refer to the convolution kernels each framework ships as its default fallback. Some frameworks, such as TensorFlow and Chainer, are benchmarked with CuDNN even though their entries do not say so explicitly; one might therefore conclude that these frameworks as a whole are faster than, for example, Caffe, which is not necessarily the case.
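The numbers below are collected with each framework's own bindings (the CuDNN rows, for instance, through Torch). Purely to illustrate the timing protocol described above, here is a minimal sketch using PyTorch, which is my assumption and not the harness used for these results; it times forward and backward passes of a single convolution, averaged over 10 runs, on the 128x3x224x224 input used for AlexNet below (the layer parameters are illustrative).

```python
# A minimal sketch of the timing protocol, assuming PyTorch and a CUDA GPU with
# cuDNN; NOT the Torch-7/Lua harness that produced the numbers in this document.
import time

import torch

def bench(layer, input_shape, runs=10):
    """Average forward and backward time (ms) over `runs` timed passes."""
    layer = layer.cuda()
    x = torch.randn(*input_shape, device="cuda", requires_grad=True)

    # One warm-up pass so cuDNN algorithm selection is not included in the timing.
    layer(x).sum().backward()
    torch.cuda.synchronize()

    fwd = bwd = 0.0
    for _ in range(runs):
        torch.cuda.synchronize()
        t0 = time.time()
        y = layer(x)
        torch.cuda.synchronize()
        t1 = time.time()
        y.sum().backward()
        torch.cuda.synchronize()
        t2 = time.time()
        fwd += t1 - t0
        bwd += t2 - t1
    return 1000 * fwd / runs, 1000 * bwd / runs

# Illustrative layer: an AlexNet-style first convolution on a 128x3x224x224 minibatch.
conv = torch.nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
f_ms, b_ms = bench(conv, (128, 3, 224, 224))
print(f"forward: {f_ms:.1f} ms   backward: {b_ms:.1f} ms")
```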
AlexNet (One Weird Trick paper) - Input 128x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 71 | 25 | 46 |
Nervana-neon-fp16 | ConvLayer | 78 | 25 | 52 |
CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 81 | 27 | 53 |
TensorFlow | conv2d | 81 | 26 | 55 |
Nervana-neon-fp32 | ConvLayer | 87 | 28 | 58 |
fbfft (Torch) | fbnn.SpatialConvolution | 104 | 31 | 72 |
Chainer | Convolution2D | 177 | 40 | 136 |
cudaconvnet2* | ConvLayer | 177 | 42 | 135 |
CuDNN[R2] * | cudnn.SpatialConvolution | 231 | 70 | 161 |
Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |
CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
Caffe-CLGreenTea | ConvolutionLayer | 1442 | 210 | 1232 |
Overfeat [fast] - Input 128x3x231x231
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Nervana-neon-fp16 | ConvLayer | 176 | 58 | 118 |
Nervana-neon-fp32 | ConvLayer | 211 | 69 | 141 |
CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 242 | 86 | 156 |
CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 268 | 94 | 174 |
TensorFlow | conv2d | 279 | 90 | 189 |
fbfft (Torch) | SpatialConvolutionCuFFT | 342 | 114 | 227 |
Chainer | Convolution2D | 620 | 135 | 484 |
cudaconvnet2* | ConvLayer | 723 | 176 | 547 |
CuDNN[R2] * | cudnn.SpatialConvolution | 810 | 234 | 576 |
Caffe | ConvolutionLayer | 823 | 355 | 468 |
Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |
CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
Caffe-CLGreenTea | ConvolutionLayer | 2857 | 616 | 2240 |
OxfordNet [Model-A] - Input 64x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Nervana-neon-fp16 | ConvLayer | 254 | 82 | 171 |
Nervana-neon-fp32 | ConvLayer | 320 | 103 | 217 |
CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 471 | 140 | 331 |
CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 529 | 162 | 366 |
TensorFlow | conv2d | 540 | 158 | 382 |
Chainer | Convolution2D | 885 | 251 | 632 |
fbfft (Torch) | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
cudaconvnet2* | ConvLayer | 1229 | 408 | 821 |
CuDNN[R2] * | cudnn.SpatialConvolution | 1099 | 342 | 757 |
Caffe | ConvolutionLayer | 1068 | 323 | 745 |
Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |
CL-nn (Torch) | SpatialConvolutionMM | 3437 | 875 | 2562 |
Caffe-CLGreenTea | ConvolutionLayer | 5620 | 988 | 4632 |
GoogleNet V1 - Input 128x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Nervana-neon-fp16 | ConvLayer | 230 | 72 | 157 |
Nervana-neon-fp32 | ConvLayer | 270 | 84 | 186 |
TensorFlow | conv2d | 445 | 135 | 310 |
CuDNN[R4]-fp16 (Torch) | cudnn.SpatialConvolution | 462 | 112 | 349 |
CuDNN[R4]-fp32 (Torch) | cudnn.SpatialConvolution | 470 | 130 | 340 |
Chainer | Convolution2D | 687 | 189 | 497 |
Caffe | ConvolutionLayer | 1935 | 786 | 1148 |
CL-nn (Torch) | SpatialConvolutionMM | 7016 | 3027 | 3988 |
Caffe-CLGreenTea | ConvolutionLayer | 9462 | 746 | 8716 |
Layer-wise totals: forward + backward time summed over the five convolution layer configurations listed further below.
Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
cuda-convnet2 * | ConvLayer | 977 | 201 | 776 |
cuda-convnet** | pylearn2.cuda_convnet | 1077 | 312 | 765 |
CuDNN R2 * | cudnn.SpatialConvolution | 1019 | 269 | 750 |
Theano | CorrMM | 1225 | 407 | 818 |
Caffe | ConvolutionLayer | 1231 | 396 | 835 |
Torch-7 | SpatialConvolutionMM | 1265 | 418 | 877 |
DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
cherry-picking**** | best per layer | 235 | 79 | 155 |
The table below is NOT updated for the Titan X. These numbers were measured on a Titan Black and are kept only for informational and legacy purposes.
Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 |
Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
ccv | ccv_convnet_layer | 809+bw | 809 | |
Theano (legacy) | conv2d | 70774 | 3833 | 66941 |
The five layer configurations (columns L1-L5 in the breakdown tables below) are:
L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
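For reference, here is a small sketch that expresses the five configurations above as plain Python data and derives output sizes and multiply-accumulate counts. It assumes no zero-padding (the list above does not state a padding), so the derived output sizes are an assumption.

```python
# The five layer configurations above as plain data. Output sizes assume a
# "valid" convolution (no zero-padding), which the list above does not state.
LAYERS = [
    # name, square input size, batch, in_channels, out_channels, kernel, stride
    dict(name="L1", size=128, batch=128, in_ch=3,   out_ch=96,  kernel=11, stride=1),
    dict(name="L2", size=64,  batch=128, in_ch=64,  out_ch=128, kernel=9,  stride=1),
    dict(name="L3", size=32,  batch=128, in_ch=128, out_ch=128, kernel=9,  stride=1),
    dict(name="L4", size=16,  batch=128, in_ch=128, out_ch=128, kernel=7,  stride=1),
    dict(name="L5", size=13,  batch=128, in_ch=384, out_ch=384, kernel=3,  stride=1),
]

for cfg in LAYERS:
    out = (cfg["size"] - cfg["kernel"]) // cfg["stride"] + 1  # valid convolution
    # Multiply-accumulates for one forward pass over the whole minibatch.
    macs = (cfg["batch"] * cfg["out_ch"] * out * out
            * cfg["in_ch"] * cfg["kernel"] ** 2)
    print(f'{cfg["name"]}: {cfg["batch"]}x{cfg["in_ch"]}x{cfg["size"]}x{cfg["size"]} '
          f'-> {cfg["batch"]}x{cfg["out_ch"]}x{out}x{out}, {macs / 1e9:.1f} GMACs')
```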
Columns L1-L5 are per-layer forward-pass times in milliseconds; Total is their sum and matches the forward column of the layer-wise summary table above.
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 57 | 27 | 6 | 2 | 9 | 101 |
cuda-convnet2 * | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
cuda-convnet** | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
Caffe | ConvolutionLayer | 93 | 136 | 116 | 24 | 27 | 396 |
Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
cherry-picking**** | best per layer | 36 | 27 | 6 | 2 | 8 | 79 |
Columns L1-L5 are per-layer backward-pass times in milliseconds; Total is their sum and matches the backward column of the layer-wise summary table above.
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 76 | 45 | 12 | 4 | 18 | 155 |
cuda-convnet2 * | ConvLayer | 103 | 467 | 162 | 15 | 29 | 776 |
cuda-convnet** | pylearn2.cuda_convnet | 136 | 433 | 147 | 15 | 34 | 765 |
CuDNN R2 | cudnn.SpatialConvolution | 139 | 401 | 159 | 19 | 32 | 750 |
Theano | CorrMM | 179 | 405 | 174 | 29 | 31 | 818 |
Caffe | ConvolutionLayer | 200 | 405 | 172 | 28 | 30 | 835 |
Torch-7 | nn.SpatialConvolutionMM | 206 | 432 | 178 | 29 | 32 | 877 |
DeepCL | ConvolutionLayer | 484 | 2144 | 747 | 59 | 198 | 3632 |
cherry-picking**** | best per layer | 76 | 45 | 12 | 4 | 18 | 155 |
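As a cross-check on the two breakdown tables, the per-layer columns sum to the forward and backward totals reported in the layer-wise summary table further up. For example, for fbfft:

```python
# Per-layer times (ms) for fbfft / SpatialConvolutionCuFFT, copied from the
# two breakdown tables above (columns L1..L5).
fbfft_forward  = [57, 27, 6, 2, 9]
fbfft_backward = [76, 45, 12, 4, 18]

assert sum(fbfft_forward) == 101   # forward total in the summary table
assert sum(fbfft_backward) == 155  # backward total in the summary table
print("forward+backward:", sum(fbfft_forward) + sum(fbfft_backward), "ms")  # 256 ms
```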