Benchmark TensorFlow - Githubissues

soumith commented 8 years ago

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the Imagenet Winners. When I saw issues with the numbers, memory etc., I emailed @Yangqing to confirm what I'm seeing, and that it is expected.

With that disclaimer out of the way, here are some things that you should know about TensorFlow (as of the pip version that I installed today):

in-place ReLU seems non-existent in practice.
- Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
Supports CuDNN R2. No R3 support yet; Yangqing says the next version they are going to support is likely R4.

Coming to the benchmarks:

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)
VGG with batchsize 64 goes Out of Memory (Edit: VGG memory issue was solved by using the BFC allocator updated by GOOG). ~~The largest batch-size I could fit is 32 (tried 32, 64).~~
I've also computed Torch7+CuDNN-R2 baselines for these batch-sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	96	32	64
Nervana (Neon)	101	32	69
CuDNN-R2 (Torch)	231	70	161
TensorFlow	326	96	230

Overfeat [fast] - Input 128x3x231x231

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	326	113	213
fbfft (Torch)	342	114	227
CuDNN-R2 (Torch)	810	234	576
TensorFlow	1084	316	768

OxfordNet [Model-A] - Input 64x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
Nervana	590	180	410
CuDNN-R3 (Torch)	615	196	418
CuDNN-R2 (Torch)	1099	342	757
TensorFlow	1840	545	1295

GoogleNet V1 - Input 16x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R2 (Torch)	564	174	390
TensorFlow	590	54	536

Note that at batch size of 16, googlenet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison, but not practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

hughperkins commented 8 years ago

hjk41 wrote:

+1 for mxnet. Dynamic GPU memory allocation does have a big impact on performance. A simple memory allocator can significantly reduce the overhead. A smarter allocator which reuses blocks with best-fit can almost eliminate the overhead completely.

Hi mxnet guys, the title of this thread is 'Benchmark TensorFlow' ;-) I think you could create a new issue to request mxnet, using https://github.com/soumith/convnet-benchmarks/issues/new

[edit: looks like Soumith has created an issue for mxnet here: https://github.com/soumith/convnet-benchmarks/issues/68 ]
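To make hjk41's point about best-fit block reuse concrete, here is a toy sketch in Python. It is not the allocator used by mxnet or TensorFlow; the class name `BestFitPool` and the `fresh_allocations` counter are made up for illustration, and a block is represented only by its size.

```python
import bisect


class BestFitPool:
    """Toy memory pool: caches freed blocks and reuses the smallest one that fits."""

    def __init__(self):
        self.free_sizes = []        # sorted sizes of cached free blocks
        self.fresh_allocations = 0  # stand-in for slow device allocations

    def alloc(self, size):
        # Best fit: the first cached block whose size is >= the request.
        i = bisect.bisect_left(self.free_sizes, size)
        if i < len(self.free_sizes):
            return self.free_sizes.pop(i)
        # Nothing cached is large enough, so pay for a fresh allocation.
        self.fresh_allocations += 1
        return size

    def free(self, block_size):
        # Keep the block in the pool instead of returning it to the device.
        bisect.insort(self.free_sizes, block_size)


pool = BestFitPool()
for _ in range(3):                      # three identical "training steps"
    blocks = [pool.alloc(s) for s in (100, 50, 200)]
    for b in blocks:
        pool.free(b)
```

In this toy run only the first step pays for fresh allocations; later steps are served entirely from the pool, which is the effect described in the quote.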

ces-bertino commented 8 years ago

I was curious about the performance of Tensorflow using CUDA 7.5 and CUDNN 7.0. I modified the build to use them and rebuilt from source. I then ran @soumith's benchmark scripts for alexnet and overfeat. My PC is getting old (Intel Core 2 quad), 16 GB RAM, NVIDIA Titan X, Ubuntu 15.04 x86_64

Alexnet: forward/backward 290ms, forward 78ms (1.12x improvement for f/b).
Overfeat: forward/backward 1040ms, forward 264ms (1.04x improvement for f/b).

So not much of a speedup by only swapping the CUDA libraries.
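For reference, the improvement factors quoted above are just ratios of total forward/backward time. A minimal sketch, assuming the baseline is the TensorFlow row from soumith's tables (note that those were measured on a different machine, so the ratio is only indicative):

```python
# Total forward+backward times in ms.
# baseline_ms: TensorFlow rows from soumith's tables.
# rebuilt_ms:  numbers reported in this comment for the CUDA 7.5 rebuild.
baseline_ms = {"AlexNet": 326, "Overfeat": 1084}
rebuilt_ms = {"AlexNet": 290, "Overfeat": 1040}


def speedup(before_ms, after_ms):
    """How many times faster the 'after' run is than the 'before' run."""
    return before_ms / after_ms


speedups = {net: speedup(baseline_ms[net], rebuilt_ms[net]) for net in baseline_ms}
for net, factor in speedups.items():
    print(f"{net}: {factor:.2f}x")
# prints:
# AlexNet: 1.12x
# Overfeat: 1.04x
```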

@soumith, one thing I did notice is that your benchmarks for Caffe are quite different from what I got using CUDA 7.5. I think the benchmarks you used are with CUDA 7.0, right? When I ran your "run_imagenet.sh" script on my setup, I got much better results:
Alexnet: forward/backward 171ms, forward 41ms (1.89x improvement for F/B)
Overfeat: forward/backward 601ms, forward 133ms (1.37x improvement)
Googlenet: F/B 624ms, F 174ms (3.1x improvement)

It's not clear to me if CUDA 7.5 is supported for Caffe, but in https://github.com/BVLC/caffe/wiki/Installation, they provide 7.0 and 7.5 Docker images. However, in the main installation instructions they say to use 7.0

I attached the log files for his benchmarks running on my PC.

Tensorflow:

tensorflow_alexnet.txt tensorflow_overfeat.txt

Caffe:

output_alexnet.txt output_googlenet.txt output_overfeat.txt output_vgg_a.txt

ytsaig commented 8 years ago

@soumith Thanks for an invaluable community service!

robotsorcerer commented 8 years ago

Nice!

xingdi-eric-yuan commented 8 years ago

@soumith Thank you.

atenpas commented 8 years ago

@soumith Thank you for benchmarking!!

wickedfoo commented 8 years ago

@scott-gray Sorry to spam @soumith's TF discussions, but when I last played with integer division via magic number mul-and-shift on GPU, the performance I got (on K40 though) was about the same as straightforward division by unsigned int32; the compiler seemed to have strength reductions that it performed in this case. However, there was lower register usage, so using this technique in a kernel that actually does other things (like transposition) would probably help. This was at the SASS level on CUDA 6.5? though.

https://github.com/facebook/fbcuda/commit/d5c8b38b4071b0151b27293a67b27c3868a0f948

jdemouth commented 8 years ago

The compiler does a good job when the constant is known in advance. Was it your case?

ramarlina commented 8 years ago

Nice! Thank you @soumith

scott-gray commented 8 years ago

@wickedfoo The advantage of calculating the magic numbers manually is that the divisor is typically parameterized and so the compiler can't compute magic numbers ahead of time. So it then falls back to using the floating point rcp operator and doing a bunch of corrections to make up for the potentially shorter mantissa of float (23 vs 32 bits).

To do an integer division and modulus with magic numbers reduces to just this code:

// j   = jrst / RST
// rst = jrst % RST
int j   = jrst * magic_RST; j >>= shift_RST;
int rst = jrst - j * RST;

If you know all those numbers fit in 16 bits you can use vmad from ptx or sass. That looks like this:

VMAD.U16.U16 j, jrst, magic_RST, RZ;
SHR.U32      j, j, shift_RST;
VMAD.U16.U16 rst, -j, RST, jrst;

Otherwise your multiplications are going to expand out to 3 XMADS each, regardless of the datatype used. It would be nice if the compiler was a little smarter about multiplication by using the minimal number of instructions for the given data types.

For larger values that might require 64 bit math, I use something like this:

      MOV  magicPQ,    param_magic_PQ;
      IADD negPQ, RZ, -param_grid_PQ;

      ISETP.NE.AND P1, PT, magicPQ, 1, PT;

      // m = blkMPQ / PQ
  @P1 XMAD     div1, blkMPQ,    magicPQ,    RZ;
  @P1 XMAD     div2, blkMPQ,    magicPQ.H1, RZ;
  @P1 XMAD     div3, blkMPQ.H1, magicPQ.H1, RZ;
  @P1 XMAD.CHI div1, blkMPQ.H1, magicPQ,    div1;
  @P1 IADD3.RS m, div1, div2, div3;
  @P1 SHR.U32  m, m,      param_shift_PQ;
 @!P1 SHR.U32  m, blkMPQ, param_shift_PQ;

      // pq = blkMPQ % PQ
      XMAD pq, negPQ, m, blkMPQ;
      XMAD.PSL pq, negPQ.H1, m, pq;

Integer division is essential for these multi dimensional tensors where you can't fit everything in just 3 block coordinates. For more advanced uses, you can leverage it to pack all your coordinates into a single blockIdx.x value, then completely remap the order in which the indexes are scheduled. I'm able to achieve 95% L2 hit rates using this in my winograd kernels. This is essential for good performance as the small 32x32 batched gemm tile is pretty high bandwidth.
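The host-side half of the magic-number trick above (choosing `magic_RST` and `shift_RST` for a divisor that is only known at launch time) can be sketched in Python. This is an illustration under stated assumptions, not the code used in the kernels discussed here: `find_magic` and `magic_divmod` are hypothetical helper names, and the search simply takes the smallest shift that satisfies a sufficient exactness condition. Python integers never overflow, so in a real kernel you would additionally have to check that `n * magic` fits the register width.

```python
def find_magic(divisor, max_numerator):
    """Return (magic, shift) such that (n * magic) >> shift == n // divisor
    for every 0 <= n <= max_numerator."""
    if divisor < 1:
        raise ValueError("divisor must be positive")
    shift = 0
    while True:
        magic = -(-(1 << shift) // divisor)     # ceil(2**shift / divisor)
        error = magic * divisor - (1 << shift)  # 0 <= error < divisor
        # Sufficient condition for the result to be exact over the whole range.
        if error * max_numerator < (1 << shift):
            return magic, shift
        shift += 1


def magic_divmod(n, divisor, magic, shift):
    """Quotient and remainder using one multiply, one shift, one multiply-subtract."""
    q = (n * magic) >> shift
    return q, n - q * divisor


# Example with a hypothetical divisor of 9 and 16-bit numerators.
magic_9, shift_9 = find_magic(9, 65535)
assert magic_divmod(1000, 9, magic_9, shift_9) == divmod(1000, 9)
```

The pair is computed once per divisor on the host and passed to the kernel as parameters, which matches the `param_magic_PQ` / `param_shift_PQ` naming in the SASS above.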

karenyyng commented 8 years ago

Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.

benoitsteiner commented 8 years ago

@scott-gray TensorFlow already uses fast integer division using the code in http://github.com/tensorflow/tensorflow/blob/master/third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorIntDiv.h. One of the issues is that a lot of the TensorFlow kernels use 64 bit integers to index tensors, which ends up slowing things down on GPU. This is being fixed.

hughperkins commented 8 years ago

karenyyng wrote:

Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.

Ideally, they are all learning the exact same model, so the outputs should be identical (to within the bounds of rounding accuracy). A correctness check is not a bad idea though.

soumith commented 8 years ago

@vrv thanks a lot, trying out the BFC allocator now for vgg and googlenet models.

soumith commented 8 years ago

@ces-bertino the numbers I have with Caffe are with Caffe's native kernels (that's why the entry is marked as "Caffe (native)"). I presume you have CuDNN, and hence have the speedups. To compare your entries, look at the entry marked as CuDNN.

soumith commented 8 years ago

@vrv updated the table for VGG. Googlenet still goes OOM at batch size 128, but if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.

ajtulloch commented 8 years ago

@Yangqing, @benoitsteiner do you have a sense for how performance for these benchmarks depends on nvcc vs gpucc? Are the 10%-50% numbers in http://llvm.org/devmtg/2015-10/slides/Wu-OptimizingLLVMforGPGPU.pdf for ic1/ic2 applicable here?

rajatmonga commented 8 years ago

@ajtulloch gpucc does hide these latency issues vs nvcc as it seems to do a much better job at optimization. Using gpucc brings TensorFlow pretty close to the cuDNN[R2] numbers for AlexNet. We are working on bridging that gap for nvcc by addressing a number of specific issues that @benoitsteiner and @Yangqing mentioned earlier.

benoitsteiner commented 8 years ago

@ajtulloch The 2 main reasons why gpucc can generate faster code than nvcc are:

The fact that gpucc can replace 64 bit integer divisions with 32 bit divisions if the values stored in the 64 bit integers can actually fit in 32 bit. As we update the TensorFlow convolution kernels to use 32 bit indices the performance of the code generated by nvcc will start to approach that of the code generated by gpucc.
The fact that clang supports C++11 constant expressions much better than nvcc. Constant expressions allow us to generate much more efficient CUDA kernels. Unfortunately for the time being we have to disable this feature since the corresponding code doesn't compile with nvcc. I am rewriting the corresponding code to make it compatible with nvcc 7.5, and hopefully with nvcc 7.0 as well.

ajtulloch commented 8 years ago

@rajatmonga, @benoitsteiner that makes sense, thanks for that.

scott-gray commented 8 years ago

@benoitsteiner I'm curious how you guys are using integer division in your implementation. The only places I find a need to use it are in custom kernels where I'm unpacking multiple coordinates from a compound index.

On a related note, I should mention that I also have another simple technique I developed for when you don't know ahead of time the value of the divisor. It looks something like this:

// rcpRST = 1 / RST
I2F.F32.S32 rcpRST, RST;
MUFU.RCP rcpRST, rcpRST;

// c = crst / RST
I2F.F32.S32 crst, crst;
FMUL c, crst, rcpRST;
FFMA c, c, 5.9604644775390625e-08, c;
F2I.S32.F32.TRUNC c, c;

// rst = crst % RST
VMAD.U16.U16 rst, -c, RST, crst;

For most values the floating point reciprocal gets you the correct value. It's just when the numerator and denominator are very close that you need to correct for the missing precision in float32. This is a lot less code than the compiler would generate and is accurate for the range of values I need it for.
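The reciprocal-based division above can be emulated on the CPU to see why the extra FFMA matters. This is a rough sketch under two assumptions: float32 rounding is emulated with `struct`, and the reciprocal is correctly rounded, whereas the hardware MUFU.RCP is an approximation, so the exact range of validity on a GPU may differ. The names `f32` and `rcp_divmod` are made up for this example.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


NUDGE = 5.9604644775390625e-08  # 2**-24, the constant used in the FFMA above


def rcp_divmod(n, divisor):
    """Emulate n // divisor and n % divisor via a float32 reciprocal."""
    rcp = f32(1.0 / f32(float(divisor)))  # stand-in for I2F + MUFU.RCP
    c = f32(f32(float(n)) * rcp)          # FMUL: may land just below an exact quotient
    c = f32(c * NUDGE + c)                # FFMA: nudge up before truncating
    q = int(c)                            # F2I with truncation
    return q, n - q * divisor


assert rcp_divmod(9, 9) == (1, 0)
assert rcp_divmod(1000, 9) == divmod(1000, 9)
```

Without the nudge, a product that rounds to just under an exact integer quotient would truncate to the wrong value; the nudge is small enough not to push non-integer quotients over the next integer for the 16-bit ranges tested here.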

hughperkins commented 8 years ago

are in custom kernels where I'm unpacking multiple coordinates from a compound index

Somewhat related: a question that I've been wondering about, and never quite got around to measuring: if we have two 8-bit integers, is it faster to store them in separate registers/variables, or faster to pack them into one register/variable, using bit-shifting? (Edit: I suppose this is a bit vague really, since it entirely depends on how they're being used ... but I guess the trade-off I'm thinking about is: packing multiple values into a few registers will reduce register pressure, but maybe the increase in processing time from all the bit-shifting offsets any benefit?) (Edit2: I suppose what I mean actually is, are there any best-practices/guidelines as far as this goes?)
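For concreteness, the packing being asked about looks like this in Python (the helper names `pack2` and `unpack2` are made up for illustration). It only shows the extra shift and mask work per access; it says nothing about actual GPU register allocation or throughput.

```python
MASK8 = 0xFF


def pack2(a, b):
    """Pack two unsigned 8-bit values into one integer: a in bits 0-7, b in bits 8-15."""
    if not (0 <= a <= MASK8 and 0 <= b <= MASK8):
        raise ValueError("values must fit in 8 bits")
    return a | (b << 8)


def unpack2(packed):
    """Recover both values; each read costs a mask, and the second also a shift."""
    return packed & MASK8, (packed >> 8) & MASK8


packed = pack2(17, 200)
assert unpack2(packed) == (17, 200)
```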

jdemouth commented 8 years ago

If you have enough registers, do not pack the 8-bit numbers and use one register per element. Now, how do we define "enough registers"? Well, if the occupancy you get allows you to have enough warp parallelism (together with enough instruction level parallelism) to cover the latencies, you are good. In general, unless you have a clear use case, do not pack.

scott-gray commented 8 years ago

That would completely depend on the context in which you are using them. If you're short on register space, packing them might avoid some register spilling. Otherwise it's probably better to keep them separate. I'd also take a look at the video instructions like VMAD, VADD, VABSDIFF, etc. These can operate directly on packed 8 bit values. But in this mode these instructions are unfortunately only half throughput. Maybe this isn't a big deal for your application but, if you wanted to write a super efficient 8 bit gemm core, they're not ideal. These instructions are full throughput with packed 16 bit values and that is very interesting.. at least until Pascal rolls out with native fp16 support (or if you get a hold of an sm_53 X1)

Looks like @jdemouth beat me to it.

hughperkins commented 8 years ago

Thanks! :-)

benoitsteiner commented 8 years ago

@scott-gray We use integer division in order to extract the individual coordinates of a tensor coefficient from its compound index. We often use compound indices for 2 reasons:

they are independent from the rank and the shape of the tensor. This simplifies the fusion of primitive tensor operations. For example, if you reshape a 4D tensor into a 3D tensor all the coordinates need to be adjusted, but the compound indices remain the same.
they save registers compared to using individual coordinates. This often makes a significant difference on CPUs which don't have nearly as many registers as GPUs.
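The first point can be illustrated with a small Python sketch, assuming row-major layout. The helper names `unravel` and `ravel` are hypothetical and are not TensorFlow or Eigen APIs. Extracting coordinates takes one integer division per dimension, which is where the fast integer division discussed earlier comes in.

```python
def unravel(index, shape):
    """Split a row-major compound index into per-dimension coordinates."""
    coords = []
    for dim in reversed(shape):
        index, c = divmod(index, dim)  # one integer division per dimension
        coords.append(c)
    return tuple(reversed(coords))


def ravel(coords, shape):
    """Inverse of unravel: fold coordinates back into a compound index."""
    index = 0
    for c, dim in zip(coords, shape):
        index = index * dim + c
    return index


shape_4d = (2, 3, 4, 5)
shape_3d = (6, 4, 5)   # the same 120 elements, reshaped

index = 97
print(unravel(index, shape_4d))  # (1, 1, 3, 2)
print(unravel(index, shape_3d))  # (4, 3, 2)
```

The coordinates differ between the two shapes, but the compound index 97 refers to the same element in both, so an op written against compound indices does not need to change when the tensor is reshaped.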

scott-gray commented 8 years ago

@benoitsteiner Ok, that makes sense now. For basic elementwise operations our backend just automatically reshapes all tensors involved in the kernel to the most efficient 2d shape. For broadcast/reduction/take/transpose type operations, it only currently supports those in 2d and requires the user to reshape things prior to performing those ops. This covers 99% of the use cases we've encountered but it sometimes does place a little extra burden on the user. On the other hand it is extremely fast. Sounds like you guys are shooting for much more general ndarray support in which case what you're doing sounds ideal.

milijan commented 8 years ago

Disclaimer: I am totally new to tensorflow and cudnn so I may not know what I am doing but very keen :)

So I built from source then realised that I already had R3 installed; I did what any other sensible person would do and replaced all R2 references with R3 and all seems well as far as running the models included.

@soumith @Yangqing, am I setting myself for trouble here? one word will suffice :)

Yangqing commented 8 years ago

@milijan You should be running fine. R3 seems to be binary compatible in the sense that most of the functions in R2 still exist in R3. I think R4 may break such a hack because it will deprecate a few functions.

In case you are wondering, the reason you are not seeing any speedup by going to R3 may be as follows: in Tensorflow we hard code the cudnn algorithm to be NO_WORKSPACE, so some faster convolution paths are not being selected for now. Upcoming changes should further speed things up.

milijan commented 8 years ago

@Yangqing thanks! :+1:

raingo commented 8 years ago

A question with GoogleNet batch size.

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)

I can use up to 640 images per batch, using the graph from the tensorflow android example: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android

Why can't tensorflow handle 32 images in this benchmark?

My setup:

up to date tensorflow (9c3043ff3bf31a6a81810b4ce9e87ef936f1f529), compiled from scratch
K80 GPU with 12 GB memory

Here is the code to load the inception graph:

import tensorflow as tf

INPUT_SIZE = 224
OUTPUT_SIZE = 1024

# input should be: BS x INPUT_SIZE x INPUT_SIZE x 3 tensor
# output: BS x OUTPUT_SIZE
def inferences(images, name='inception'):
    # `name` was undefined in the snippet as posted; it is made a parameter
    # here, and the default value is an arbitrary placeholder.
    graph_def = tf.GraphDef()
    # read the serialized GraphDef as bytes
    with open('./tensorflow_inception_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    for n in graph_def.node:
        # control device from caller
        n.device = ''
    tf.import_graph_def(graph_def, input_map={'input:0': images}, name=name)
    graph = tf.get_default_graph()
    output = graph.get_tensor_by_name(name + '/avgpool0:0')
    return tf.squeeze(output)

Given the big difference between 640 and 32, there must be something wrong, either with my setup or with this benchmark. Because tensorflow pre-allocates all memory, I don't know exactly how much memory is consumed.

@soumith @Yangqing Please help!

vrv commented 8 years ago

@raingo: when training we keep the activations for the lower layers to compute the gradients, so a lot of intermediate memory is used during each training step. When doing inference, you only need the activations around to compute the next operation(s), and then they can be freed, so a lot less intermediate state is needed.
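A back-of-the-envelope sketch of this difference, under strong simplifying assumptions: a plain chain of layers (no Inception-style branches), activations only (no weights, gradients, or cuDNN workspace), and made-up per-layer sizes. The function names are hypothetical.

```python
def training_peak(activation_mb):
    """Backprop needs every layer's forward activation, so all stay resident."""
    return sum(activation_mb)


def inference_peak(activation_mb):
    """Only a layer's input and output must coexist; earlier buffers can be freed.
    Assumes at least two entries in the list."""
    return max(a + b for a, b in zip(activation_mb, activation_mb[1:]))


# Hypothetical activation sizes in MB for a 4-layer chain at some batch size.
activation_mb = [400, 300, 200, 100]

print(training_peak(activation_mb))   # 1000
print(inference_peak(activation_mb))  # 700
```

The gap widens as the network gets deeper, since the training figure grows with the number of layers while the inference figure depends only on the largest adjacent pair.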

Also, based on the comment in https://github.com/soumith/convnet-benchmarks/issues/66#issuecomment-156357178, it sounds like GoogleNet training with TF might now work for up to batch 64, but not batch 128. (I'd be surprised if batch 32 doesn't work at HEAD, for sure.)

raingo commented 8 years ago

@vrv Got it. Thanks!

DavidWiesner commented 8 years ago

@soumith They made some changes to tensorflow ("TensorFlow: Improve performance of Alexnet"). Can you update the benchmark for Alexnet?

ozabluda commented 8 years ago

@soumith

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128) [...] if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.

For comparison, here are my measurements of approximate peak memory usage with Torch/cuDNNv3 on Titan-X:

AlexNet (128): 3 GB
OverFeat (128): 5 GB
VGG Model-A (128): OOM
GoogLeNet (128): 9 GB

VGG Model-A-11 (64): 8 GB
VGG Model-B-13 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-D-16 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-E-19 (64): 12 GB (I think this may fall back on slower algos due to tight memory)

VGG Model-A-11 (96): 11 GB

alexatknit commented 8 years ago

@soumith Since its release I've seen pretty dramatic improvements in tensorflow's memory management and performance. I think it may be time to benchmark 0.6.0.

soumith commented 8 years ago

@alexatknit will do. i will take some time one of these days to do MXNet, Chainer and TF 0.6. Have been a bit busy lately with wrapping up research.

hgaiser commented 8 years ago

I am looking forward to the updated comparison. Have you found time to look into it?

soumith commented 8 years ago

TensorFlow Trunk as of 1 hour ago (post 0.6 release) numbers:

AlexNet (One Weird Trick paper) - Input 128x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	96	32	64
Nervana (Neon)	101	32	69
CuDNN-R2 (Torch)	231	70	161
TensorFlow 0.5	326	96	230
TensorFlow 0.6+	292	70	222

Overfeat [fast] - Input 128x3x231x231

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	326	113	213
fbfft (Torch)	342	114	227
CuDNN-R2 (Torch)	810	234	576
TensorFlow 0.5	1084	316	768
TensorFlow 0.6+	856	204	652

OxfordNet [Model-A] - Input 64x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
Nervana	590	180	410
CuDNN-R3 (Torch)	615	196	418
CuDNN-R2 (Torch)	1099	342	757
TensorFlow 0.5	1840	545	1295
TensorFlow 0.6+	1656	347	1309

GoogleNet V1 - Input 128x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	431	117	313
TensorFlow 0.5	OOM	OOM	OOM
TensorFlow 0.6+	1237	246	991

There you go. The new logs are all checked in.

rajatmonga commented 8 years ago

@soumith Thanks for running the numbers again. I know you have been asked to do this a number of times lately and it takes you away from your research. Having these benchmarks has been greatly useful for everyone.

After your run we realized we seem to have regressed in performance since the 0.6.0 release (mostly from our switch over to the public Eigen branch) and over the last few days @zheng-xq and @benoitsteiner along with others have made improvements to get back the performance. When running the benchmarks again at commit d1b8333, we get the following numbers:

Model	Total (ms)	Forward (ms)	Backward (ms)
AlexNet	229	69	160
Overfeat [fast]	839	203	636
OxfordNet	1216	329	887
GoogleNet V1 - Input 128x3x224x224	815	234	581

This is measured on an unsuperclocked Titan-X with the default power-limit 250W.
For consistency, between each run, we wait for a few minutes for GPU to cool down to room temperature.

These results are also in line with what we see at 0.6.0 release.

We are also looking into setting up performance benchmarks with the builds so we don't hit such performance regressions.

Again, thanks for all your updates.

vincenzocaselli commented 8 years ago

Does anyone have experience with, or comparisons against, DL4J (http://deeplearning4j.org)?

soumith commented 8 years ago

@rajatmonga just got back from vacay. It's cool that you guys are setting up contbuilds for perf regressions.

However, I don't get the numbers that you seem to be getting with tensorflow as of yesterday ( a27d844e05447e65aa279ae5269a2d75590f46f6 ). The numbers are slightly better, but not quite the improvement that you are seeing.

Look here for the new numbers: https://github.com/soumith/convnet-benchmarks/commit/1f09e1e3b2841a2f58b83aa1d078a830bca4508f

rajatmonga commented 8 years ago

@soumith Thanks for running the benchmarks again. It is possible there are some memory related regressions that are hurting performance again. What you have right now is good, let's not worry about this.

We are working on getting cuDNN R4 fully supported and will address the remaining performance issues in that context. We may ping this thread once we have a full release with R4, at which point it will be worthwhile rerunning the benchmarks, likely for many of the libraries.

Also, let me know if we can help you with this project in any way - it is very useful to the community, but I am sure it takes a lot of your time as well. Thanks for keeping this going!

rajatmonga commented 8 years ago

Yes, that is on our list of tasks and is quite important to make sure we don't have performance regressions. We haven't been able to get to it yet.

On Thu, Feb 4, 2016 at 9:11 AM Madder notifications@github.com wrote:

Has anyone thought of running these benchmarks periodically as part of tensorflow's CI for instance?

cgel commented 8 years ago

Tf 0.7.0 released! Looking forward to the updated benchmarks.

MikalaiDrabovich commented 8 years ago

:+1:

ronghanghu commented 8 years ago

Great results :+1: :+1: :+1:

Looking forward to the results with cuDNN v4

Madder commented 8 years ago

+1

On Tue, Feb 23, 2016 at 10:29 PM, Ronghang Hu notifications@github.com wrote:

Great results :+1: :+1: :+1:

Looking forward to the results with cuDNN v4

soumith commented 8 years ago

As requested, TF 0.7 + CuDNN R4 has been benchmarked. CuDNN R4 + Torch has also been benchmarked as a baseline.

Within the span of Nervana's Neon, Torch + CuDNN4, TensorFlow + CuDNN4 (and Caffe + CuDNN is likely in the same ballpark as torch), TensorFlow (at commit https://github.com/tensorflow/tensorflow/commit/1d4f00da15a886916cd7a62ddf119b0b460c850c ) still lags behind the others by 2x to 3x in performance on Alexnet, VGG and Googlenet. It is within 1.5x on Overfeat.

soumith commented 8 years ago

For full details, see the main README.md: https://github.com/soumith/convnet-benchmarks/blob/master/README.md and the raw logs are located here: https://github.com/soumith/convnet-benchmarks/commit/2888b23959190cefeee59cdd5e15f66a74031f8f
