Closed soumith closed 8 years ago
hjk41 wrote:
+1 for mxnet. Dynamic GPU memory allocation does have a big impact on performance. A simple memory allocator can significantly reduce the overhead. A smarter allocator which reuses blocks with best-fit can almost eliminate the overhead completely.
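To make the best-fit idea concrete, here is a toy sketch, not the allocator of any framework mentioned in this thread. The class name `BestFitPool` and its counters are hypothetical, blocks are represented only by their size, and a real allocator would also split and coalesce blocks.

```python
class BestFitPool:
    """Toy caching allocator: freed blocks are kept and reused best-fit."""

    def __init__(self):
        self.free_sizes = []      # sizes of cached free blocks
        self.device_allocs = 0    # how often we fell through to the "driver"

    def alloc(self, size):
        # Best fit: the smallest cached block that is large enough.
        candidates = [s for s in self.free_sizes if s >= size]
        if candidates:
            best = min(candidates)
            self.free_sizes.remove(best)
            return best
        self.device_allocs += 1   # stands in for an expensive device allocation
        return size

    def free(self, block_size):
        # Cache the block instead of handing it back to the driver.
        self.free_sizes.append(block_size)


pool = BestFitPool()
for _ in range(3):                # three identical "training steps"
    a, b = pool.alloc(4096), pool.alloc(1024)
    pool.free(a)
    pool.free(b)
print(pool.device_allocs)         # 2: only the first step reaches the driver
```

Because every step requests the same sizes, all allocations after the first step are served from the cache, which is the overhead reduction described above.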
Hi mxnet guys, the title of this thread is 'Benchmark TensorFlow' ;-) I think you could create a new issue to request mxnet, using https://github.com/soumith/convnet-benchmarks/issues/new
[edit: looks like Soumith has created an issue for mxnet here: https://github.com/soumith/convnet-benchmarks/issues/68 ]
I was curious about the performance of TensorFlow with CUDA 7.5 and CUDNN 7.0, so I modified the build to use them and rebuilt from source. I then ran @soumith's benchmark scripts for AlexNet and Overfeat. My PC is getting old: Intel Core 2 Quad, 16 GB RAM, NVIDIA Titan X, Ubuntu 15.04 x86_64.
Alexnet: forward/backward 290ms, forward 78ms (1.12x improvement for f/b).
Overfeat: forward/backward 1040ms, forward 264ms (1.04x improvement for f/b).
So not much of a speedup by only swapping the CUDA libraries.
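The improvement factors quoted above are consistent with dividing the TensorFlow 0.5 forward+backward times from the results tables later in this thread (326 ms for AlexNet, 1084 ms for Overfeat) by the new timings. That choice of baseline is an assumption; a short sketch of the arithmetic:

```python
def speedup(baseline_ms, new_ms):
    """Return how many times faster the new timing is than the baseline."""
    return baseline_ms / new_ms


# Forward+backward times in ms: (assumed TensorFlow 0.5 baseline, CUDA 7.5 rebuild).
timings = {"alexnet": (326, 290), "overfeat": (1084, 1040)}
speedups = {name: round(speedup(*pair), 2) for name, pair in timings.items()}
print(speedups)  # {'alexnet': 1.12, 'overfeat': 1.04}
```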
@soumith, one thing I did notice is that your benchmarks for Caffe are quite different from what I got using CUDA 7.5. I think the benchmarks you used are with CUDA 7.0, right? When I ran your "run_imagenet.sh" script on my setup, I got much better results:
Alexnet: forward/backward 171ms, forward 41ms (1.89x improvement for f/b)
Overfeat: forward/backward 601ms, forward 133ms (1.37x improvement)
Googlenet: forward/backward 624ms, forward 174ms (3.1x improvement)
It's not clear to me if CUDA 7.5 is supported for Caffe, but in https://github.com/BVLC/caffe/wiki/Installation, they provide 7.0 and 7.5 Docker images. However, in the main installation instructions they say to use 7.0
I attached the log files for his benchmarks running on my PC.
Tensorflow:
tensorflow_alexnet.txt tensorflow_overfeat.txt
Caffe:
output_alexnet.txt output_googlenet.txt output_overfeat.txt output_vgg_a.txt
@soumith Thanks for an invaluable community service!
Nice!
@soumith Thank you.
@soumith Thank you for benchmarking!!
@scott-gray Sorry to spam @soumith's TF discussions, but when I last played with integer division via magic number mul-and-shift on GPU, the performance I got (on K40 though) was about the same as straightforward division by unsigned int32; the compiler seemed to have strength reductions that it performed in this case. However, there was lower register usage, so using this technique in a kernel that actually does other things (like transposition) would probably help. This was at the SASS level on CUDA 6.5? though.
https://github.com/facebook/fbcuda/commit/d5c8b38b4071b0151b27293a67b27c3868a0f948
The compiler does a good job when the constant is known in advance. Was it your case?
Nice! Thank you @soumith
@wickedfoo The advantage of calculating the magic numbers manually is that the divisor is typically parameterized, so the compiler can't compute magic numbers ahead of time. It then falls back to using the floating point rcp operator and doing a bunch of corrections to make up for the potentially shorter mantissa of float (23 vs 32 bits).
To do an integer division and modulus with magic numbers reduces to just this code:
// j = jrst / RST
// rst = jrst % RST
int j = jrst * magic_RST; j >>= shift_RST;
int rst = jrst - j * RST;
If you know all those numbers fit in 16 bits you can use vmad from ptx or sass. That looks like this:
VMAD.U16.U16 j, jrst, magic_RST, RZ;
SHR.U32 j, j, shift_RST;
VMAD.U16.U16 rst, -j, RST, jrst;
Otherwise your multiplications are going to expand out to 3 XMADS each, regardless of the datatype used. It would be nice if the compiler was a little smarter about multiplication by using the minimal number of instructions for the given data types.
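For readers who have not seen how the magic number and shift are obtained, here is a minimal sketch of one standard construction (round-up multiply-and-shift). It is not necessarily the exact scheme used in the kernels above, and the helper names `magic_for` and `divmod_magic` are hypothetical.

```python
def magic_for(divisor, numerator_bits=16):
    """Return (magic, shift) such that (n * magic) >> shift == n // divisor
    for every 0 <= n < 2**numerator_bits."""
    if divisor < 1:
        raise ValueError("divisor must be positive")
    # shift = numerator_bits + ceil(log2(divisor))
    shift = numerator_bits + (divisor - 1).bit_length()
    # magic = ceil(2**shift / divisor)
    magic = -(-(1 << shift) // divisor)
    return magic, shift


def divmod_magic(n, divisor, magic, shift):
    """Quotient via multiply-and-shift, remainder via one multiply-subtract."""
    q = (n * magic) >> shift
    r = n - q * divisor
    return q, r


RST = 27  # hypothetical filter volume, e.g. 3*3*3
magic_RST, shift_RST = magic_for(RST)
print(divmod_magic(1000, RST, magic_RST, shift_RST))  # (37, 1)
```

Python integers never overflow, so this hides the point made above about instruction counts: with this construction the magic value can need one bit more than the numerator, and the product roughly twice as many, so on a GPU the widths have to be chosen to fit the multiply being used.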
For larger values that might require 64 bit math, I use something like this:
MOV magicPQ, param_magic_PQ;
IADD negPQ, RZ, -param_grid_PQ;
ISETP.NE.AND P1, PT, magicPQ, 1, PT;
// m = blkMPQ / PQ
@P1 XMAD div1, blkMPQ, magicPQ, RZ;
@P1 XMAD div2, blkMPQ, magicPQ.H1, RZ;
@P1 XMAD div3, blkMPQ.H1, magicPQ.H1, RZ;
@P1 XMAD.CHI div1, blkMPQ.H1, magicPQ, div1;
@P1 IADD3.RS m, div1, div2, div3;
@P1 SHR.U32 m, m, param_shift_PQ;
@!P1 SHR.U32 m, blkMPQ, param_shift_PQ;
// pq = blkMPQ % PQ
XMAD pq, negPQ, m, blkMPQ;
XMAD.PSL pq, negPQ.H1, m, pq;
Integer division is essential for these multi-dimensional tensors where you can't fit everything in just 3 block coordinates. For more advanced uses, you can leverage it to pack all your coordinates into a single blockIdx.x value, then completely remap the order in which the indexes are scheduled. I'm able to achieve 95% L2 hit rates using this in my Winograd kernels. This is essential for good performance as the small 32x32 batched gemm tile is pretty high bandwidth.
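As an illustration of packing several coordinates into one block index and recovering them with division and modulus, here is a small sketch. The sizes `P` and `Q` and the function names are hypothetical, and the remapping of the scheduling order mentioned above is not shown.

```python
def pack_index(m, p, q, P, Q):
    """Flatten (m, p, q) into one integer, with q varying fastest."""
    return (m * P + p) * Q + q


def unpack_index(blk, P, Q):
    """Recover (m, p, q) from the flat index with two divide/modulo steps."""
    m, pq = divmod(blk, P * Q)  # m = blk / PQ, pq = blk % PQ
    p, q = divmod(pq, Q)        # p = pq / Q,   q = pq % Q
    return m, p, q


P, Q = 5, 7  # hypothetical output height and width
blk = pack_index(3, 4, 6, P, Q)
print(blk, unpack_index(blk, P, Q))  # 139 (3, 4, 6)
```

In a kernel, each `divmod` here is where a magic-number multiply and shift would replace the hardware division.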
Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.
@scott-gray TensorFlow already uses fast integer division using the code in http://github.com/tensorflow/tensorflow/blob/master/third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorIntDiv.h. One of the issues is that a lot of the TensorFlow kernels use 64 bit integers to index tensors, which ends up slowing things down on GPU. This is being fixed.
karenyyng wrote:
Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.
Ideally, they are all learning the exact same model, so the outputs should be identical (to within the bounds of rounding accuracy). A correctness check is not a bad idea though.
@vrv thanks a lot, trying out the BFC allocator now for vgg and googlenet models.
@ces-bertino the numbers I have with Caffe are with Caffe's native kernels (that's why the entry is marked as "Caffe (native)"). I presume you have CuDNN, and hence have the speedups. To compare your entries, look at the entry marked as CuDNN.
@vrv updated the table for VGG. Googlenet still goes OOM at batch size 128, but if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.
@Yangqing, @benoitsteiner do you have a sense for how performance for these benchmarks depends on nvcc vs gpucc? Are the 10%-50% numbers in http://llvm.org/devmtg/2015-10/slides/Wu-OptimizingLLVMforGPGPU.pdf for ic1/ic2 applicable here?
@ajtulloch gpucc does hide these latency issues vs nvcc as it seems to do a much better job at optimization. Using gpucc brings TensorFlow pretty close to the cuDNN[R2] numbers for AlexNet. We are working on bridging that gap for nvcc by addressing a number of specific issues that @benoitsteiner and @Yangqing mentioned earlier.
@ajtulloch The 2 main reasons why gpucc can generate faster code than nvcc are:
@rajatmonga, @benoitsteiner that makes sense, thanks for that.
@benoitsteiner I'm curious how you guys are using integer division in your implementation. The only places I find a need to use it are in custom kernels where I'm unpacking multiple coordinates from a compound index.
On a related note, I should mention that I also have another simple technique I developed for when you don't know ahead of time the value of the divisor. It looks something like this:
// rcpRST = 1 / RST
I2F.F32.S32 rcpRST, RST;
MUFU.RCP rcpRST, rcpRST;
// c = crst / RST
I2F.F32.S32 crst, crst;
FMUL c, crst, rcpRST;
FFMA c, c, 5.9604644775390625e-08, c;
F2I.S32.F32.TRUNC c, c;
// rst = crst % RST
VMAD.U16.U16 rst, -c, RST, crst;
For most values the floating point reciprocal gets you the correct value. It's just when the numerator and denominator are very close that you need to correct for the missing precision in float32. This is a lot less code than the compiler would generate and is accurate for the range of values I need it for.
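The sequence above can be simulated on the host to see what the correction step does. This is only a sketch: it rounds every step to float32 with `struct`, and it uses an exactly rounded reciprocal where the hardware `MUFU.RCP` is an approximation, so it does not prove anything about the real instruction. The function names are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def div_by_reciprocal(n, d):
    """Compute (n // d, n % d) with float32 steps mirroring the snippet above."""
    rcp = f32(1.0 / d)              # stand-in for I2F + MUFU.RCP
    c = f32(n * rcp)                # FMUL
    c = f32(c * 2.0 ** -24 + c)     # FFMA: nudge up by one part in 2**24
    q = int(c)                      # F2I with truncation
    r = n - q * d                   # remainder via multiply-subtract
    return q, r


print(div_by_reciprocal(999, 27))   # (37, 0)
```

The constant 5.9604644775390625e-08 in the original is 2**-24. Without that nudge, an exact multiple such as 999 = 37 * 27 can land just below 37 after rounding and truncate to 36, which is the "numerator and denominator very close" case described above.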
are in custom kernels where I'm unpacking multiple coordinates from a compound index
Somewhat related: a question that I've been wondering about, and never quite got around to measuring: if we have two 8-bit integers, is it faster to store them in separate registers/variables, or to pack them into one register/variable using bit-shifting? (Edit: I suppose this is a bit vague really, since it entirely depends on how they're being used ... but I guess the trade-off I'm thinking about is: packing multiple values into a few registers will reduce register pressure, but maybe the increase in processing time from all the bit-shifting offsets any benefit?) (Edit2: I suppose what I mean actually is, are there any best-practices/guidelines as far as this goes?)
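For concreteness, the packing in question looks roughly like the sketch below. It is host-side Python with hypothetical helper names, meant only to show the shift-and-mask work that each access to a packed value costs.

```python
def pack_u8_pair(lo, hi):
    """Store two values in 0..255 in the low 16 bits of one integer."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)


def unpack_u8_pair(word):
    """Return (lo, hi) from a word produced by pack_u8_pair."""
    return word & 0xFF, (word >> 8) & 0xFF


word = pack_u8_pair(0x12, 0xAB)
print(hex(word), unpack_u8_pair(word))  # 0xab12 (18, 171)
```

The trade-off asked about is one register saved against an extra shift and mask on every read or write of a packed value.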
If you have enough registers, do not pack the 8-bit numbers and use one register per element. Now, how do we define "enough registers"? Well, if the occupancy you get allows you to have enough warp parallelism (together with enough instruction level parallelism) to cover the latencies, you are good. In general, unless you have a clear use case, do not pack.
That would completely depend on the context in which you are using them. If you're short on register space, packing them might avoid some register spilling. Otherwise it's probably better to keep them separate. I'd also take a look at the video instructions like VMAD, VADD, VABSDIFF, etc. These can operate directly on packed 8 bit values. But in this mode these instructions are unfortunately only half throughput. Maybe this isn't a big deal for your application but, if you wanted to write a super efficient 8 bit gemm core, they're not ideal. These instructions are full throughput with packed 16 bit values and that is very interesting.. at least until Pascal rolls out with native fp16 support (or if you get a hold of an sm_53 X1)
Looks like @jdemouth beat me to it.
Thanks! :-)
@scott-gray We use integer division in order to extract the individual coordinates of a tensor coefficient from its compound index. We often use compound indices for 2 reasons:
@benoitsteiner Ok, that makes sense now. For basic elementwise operations our backend just automatically reshapes all tensors involved in the kernel to the most efficient 2d shape. For broadcast/reduction/take/transpose type operations, it only currently supports those in 2d and requires the user to reshape things prior to performing those ops. This covers 99% of the use cases we've encountered but it sometimes does place a little extra burden on the user. On the other hand it is extremely fast. Sounds like you guys are shooting for much more general ndarray support in which case what you're doing sounds ideal.
Disclaimer: I am totally new to tensorflow and cudnn so I may not know what I am doing but very keen :)
So I built from source then realised that I already had R3 installed; I did what any other sensible person would do and replaced all R2 references with R3 and all seems well as far as running the models included.
@soumith @Yangqing, am I setting myself up for trouble here? One word will suffice :)
@milijan You should be running fine. R3 seems to be binary compatible in the sense that most of the functions in R2 still exist in R3. I think R4 may break such a hack because it will deprecate a few functions.
In case you are wondering, the reason you are not seeing any speedup by going to R3 may be as follows: in Tensorflow we hard code the cudnn algorithm to be NO_WORKSPACE, so some faster convolution paths are not being selected for now. Upcoming changes should further speed things up.
@Yangqing thanks! :+1:
A question with GoogleNet batch size.
Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)
I can use up to 640 images per batch, using the graph from the tensorflow android example: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android
Why can't TensorFlow handle 32 images in this benchmark?
My setup:
Here is the code to load the inception graph:
import tensorflow as tf

INPUT_SIZE = 224
OUTPUT_SIZE = 1024
# Import scope for the graph. The original snippet used `name` without
# defining it, so this value is only a placeholder.
name = 'inception'

# input should be: BS x INPUT_SIZE x INPUT_SIZE x 3 tensor
# output: BS x OUTPUT_SIZE
def inferences(images):
    graph_def = tf.GraphDef()
    # The serialized graph is binary, so read it in 'rb' mode.
    with open('./tensorflow_inception_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    for n in graph_def.node:
        # control device from caller
        n.device = ''
    tf.import_graph_def(graph_def, input_map={'input:0': images}, name=name)
    graph = tf.get_default_graph()
    output = graph.get_tensor_by_name(name + '/avgpool0:0')
    return tf.squeeze(output)
Given the big difference between 640 and 32, there must be something wrong, either in my setup or in this benchmark. Because TensorFlow pre-allocates all GPU memory, I don't know exactly how much memory is consumed.
@soumith @Yangqing Please help!
@raingo: when training we keep the activations for the lower layers to compute the gradients, so a lot of intermediate memory is used during each training step. When doing inference, you only need the activations around to compute the next operation(s), and then they can be freed, so a lot less intermediate state is needed.
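A rough way to see the size of this effect is to compare "keep everything" with "keep only what the next op needs" on a plain chain of layers. The sketch below uses made-up per-image activation sizes, not measurements of GoogleNet, and ignores weights, gradients and convolution workspace; the function name is hypothetical.

```python
def peak_activation_bytes(layer_bytes, training):
    """Rough peak activation memory for a plain chain of layers.

    Training keeps every layer's output for the backward pass; inference
    only needs one layer's input and output alive at the same time.
    """
    if training:
        return sum(layer_bytes)
    return max(a + b for a, b in zip(layer_bytes, layer_bytes[1:]))


# Hypothetical per-image activation sizes in bytes, not measured values.
per_image = [600_000, 1_200_000, 800_000, 400_000, 4_000]
batch = 32
sizes = [batch * b for b in per_image]
print(peak_activation_bytes(sizes, training=True))   # 96128000
print(peak_activation_bytes(sizes, training=False))  # 64000000
```

With only five layers the gap is small; the training figure grows with every added layer while the inference figure is bounded by the largest adjacent pair, which is why a deep network can fit a far larger batch for inference than for training.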
Also, based on the comment in https://github.com/soumith/convnet-benchmarks/issues/66#issuecomment-156357178, it sounds like GoogleNet training with TF might now work for up to batch 64, but not batch 128. (I'd be surprised if batch 32 doesn't work at HEAD, for sure.)
@vrv Got it. Thanks!
@soumith They made some changes to TensorFlow ("TensorFlow: Improve performance of Alexnet"). Can you update the benchmark for Alexnet?
@soumith
Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128) [...] if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.
For comparison, here are my measurements of approximate peak memory usage with Torch/cuDNNv3 on Titan-X:
AlexNet (128): 3 GB
OverFeat (128): 5 GB
VGG Model-A (128): OOM
GoogLeNet (128): 9 GB
VGG Model-A-11 (64): 8 GB
VGG Model-B-13 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-D-16 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-E-19 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-A-11 (96): 11 GB
@soumith Since its release I've seen pretty dramatic improvements in tensorflow's memory management and performance. I think it may be time to benchmark 0.6.0.
@alexatknit will do. i will take some time one of these days to do MXNet, Chainer and TF 0.6. Have been a bit busy lately with wrapping up research.
I am looking forward to the updated comparison, have you found time to look into it?
TensorFlow Trunk as of 1 hour ago (post 0.6 release) numbers:
AlexNet (One Weird Trick paper) - Input 128x3x224x224
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
CuDNN-R3 (Torch) | 96 | 32 | 64 |
Nervana (Neon) | 101 | 32 | 69 |
CuDNN-R2 (Torch) | 231 | 70 | 161 |
TensorFlow 0.5 | 326 | 96 | 230 |
TensorFlow 0.6+ | 292 | 70 | 222 |
Overfeat [fast] - Input 128x3x231x231
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
CuDNN-R3 (Torch) | 326 | 113 | 213 |
fbfft (Torch) | 342 | 114 | 227 |
CuDNN-R2 (Torch) | 810 | 234 | 576 |
TensorFlow 0.5 | 1084 | 316 | 768 |
TensorFlow 0.6+ | 856 | 204 | 652 |
OxfordNet [Model-A] - Input 64x3x224x224
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
Nervana | 590 | 180 | 410 |
CuDNN-R3 (Torch) | 615 | 196 | 418 |
CuDNN-R2 (Torch) | 1099 | 342 | 757 |
TensorFlow 0.5 | 1840 | 545 | 1295 |
TensorFlow 0.6+ | 1656 | 347 | 1309 |
GoogleNet V1 - Input 128x3x224x224
Library | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|
CuDNN-R3 (Torch) | 431 | 117 | 313 |
TensorFlow 0.5 | OOM | OOM | OOM |
TensorFlow 0.6+ | 1237 | 246 | 991 |
There you go. The new logs are all checked in.
@soumith Thanks for running the numbers again. I know you have been asked to do this a number of times lately and it takes you away from your research. Having these benchmarks has been greatly useful for everyone.
After your run we realized we seem to have regressed in performance since the 0.6.0 release (mostly from our switch over to the public Eigen branch) and over the last few days @zheng-xq and @benoitsteiner along with others have made improvements to get back the performance. When running the benchmarks again at commit d1b8333, we get the following numbers:
Model | Total (ms) | Forward (ms) | Backward (ms) |
---|---|---|---|
AlexNet | 229 | 69 | 160 |
Overfeat [fast] | 839 | 203 | 636 |
OxfordNet | 1216 | 329 | 887 |
GoogleNet V1 - Input 128x3x224x224 | 815 | 234 | 581 |
These results are also in line with what we see at 0.6.0 release.
We are also looking into setting up performance benchmarks with the builds so we don't hit such performance regressions.
Again, Thanks for all your updates.
Does anyone have experience with, or comparisons against, DL4J (http://deeplearning4j.org)?
@rajatmonga just got back from vacay. It's cool that you guys are setting up contbuilds for perf regressions.
However, I don't get the numbers that you seem to be getting with TensorFlow as of yesterday (a27d844e05447e65aa279ae5269a2d75590f46f6). The numbers are slightly better, but not quite the improvement that you are seeing.
Look here for the new numbers: https://github.com/soumith/convnet-benchmarks/commit/1f09e1e3b2841a2f58b83aa1d078a830bca4508f
@soumith Thanks for running the benchmarks again. It is possible there are some memory-related regressions that are hurting performance again. What you have right now is good, let's not worry about this.
We are working on getting cuDNN R4 fully supported and will address the remaining performance issues in that context. We may ping this thread once we have a full release with R4, and it will be worthwhile rerunning the benchmarks then, likely for many of the libraries.
Also, let me know if we can help you with this project in any way - it is very useful to the community, but I am sure it takes a lot of your time as well. Thanks for keeping this going!
Yes, That is in our list of tasks and is quite important to make sure we don't have performance regressions. We haven't been able to get to it yet.
On Thu, Feb 4, 2016 at 9:11 AM Madder notifications@github.com wrote:
Has anyone thought of running these benchmarks periodically as part of tensorflow's CI for instance?
Tf 0.7.0 released! Looking forward to the updated benchmarks.
:+1: :+1:
Great results :+1: :+1: :+1:
Looking forward to the results with cuDNN v4
+1
On Tue, Feb 23, 2016 at 10:29 PM, Ronghang Hu notifications@github.com wrote:
Great results :+1: :+1: :+1:
Looking forward to the results with cuDNN v4
As requested, TF 0.7 + CuDNN R4 has been benchmarked. CuDNN R4 + Torch has also been benchmarked as a baseline.
Within the span of Nervana's Neon, Torch + CuDNN4, TensorFlow + CuDNN4 (and Caffe + CuDNN is likely in the same ballpark as Torch), TensorFlow (at commit https://github.com/tensorflow/tensorflow/commit/1d4f00da15a886916cd7a62ddf119b0b460c850c ) still lags behind the others by 2x to 3x on Alexnet, VGG and Googlenet. It is within 1.5x on Overfeat.
For full details, see the main README.md: https://github.com/soumith/convnet-benchmarks/blob/master/README.md and the raw logs are located here: https://github.com/soumith/convnet-benchmarks/commit/2888b23959190cefeee59cdd5e15f66a74031f8f
Google's TensorFlow benchmarks are here!
I've run the benchmarks on the Imagenet Winners. When I saw issues with the numbers, memory etc., I emailed @Yangqing to confirm what I'm seeing, and that it is expected.
With that disclaimer out of the way, here's some things that you should know about TensorFlow (as of the pip version that I installed today):
Coming to the benchmarks:
The largest batch-size I could fit is 32 (tried 32, 64).
AlexNet (One Weird Trick paper) - Input 128x3x224x224
Overfeat [fast] - Input 128x3x231x231
OxfordNet [Model-A] - Input 64x3x224x224
GoogleNet V1 - Input 16x3x224x224
Note that at batch size of 16, googlenet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison, but not practically very interesting or encouraging.
There you go.
I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.