Benchmark TensorFlow - Githubissues

soumith commented 8 years ago

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the Imagenet Winners. When I saw issues with the numbers, memory etc., I emailed @Yangqing to confirm what I'm seeing, and that it is expected.

With that disclaimer out of the way, here's some things that you should know about TensorFlow (as of the pip version that I installed today):

in-place ReLU seems non-existent in practice.
- Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
Supports CuDNN R2. No R3 support yet, Yangqing says the next version they are going to support is likely R4.

Coming to the benchmarks:

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)
VGG with batchsize 64 goes Out of Memory (Edit: VGG memory issue was solved by using the BFC allocator updated by GOOG). ~~The largest batch-size I could fit is 32 (tried 32, 64).~~
I've also computed Torch7+CuDNN-R2 baselines for these batch-sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	96	32	64
Nervana (Neon)	101	32	69
CuDNN-R2 (Torch)	231	70	161
TensorFlow	326	96	230

Overfeat [fast] - Input 128x3x231x231

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R3 (Torch)	326	113	213
fbfft (Torch)	342	114	227
CuDNN-R2 (Torch)	810	234	576
TensorFlow	1084	316	768

OxfordNet [Model-A] - Input 64x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
Nervana	590	180	410
CuDNN-R3 (Torch)	615	196	418
CuDNN-R2 (Torch)	1099	342	757
TensorFlow	1840	545	1295

GoogleNet V1 - Input 16x3x224x224

Library	Time (ms)	forward (ms)	backward (ms)
CuDNN-R2 (Torch)	564	174	390
TensorFlow	590	54	536

Note that at batch size of 16, googlenet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison, but not practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

soumith commented 8 years ago

i have not changed the benchmark scripts in any way, so if the TF benchmark scripts need any change (such as new allocator settings etc.), I welcome the TF folks to let me know.

rajatmonga commented 8 years ago

Thanks Soumith@, this isn't quite where we had seen our numbers at, but we will look at the tests again and ping you if we notice something.

Thanks again for running these benchmarks!

On Sun, Feb 28, 2016, 4:32 PM Soumith Chintala notifications@github.com wrote:

i have not changed the benchmark scripts in any way, so if the TF benchmark scripts need any change (such as new allocator settings etc.), I welcome the TF folks to let me know.

— Reply to this email directly or view it on GitHub https://github.com/soumith/convnet-benchmarks/issues/66#issuecomment-189980365 .

soumith commented 8 years ago

Thanks Rajat, happy to investigate further. I built TF from source, and configured it with CUDA 7.5 + CuDNN-4, if that helps. The commit is https://github.com/tensorflow/tensorflow/commit/1d4f00da15a886916cd7a62ddf119b0b460c850c

nryant commented 8 years ago

I've had similar numbers using CUDA 7.0, cuDNN v4, and https://github.com/tensorflow/tensorflow/commit/b88971051fbc49fa1e0b91ec1b0b60defa11697e on a Titan X. Tried fiddling with device placement and the session config, but it made no material difference in the results. @rajatmonga, out of curiosity are you using cuDNN and nvcc internally, or gpucc?

soumith commented 8 years ago

@nryant Thanks for the additional data point. I am honestly very nervous whenever I have to deliver any negative news on convnet-benchmarks. fwiw, @spezzer on reddit also confirmed that it was a data layout thing as well https://www.reddit.com/r/MachineLearning/comments/487fmo/convnetbenchmarks_updated_with_numbers_for/d0i7ord . I'm closing this issue now, as we have benchmarked tensorflow across multiple versions and given it enough time and data. Will of course keep updating it over time as appropriate. Thanks all.

vrv commented 8 years ago

@soumith: I think in this case it's a combination of layout and some Eigen improvements that hadn't made its way upstream -- we're looking at both of these actively. Thanks again for your effort -- we'll let you know when it makes sense to update the numbers (and provide our own for comparison).

thinxer commented 8 years ago

A recent commit adds NCHW support for BiasAdd, which results in about 40% speed up.

https://github.com/tensorflow/tensorflow/commit/d6f3ebfdfc1d5b5df1f6ae73466abe2ec5721b5b

vrv commented 8 years ago

@thinxer: we'll let @soumith know when to update the numbers, but thanks for noticing :)

soumith commented 8 years ago

That's really cool, thanks for letting me know. I'm doing a new, complete set of benchmarks for deep learning, not just convnets, will cover this commit in them

rajatmonga commented 8 years ago

Thanks @soumith! No rush though.

We have most of the pieces together to support NCHW and expect to see more gains once we update the models to use that. Will ping you once that is ready as well. This commit helps quite a bit (was another regression on our part). Of course the layout changes will mostly help convnets and not other kinds of models.

On Sat, Mar 5, 2016 at 9:35 PM Soumith Chintala notifications@github.com wrote:

That's really cool, thanks for letting me know. I'm doing a new, complete set of benchmarks for deep learning, not just convnets, will cover this commit in them

— Reply to this email directly or view it on GitHub https://github.com/soumith/convnet-benchmarks/issues/66#issuecomment-192809431 .

shendiaomo commented 8 years ago

How about tensorflow 0.7？

ghost commented 8 years ago

Thanks for the benchmark @soumith . Looking forward for new updated TensorFlow.

soumith / convnet-benchmarks

Benchmark TensorFlow #66