soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets

CPU Convnet Benchmarks: Caffe vs. Torch Discrepancies (20x) on Jetson TX1 A57 CPU #104

Closed ghost closed 8 years ago

ghost commented 8 years ago

Caffe is 20x faster than Torch when benchmarking on the ARM Cortex-A57 CPU of the NVIDIA Jetson TX1. I performed the same test on an Intel Xeon E5-2637 CPU using Caffe + OpenBLAS (CPU) vs. Torch + OpenBLAS (CPU), and there the difference is fairly small (< 30%).

Does anyone have any tips/tricks to get the Torch CPU code to be on par with Caffe CPU code on the ARM A57?

Lua Benchmark:

  1. Imports Caffe's bvlc_alexnet model into an nn model in Lua using loadcaffe (https://github.com/szagoruyko/loadcaffe).
  2. Torch is installed using the standard installation method shown here: http://torch.ch/docs/getting-started.html. OpenBLAS is detected, and I verify that 4 threads are in use by counting the luajit threads spawned whenever I call the benchmark.
  3. 'th benchmark.lua' loads the AlexNet model and times how long model:forward(inputs) takes for random inputs.

Test configuration is as follows: model = bvlc_alexnet, batch_size = 100, input_size = 3x227x227, iter = 1, threads = 4. Resulting throughput (inference, forward pass only): 0.25 FPS (or 400,000 ms per batch of 100). My Lua benchmark code for the CPU can be downloaded here: http://homes.cs.washington.edu/~cdel/download/benchmark_A57.tgz
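For reference, here is a minimal sketch of what the benchmark does (illustrative, not the exact script from the tarball; the file names and the timing code are my assumptions, and it presumes loadcaffe plus the bvlc_alexnet prototxt/caffemodel are on disk):

```lua
-- Minimal sketch of the Torch-side benchmark (illustrative, not the exact script).
require 'nn'
local loadcaffe = require 'loadcaffe'

torch.setnumthreads(4)  -- match the 4-thread test configuration

-- File names are placeholders; point these at the bvlc_alexnet files.
local model = loadcaffe.load('deploy.prototxt', 'bvlc_alexnet.caffemodel', 'nn')
model:evaluate()
model:float()

local batch = 100
local inputs = torch.randn(batch, 3, 227, 227):float()

local timer = torch.Timer()
model:forward(inputs)
local elapsed = timer:time().real  -- seconds for one forward pass
print(('batch of %d: %.0f ms (%.2f images/s)'):format(batch, elapsed * 1000, batch / elapsed))
```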

Caffe Benchmark:

  1. I build Caffe with OpenBLAS and set OPENBLAS_NUM_THREADS=4. The configure output shows that OpenBLAS and NEON vector instructions are enabled on the ARM A57.
  2. I run build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --iterations=1

Test configuration for Caffe on the ARM A57 CPU: bvlc_alexnet, batch_size = 100, input_size = 3x227x227, iter = 1, threads = 4. Resulting throughput (inference, forward pass only): 5.2 FPS (or 19,036 ms per batch of 100).

soumith commented 8 years ago

Torch's MM-based convolutions on CPU use a lot more memory, and the resulting matrix shapes are probably not as well optimized for OpenBLAS on ARM (Torch unfolds the entire mini-batch and does a single MM call, rather than the per-batch unfold + gemm that Caffe does). I'd suggest trying out: https://github.com/mvitez/OpenBLAS-conv https://github.com/mvitez/thnets
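As a rough illustration of why whole-batch unfolding costs memory, here is back-of-the-envelope arithmetic for AlexNet's conv1 shape (11x11 kernel on 3x227x227 input, 55x55 output, float32); these numbers are my assumptions, not measurements:

```lua
-- im2col buffer size for AlexNet conv1, float32 (illustrative arithmetic only)
local inC, kH, kW   = 3, 11, 11
local outH, outW    = 55, 55
local bytesPerImage = inC * kH * kW * outH * outW * 4
print(('one image:  %.1f MB'):format(bytesPerImage / 2^20))       -- ~4.2 MB
print(('batch=100: %.1f MB'):format(bytesPerImage * 100 / 2^20))  -- ~419 MB
```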

This very old fork of Torch also has hand-optimized NEON assembly convolutions, but only for 32-bit ARM: https://github.com/soumith/torch-android/commit/af6dc1ed85eb9a37b0bd96b89cdc27bf68990176

soumith commented 8 years ago

That being said, 20x seems hugely suspect, as they are both calling gemm.

ghost commented 8 years ago

@soumith Thanks for the pointers! I'll try out the code you've linked and post back here. If the difference really comes down to unfolding the whole batch at once vs. Caffe's per-batch unfolds, that could explain a large gap. Thanks!

ghost commented 8 years ago

@soumith I've validated AlexNet using thnets (https://github.com/mvitez/thnets), and the TX1's ARM A57 CPU is now within 18% of the Caffe implementation (4.54 FPS on thnets vs. 5.3 FPS on Caffe) for batch_size = 4 on thnets. I attribute the speedup to assembly-level intrinsics plus highly optimized OpenBLAS kernels for the ARM platform.

I couldn't verify your claim that Torch unfolds all batches and performs a single MM call (and that Caffe unfolds per batch and performs multiple MM calls). Running ~/tegrastats to monitor memory usage, it appears that Torch (in my original benchmark) actually uses less memory than Caffe.

Anyways, you solved my problem :) thanks man.

RParedesPalacios commented 8 years ago

Hi, these numbers seem very bad... In case it helps, I am developing my own toolkit for academic purposes:

https://github.com/RParedesPalacios/Layers

(I still have to upload the source code.)

AlexNet with batch = 100, forward pass only (inference), takes approximately 2 seconds. I use lowering with the whole batch unfolded. I will try to upload the code so you can try it.

regards

carlodelmundo-zz commented 8 years ago

@RParedesPalacios, are we talking about inference on the TX1's ARM CPU? If so, isn't 50 images per second unreasonable?

Say it's 720 MFLOP to perform a single forward pass for one image [1]. 50 images per second would roughly translate to 36 GFLOPS (720 MFLOP * 50 images/s). I suspect the peak performance of the TX1's ARM CPUs cannot surpass 10 GFLOPS even with NEON and multithreading enabled.
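A one-liner to sanity-check that arithmetic (numbers as stated above, not measured):

```lua
local flopPerImage = 720e6  -- ~720 MFLOP per AlexNet forward pass [1]
print(('%.1f GFLOPS'):format(flopPerImage * 50 / 1e9))  -- 36.0 GFLOPS at 50 images/s
```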

[1] https://groups.google.com/forum/#!topic/caffe-users/cUD3IF5NMOk

RParedesPalacios commented 8 years ago

@carlodelmundo, hi, I misunderstood the point. I read "... Intel Xeon E5-2637 CPU ..." and thought the following numbers referred to that, but on rereading I see that all the numbers refer to the ARM CPU!

Sorry for that, regards!