Torch's MM-based convolutions on CPU use a lot more memory, and the shapes are probably not as well optimized for OpenBLAS-ARM: Torch unfolds all mini-batches and does a single MM call, rather than the per-batch unfold + gemm that Caffe does (a sketch below illustrates the difference). I'd suggest trying out: https://github.com/mvitez/OpenBLAS-conv and https://github.com/mvitez/thnets
This very old fork of Torch also has optimized NEON assembly convolutions, but only for 32-bit ARM: https://github.com/soumith/torch-android/commit/af6dc1ed85eb9a37b0bd96b89cdc27bf68990176
That being said, 20x seems hugely suspect, as both are ultimately calling gemm.
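A quick way to probe the unfold-buffer difference described above on-device (a sketch, not code from this thread: the conv shape is AlexNet's first layer, and the batch size and thread count are arbitrary choices):

```lua
-- Sketch: time one forward pass over the whole mini-batch against
-- per-sample forwards, to probe the unfold-buffer behaviour described above.
require 'nn'
torch.setnumthreads(4)

local conv  = nn.SpatialConvolution(3, 96, 11, 11, 4, 4)  -- AlexNet conv1 shape
local batch = torch.randn(16, 3, 227, 227)                -- arbitrary batch of 16

local timer = torch.Timer()
conv:forward(batch)                       -- whole batch in one call
print(('whole batch: %.3f s'):format(timer:time().real))

timer:reset()
for i = 1, batch:size(1) do
  conv:forward(batch[i])                  -- one image at a time
end
print(('per sample:  %.3f s'):format(timer:time().real))
```

Watching memory (e.g. with tegrastats) while each half runs should also show the difference in unfold-buffer size, if the claim holds.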
@soumith Thanks for the pointers! I'll try out the code you've linked and post back here. If the gap really comes down to unfolding all batches at once vs. per-batch unfolds (as in Caffe), that should make a huge difference. Thanks!
@soumith I've validated AlexNet using thnets (https://github.com/mvitez/thnets), and the TX1's ARM A57 CPU is now within 18% of the Caffe implementation (4.54 FPS on thnets vs. 5.3 FPS on Caffe) at batch_size = 4. I attribute the speedup to assembly-level intrinsics plus highly optimized OpenBLAS kernels for the ARM platform.
I couldn't verify your claim that Torch unfolds all batches and performs a single MM call (and that Caffe unfolds per batch and performs multiple MM calls). Running ~/tegrastats to monitor memory usage, Torch (in my original Torch benchmark) actually appears to use less memory than Caffe.
Anyways, you solved my problem :) thanks man.
Hi, these numbers seem very bad... In case it helps, I am developing my own toolkit for academic purposes:
https://github.com/RParedesPalacios/Layers
(I still have to upload the source code)
AlexNet with batch = 100, forward only (inference), takes approximately 2 seconds. I use lowering with the whole batch unfolded. I will try to upload the code so it can be tried out.
regards
@RParedesPalacios, are we talking about inference on the TX1's ARM CPU? If so, isn't 50 images per second unreasonable?
Say a single forward pass for one image takes about 720 MFLOP [1]. Then 50 images per second translates roughly to 36 GFLOPS (720 MFLOP * 50 images/s). I suspect the peak performance of the TX1's ARM CPUs cannot surpass 10 GFLOPS even with NEON and multithreading enabled.
[1] https://groups.google.com/forum/#!topic/caffe-users/cUD3IF5NMOk
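A sketch of that arithmetic, for reference (both inputs come from the preceding paragraph: the 720 MFLOP/image figure from [1], and the ~10 GFLOPS peak is the poster's rough guess):

```lua
-- Back-of-envelope check of the claim above: 720 MFLOP/image at 50 images/s.
local mflop_per_image = 720      -- AlexNet forward pass, from [1]
local images_per_sec  = 50       -- a batch of 100 in ~2 s
local required_gflops = mflop_per_image * images_per_sec / 1000
print(required_gflops .. ' GFLOPS')   -- 36, vs. an assumed ~10 GFLOPS CPU peak
```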
@carlodelmundo Hi, I misunderstood the point. I read "... Intel Xeon E5-2637 CPU ..." and thought the following numbers referred to it, but on rereading I see that all the numbers refer to the ARM CPU!
Sorry for that, regards!
Caffe is 20x faster than Torch when benchmarking the ARM Cortex A57 CPU on the NVIDIA Jetson TX1. I performed the same test on an Intel Xeon E5-2637 CPU using Caffe + OpenBLAS (CPU) vs. Torch + OpenBLAS (CPU), and there the difference is fairly small (< 30%).
Does anyone have any tips/tricks to get the Torch CPU code to be on par with Caffe CPU code on the ARM A57?
Lua Benchmark:
Test configuration: model = bvlc_alexnet, batch_size = 100, input size = 3x227x227, iter = 1, threads = 4. Images per second (inference, forward pass only): 0.25 FPS (or 400,000 ms per batch of 100). My Lua benchmark code for the CPU can be downloaded here: http://homes.cs.washington.edu/~cdel/download/benchmark_A57.tgz
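In case that tarball link goes stale, here is a minimal sketch of such a forward-only benchmark (not the original benchmark code; it assumes the loadcaffe package and the stock BVLC AlexNet files, and the file paths are placeholders):

```lua
-- Forward-only AlexNet timing sketch (paths are placeholders).
require 'loadcaffe'
torch.setnumthreads(4)

local model = loadcaffe.load('deploy.prototxt', 'bvlc_alexnet.caffemodel', 'nn')
model:evaluate()

local batch = torch.randn(100, 3, 227, 227)
model:forward(batch)                      -- warm-up pass

local timer = torch.Timer()
model:forward(batch)
local sec = timer:time().real
print(('%.2f FPS (%.0f ms per batch of 100)'):format(100 / sec, sec * 1000))
```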
Caffe Benchmark:
build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --iterations=1
Test configuration for Caffe on the ARM A57 CPU: bvlc_alexnet, batch_size = 100, input_size = 3x227x227, iter = 1, threads = 4. Resulting images per second (inference, forward pass only): 5.2 FPS (or 19,036 ms per batch of 100).
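(Note: `caffe time` reads the batch size from the input dimensions in deploy.prototxt rather than from the command line, so batch_size = 100 above presumably means the prototxt's input shape was edited to dims 100, 3, 227, 227; the flags shown only select the model and the iteration count.)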