google / ruy

Apache License 2.0

Performance benchmarks #195

Closed dev0x13 closed 4 years ago

dev0x13 commented 4 years ago

Are there any reliable benchmark results comparing ruy with other GEMM libraries such as gemmlowp and Eigen? I am really interested in this (in the context of TensorFlow Lite performance), but the only piece of information I have found so far is a post on the TensorFlow blog mentioning that TF Lite with ruy enabled outperforms regular TF Lite when inferring on a single CPU core (see the "Better CPU performance" section).

bjacob commented 4 years ago

Sorry I didn't see this earlier. Here are benchmarks comparing to gemmlowp and Eigen (see the different sheet tabs). https://docs.google.com/spreadsheets/d/1CB4gsI7pujNRAf5Iz5vuD783QQqO2zOu8up9IpTKdlU/edit#gid=692328710

dev0x13 commented 4 years ago

Thank you, this is very nice of you! These benchmarks look very promising. The reason I asked for them is that I tried to speed up TF Lite (I believe it was version 1.15.0) by enabling ruy at compile time, but for some reason it had almost no effect. Well, now I am going to make one more attempt at this.

bjacob commented 4 years ago

Ruy is the default in TFLite on ARM64 and ARM32. You can still disable it with --define=tflite_with_ruy=false to compare.
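For reference, that flag is passed to Bazel at build time. A typical invocation might look like the following sketch (only `--define=tflite_with_ruy=false` comes from the comment above; the target and `--config` choice are illustrative and depend on your setup):

```shell
# Build the TFLite benchmark tool with ruy disabled, to compare
# against the default (ruy-enabled) build on ARM64 Android.
bazel build -c opt --config=android_arm64 \
  --define=tflite_with_ruy=false \
  //tensorflow/lite/tools/benchmark:benchmark_model
```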

dev0x13 commented 4 years ago

Thank you for the clarification, but I build static TF Lite using the Makefile instead of the Bazel build configs. It seems that because of this, ruy is disabled by default except in the generic aarch64 build (which covers neither Android nor iOS): https://github.com/tensorflow/tensorflow/blob/47afb26c2d28991eeebea9c404da499c2a9c8148/tensorflow/lite/tools/make/Makefile#L186
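The guard in question is roughly of this shape (a paraphrased sketch, not the literal Makefile contents; the variable names here are illustrative, so see the link above for the exact lines):

```makefile
# Sketch: ruy is only wired in for the generic aarch64 target,
# so Android/iOS Makefile builds silently fall back to non-ruy kernels.
ifeq ($(TARGET_ARCH),aarch64)
  CXXFLAGS += -DTFLITE_WITH_RUY
endif
```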

bjacob commented 4 years ago

I see. I'm unfamiliar with the Makefile build of TFLite, but from your description it seems like it's not closely tracking the evolution of TFLite. If you need help with that, please consider filing an issue against TFLite, and let me know if you need help finding relevant TFLite folks to look at it. Alternatively, consider using Bazel; there is documentation about that here: https://www.tensorflow.org/lite/guide/build_android

dev0x13 commented 4 years ago

I don't need help with that, but thank you for caring anyway!

georgthegreat commented 1 year ago

Hi.

The performance comparison above was carried out almost three years ago. A lot has changed in ruy, tflite, and eigen since then.

Is it possible to publish the benchmarking code somewhere, so that new performance comparisons can be carried out?

bjacob commented 1 year ago

Not much has changed in ruy in the last 3 years. The core ruy authors have moved to other teams within Google, and most of the momentum in TFLite CPU inference has been in a move towards https://github.com/google/XNNPACK as the execution engine. @silvasean and I have both joined https://github.com/openxla/iree . Anyway, that's why no one has bothered to rerun the ruy benchmarks.

georgthegreat commented 1 year ago

TFLite CPU inference still supports ruy (as of 2.12.0), and disabling it adds a considerable amount of code bloat to the target binary being built.

We will carry out some benchmarks internally, though. I assume I should address further questions to the TFLite authors then.

bjacob commented 1 year ago

Right, best to discuss with TFLite. No one is saying that TFLite's internal kernels will drop ruy and go back to what they used before (Eigen and gemmlowp). What I'm saying is that TFLite is relying less and less on those internal kernels and delegating more and more to XNNPACK.