google / gemmlowp

Low-precision matrix multiplication
Apache License 2.0

SIMD back-end for IBM Power and Z #195

Open geert56 opened 4 years ago

geert56 commented 4 years ago

This is not really an issue. I'd like to know if there is an interest in incorporating code to support IBM's Power and Z architectures as a back-end. A colleague and I have worked on this in-house, and we have extensions ready that let gemmlowp run optimized on Power and Z, selected by compiler flags when those architectures are detected. In principle this does not touch or disrupt any of the existing code. Please comment on this issue and provide advice as to how best to proceed.
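For context, here is a minimal sketch of how such compile-time detection could look, modeled on the way gemmlowp already gates its NEON and SSE back-ends; the macro names below are illustrative assumptions, not the actual patch:

```cpp
// Illustrative only: the GEMMLOWP_* macro names are assumptions, not the patch.
#if defined(__VSX__) || defined(__ALTIVEC__)
// IBM Power with the VSX/AltiVec vector unit (e.g. -mcpu=power8 or later).
#define GEMMLOWP_POWER_VSX
#elif defined(__s390x__) && defined(__VEC__)
// IBM Z with the vector facility enabled (e.g. -march=z13 -mzvector).
#define GEMMLOWP_ZVECTOR
#endif
```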

bjacob commented 4 years ago

It'd be interesting to hear whether there is interest from other users; feel free to use the gemmlowp Google group to reach more people. As far as Google is concerned, we are in the process of migrating to a successor of gemmlowp, named ruy. It currently lives as a subdirectory of TensorFlow, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/ruy, but is set to move to its own GitHub repository in the near future. It's not yet ready for wider contribution since documentation is missing, but consider this a heads-up.

geert56 commented 4 years ago

Thank you @bjacob for the reply. Our interest is mainly driven by the use of gemmlowp for quantized models in TensorFlow Lite: we want fast inference on Power and Z CPUs. It was an interesting exercise to code up matrix multiplication in Power and Z vector intrinsics, and it also helps us evaluate our SIMD instruction sets. We would like to share our work with the larger community, and upstreaming it into the mainline gemmlowp source would simplify a dedicated TF Lite build for us. Is Google's roadmap to drop gemmlowp and adopt "ruy" for TF Lite? If so, we might want to have a look at it and consider porting our work there. I am curious to hear more from others.
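As an illustration of the kind of kernel involved (not the actual contribution), here is a rough sketch of a uint8 dot-product inner loop written with Power AltiVec/VSX intrinsics; the function name and the assumption that the depth is a multiple of 16 are mine:

```cpp
#include <altivec.h>
#include <cstdint>

// Rough sketch, assuming `depth` is a multiple of 16 and unaligned loads are
// acceptable. vec_msum multiplies 16 u8*u8 pairs and accumulates groups of
// four products into the four u32 lanes of the accumulator, which is the
// basic building block of a gemmlowp-style quantized (uint8) GEMM inner loop.
std::uint32_t DotProductU8(const std::uint8_t* lhs, const std::uint8_t* rhs,
                           int depth) {
  vector unsigned int acc = vec_splats(0u);
  for (int d = 0; d < depth; d += 16) {
    vector unsigned char a = vec_xl(d, lhs);
    vector unsigned char b = vec_xl(d, rhs);
    acc = vec_msum(a, b, acc);
  }
  // Horizontal sum of the four accumulator lanes.
  return acc[0] + acc[1] + acc[2] + acc[3];
}
```

A real kernel would unroll this over several accumulator registers and feed the results into gemmlowp's offset/requantization pipeline, but the vec_msum step captures the core of the SIMD work.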

bjacob commented 4 years ago

TFLite has already switched to ruy on arm64, and there is work underway on arm32 and x86. However, given the complex landscape of inference back-ends at the moment, it's hard to guess what TFLite will end up using. Over the next few months these things should settle a bit.