This is a GPU-accelerated implementation of the GEMM matrix multiply function for the Raspberry Pi.
The core is an assembler loop for Broadcom's QPU, and is run as a custom program on their GPU. It produces a substantial speedup compared to an optimized CPU version: the included test runs in 500 ms on my overclocked Pi, rather than 8,000 ms using the official ATLAS library on Raspbian on the same device, which is about 16x faster.
Download the repo, run 'sudo apt-get install libatlas-dev m4', run 'make', and then run 'sudo ./gemm'.
It always overwrites the output 'C' matrix, rather than scaling its existing contents by 'beta' and adding them to the result as standard GEMM does.
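For reference, standard GEMM computes C = alpha * A * B + beta * C. The sketch below is not code from this repo. It is a plain C illustration of the difference between that definition and an overwrite-only variant. The function names gemm_reference and gemm_overwrite are hypothetical, and it assumes row-major storage, no transposes, and that 'alpha' is still applied in the overwrite case.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook GEMM: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all row-major (an assumption
 * made for this sketch, not necessarily the layout the library uses). */
void gemm_reference(size_t m, size_t n, size_t k,
                    float alpha, const float *a, const float *b,
                    float beta, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            /* The old contents of C contribute, scaled by beta. */
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}

/* Overwrite-only variant: the old contents of C are discarded,
 * so there is no beta parameter at all. */
void gemm_overwrite(size_t m, size_t n, size_t k,
                    float alpha, const float *a, const float *b,
                    float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = alpha * sum;
        }
    }
}
```

For finite inputs, gemm_overwrite gives the same result as gemm_reference called with beta set to 0. If you need the accumulate behaviour, one workaround is to write the product into a scratch matrix and add it to your scaled 'C' on the CPU.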
You have to run the program as root (for example with 'sudo'), so that the library can get direct access to the GPU.
All code is under the BSD three-clause license, included in this folder as LICENSE.
Written by Pete Warden at Jetpac Inc.
Thanks to eman on the Pi forums for the SHA-256 examples, Andrew Holme for creating the Fourier library, Herman Hermitage for his QPU documentation work, and Broadcom for releasing the hardware specifications of their GPU!