google / gemma.cpp

Lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0

[Suggestions] Low effort OpenMP, OpenACC, CBLAS compatible CPU & GPU acceleration + other improvements #28

Closed · trholding closed this 2 months ago

trholding commented 7 months ago

Acceleration

It may be possible to add support for multiple rudimentary acceleration methods to this project without much effort.

Please refer to run.c and its Makefile in my fork of llama2.c.

CBLAS:

https://github.com/trholding/llama2.c/blob/e8698eb31b26bd2f2922a2b48ef8a4b2fa8ad1a1/run.c#L86

If CBLAS support is implemented, then GPU acceleration via OpenCL through the CLBlast library is just a drop-in.
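
As a minimal sketch of the kind of switch run.c uses (names here are illustrative, not from gemma.cpp): building with `-DCBLAS` routes the matvec through a single `cblas_sgemv` call, and swapping the linked library (e.g. OpenBLAS, or CLBlast's CBLAS-compatible layer for OpenCL) changes the backend without touching the code.

```c
/* Sketch only, not gemma.cpp code: matvec out = W * x, where W is a
 * (d x n) row-major matrix and x has n elements. */
#ifdef CBLAS
#include <cblas.h>
#endif

static void matmul(float* out, const float* x, const float* w, int n, int d) {
#ifdef CBLAS
  /* One BLAS call; the linked library decides CPU vs. GPU execution. */
  cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1,
              0.0f, out, 1);
#else
  /* Portable scalar fallback. */
  for (int i = 0; i < d; i++) {
    float val = 0.0f;
    for (int j = 0; j < n; j++) {
      val += w[i * n + j] * x[j];
    }
    out[i] = val;
  }
#endif
}
```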

OpenMP & OpenACC:

https://github.com/trholding/llama2.c/blob/e8698eb31b26bd2f2922a2b48ef8a4b2fa8ad1a1/run.c#L110

Note: only the hot loops need to be annotated for parallelism; a sketch follows.
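
For example (pragma placement is illustrative; the loop mirrors the matvec above):

```c
/* Sketch only: the hot matvec loop annotated for OpenMP. Compiled
 * without -fopenmp, the pragma is ignored and the loop runs serially.
 * An OpenACC build would put "#pragma acc parallel loop" on the same
 * loop instead. */
static void matmul_parallel(float* out, const float* x, const float* w,
                            int n, int d) {
  int i;
  /* Each iteration writes one independent output row, so rows
   * parallelize cleanly across threads. */
  #pragma omp parallel for private(i)
  for (i = 0; i < d; i++) {
    float val = 0.0f;
    for (int j = 0; j < n; j++) {
      val += w[i * n + j] * x[j];
    }
    out[i] = val;
  }
}
```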

Other improvements:

Mozilla's llamafile-like usability:

We invented the concept well before Mozilla did: we implemented embedded models and multi-OS binaries.

https://github.com/trholding/llama2.c/blob/e8698eb31b26bd2f2922a2b48ef8a4b2fa8ad1a1/run.c#L35

To make "build once, run on any OS" multi-OS binaries, build with the Cosmopolitan libc toolchain. Refer to the Makefile, but follow the Cosmo docs, as we use an older version.

I hope to build a Gemma 2 Everywhere OS demo similar to L2E OS. Is that naming by any chance disallowed by Google/Gemma copyrights?

austinvhuang commented 7 months ago

Hi @trholding, thanks for the suggestions! I'm looking into options for accelerator support, ideally while still keeping things simple + not expanding the dependency footprint too much - thanks for the example + CBLAS pointer.

OpenMP - we're using a custom threadpool for parallelism, so I'm not sure OpenMP will add much, but if there are benchmarks / places where it helps, we can consider it.

llamafile - I liked the idea of llamafile when I heard about it. We haven't been hands-on with trying it, and at least for the next week or two we're going to have our hands full with patches. If you or anyone else wants to take a first stab at an integration, let us know.

OS demo - this sort of thing was in the back of my mind as well; I'll be very interested to see what comes of it.

IANAL regarding project naming, but you might bring it up with the devrel folks at the Google dev community under the #gemma channel: https://discord.com/invite/google-dev-community (we might start our own community space for this project in the near future for technical discussions, but the devrel folks there are better for this sort of question).

jan-wassenberg commented 2 months ago

Regarding CBLAS: we have a pretty decent matmul now. There's an inherent advantage to DIY because it allows us to fuse our custom decompression with the matmul.
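
To illustrate the fusion point (a hypothetical int8 example; gemma.cpp's actual compression scheme and kernel differ): the weight decode happens in registers inside the dot product, so a full f32 copy of the weight matrix never has to be materialized, which a generic BLAS call would require.

```c
#include <stdint.h>

/* Hypothetical sketch of fused decompression + matmul: int8 weights with
 * one dequantization scale per output row. Not gemma.cpp's kernel. */
static void matmul_q8(float* out, const float* x, const int8_t* w,
                      const float* row_scale, int n, int d) {
  for (int i = 0; i < d; i++) {
    float val = 0.0f;
    for (int j = 0; j < n; j++) {
      /* Decode on the fly: int8 -> float inside the inner loop, so no
       * separate decompressed weight buffer is needed. */
      val += (float)w[i * n + j] * x[j];
    }
    out[i] = val * row_scale[i];  /* apply the per-row scale once */
  }
}
```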

Cool that you also support embedded models. One important usability improvement that we hope to get to soon is moving closer to an "everything in one data file" approach. I would be happy to discuss that with anyone interested; if so, please raise a separate issue.