lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

CLBlast deduplication #361

Open infinity0 opened 3 years ago

infinity0 commented 3 years ago

Hi, I am trying to package this for Debian. (I am the maintainer of the leela-zero Debian package as well.) I notice that both this project and Leela Zero use the CLBlast kernels (cpp/external/clblast/*.opencl, copied from here). In Debian we generally have a policy of not keeping duplicate libraries embedded in multiple programs, so I am trying to figure out how to deduplicate CLBlast between KataGo and Leela Zero, so that both can use a common system version of CLBlast.

I'm unfamiliar with the OpenCL ecosystem and how "system artifacts" (such as these kernels, perhaps) should be laid out when installed, so I would like your help. I notice that the CLBlast install process (cmake . && make && make install DESTDIR=tmp) does not install these kernel files anywhere; it installs only libclblast.so, some C headers, some test binary programs for tuning, and some other metadata.

It seems the way to use these kernels is to just #include them directly. So perhaps CLBlast should install them into /usr/include, and KataGo/Leela Zero could include them from these system paths? Does that seem to fit with how things work in the OpenCL world? If so, I could ask CLBlast to install them there, and then KataGo could include them as if they were regular system C headers.
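Roughly, the #include-as-string pattern I have in mind would look something like this. The path and file name here are purely illustrative, and I'm assuming the kernel files are wrapped as raw string literals the way I believe CLBlast's own sources do it:

```cpp
#include <string>

// Illustrative only: if each .opencl file is wrapped in R"( ... )",
// then #include pastes its contents into the program as a C++ raw string literal.
static const std::string kXgemmSource =
#include "clblast/xgemm_part1.opencl"  // hypothetical system include path
;
```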

lightvector commented 3 years ago

Cool, thanks for reaching out!

It is possible that there are different "ways" to use OpenCL that I'm unfamiliar with, but my understanding is that OpenCL is compiled at runtime, and KataGo (and I think Leela Zero) uses it in that way. The literal ASCII text of the OpenCL kernel ends up embedded as a constant static string in the C++ code. When the user runs the program and loads a neural network, these strings, which contain the C-like-syntax source code for the kernels, are fed through the OpenCL API. The API dispatches to whichever OpenCL implementation the user has on their system (NVIDIA, AMD, Intel, or whatever), and that vendor-specific compiler turns those bits of source code into GPU or device code at that moment, just before actual use.
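A minimal sketch of that flow, using the standard OpenCL C API (error handling omitted, kernel made up for illustration):

```cpp
#include <CL/cl.h>

// The kernel source lives in the binary as an ordinary string constant.
static const char* kKernelSource = R"(
__kernel void scale(__global float* data, const float factor) {
  data[get_global_id(0)] *= factor;
}
)";

cl_program buildProgram(cl_context context, cl_device_id device) {
  cl_int err = 0;
  // Hand the raw source text to whatever OpenCL implementation is installed...
  cl_program program =
      clCreateProgramWithSource(context, 1, &kKernelSource, nullptr, &err);
  // ...which runs the vendor's compiler right now, just before use.
  clBuildProgram(program, 1, &device, "", nullptr, nullptr);
  return program;
}
```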

The issue that you might run into is that I think both Leela Zero and KataGo have made changes to the OpenCL code that have never been upstreamed back into CLBlast. Some of these changes may be specific to the particular use cases in KataGo and Leela Zero, or at least not as thoroughly tested for general use. Also, unfortunately, Leela Zero is under the GPL, which is incompatible with KataGo's license, so KataGo was unable to reuse Leela Zero's changes to CLBlast and ended up implementing its own changes independently when changes were needed.

That's basically the state of things. What are your thoughts?

lightvector commented 3 years ago

And to clarify: aside from the fact that the user needs an OpenCL platform from some vendor installed on their system to include and link against, there aren't any further dependencies or artifacts involved with OpenCL code. At that point, the OpenCL API morally boils down to: "hand me a const char* containing kernel source at runtime, and I'll compile it and give you back an opaque handle that you can use to run instances of that kernel on the user's hardware accelerator device".
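Concretely, using that opaque handle looks roughly like this (again just a sketch with a made-up kernel name, not KataGo's actual code):

```cpp
#include <CL/cl.h>

// Given a compiled program (see the sketch above), get an opaque kernel
// handle and run it on the user's device. Error handling omitted.
void runScale(cl_program program, cl_command_queue queue,
              cl_mem buffer, size_t n, float factor) {
  cl_int err = 0;
  cl_kernel kernel = clCreateKernel(program, "scale", &err);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
  clSetKernelArg(kernel, 1, sizeof(float), &factor);
  size_t globalSize = n;
  clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize,
                         nullptr, 0, nullptr, nullptr);
  clReleaseKernel(kernel);
}
```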

infinity0 commented 3 years ago

Hey, thanks for the info!

When the user runs the program and loads a neural network, these strings, which contain the C-like-syntax source code for the kernels, are fed through the OpenCL API,

That could make part of this process easier: since the strings are only needed at runtime, they could in theory be loaded from a standard location on the system rather than from within the program itself.
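For illustration, something like the following, where the install path is purely hypothetical and would be whatever the CLBlast package chooses:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical: read a kernel file shipped by a system CLBlast package
// instead of using a string compiled into the binary.
std::string loadKernelSource(const std::string& name) {
  std::ifstream in("/usr/share/clblast/kernels/" + name);  // path is made up
  std::ostringstream buf;
  buf << in.rdbuf();
  return buf.str();  // same text that would otherwise be embedded
}
```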

The issue that you might run into is that I think both Leela Zero and KataGo have made changes to the OpenCL code that have never been upstreamed back into CLBlast.

Yes, this is why at Debian we encourage sharing, so that everyone can benefit from each other's improvements.

As I understand it, the main topic of relevance here is the API/ABI of the loaded code; presumably the program has to interact with it somehow? If the changes KataGo has made to CLBlast alter the interface, then the task is harder, but if they are just bugfixes or internal logic tweaks, then they could be submitted back to CLBlast as patches. Could you give a summary of the changes you made, whether they'd be suitable for submitting back to CLBlast, and whether KataGo could work without them?

Also, is there any particular reason you are using these kernels directly, rather than going through the "regular" BLAS C/C++ API? I have now seen two programs using them directly, so I wonder if CLBlast should support this as an official use case.
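For comparison, here is roughly what going through CLBlast's regular C++ API looks like (buffer setup and error handling omitted; treat the exact call as approximate):

```cpp
#include <clblast.h>

// Single-precision GEMM via the library API: CLBlast compiles and tunes
// its own kernels internally, so the caller never sees any kernel source.
void sgemm(cl_command_queue queue, size_t m, size_t n, size_t k,
           cl_mem a, cl_mem b, cl_mem c) {
  auto status = clblast::Gemm<float>(
      clblast::Layout::kRowMajor, clblast::Transpose::kNo, clblast::Transpose::kNo,
      m, n, k,
      1.0f, a, 0, k,    // A: m x k, leading dimension k
            b, 0, n,    // B: k x n, leading dimension n
      0.0f, c, 0, n,    // C: m x n, leading dimension n
      &queue, nullptr);
  (void)status;        // a real caller would check the StatusCode
}
```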

lightvector commented 3 years ago

Just to circle back on this - I'm still aware that this is open, but have not gotten around to it, and might not get to it for quite a while more.

Anyways, grepping comments I left in the code, it seems like the major change is the following:

// MODIFIED from the original by David Wu ("lightvector") to add FP16 storage with FP32 compute as an option.

So at least at the time, CLBlast didn't allow the storage type and the compute type to differ in precision, which on some GPUs I believe can be a little faster, or at least less demanding on the GPU's memory.
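As a toy illustration of the idea (not KataGo's actual kernel): the operands live in memory as half precision but are loaded into and accumulated as float:

```cpp
// Toy example embedded as kernel source: FP16 storage, FP32 compute,
// using vload_half/vstore_half to convert at the memory boundary.
static const char* kFp16StorageExample = R"(
__kernel void axpy_fp16_storage(__global const half* x,
                                __global half* y,
                                const float alpha) {
  const int i = get_global_id(0);
  const float xv = vload_half(i, x);   // FP16 in memory -> FP32 register
  const float yv = vload_half(i, y);
  vstore_half(alpha * xv + yv, i, y);  // FP32 result -> FP16 in memory
}
)";
```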

Using the regular API probably would have been possible. It wasn't a choice I had much spare bandwidth to think about at the time, in the middle of other demanding things, and I knew that the way Leela Zero did it was working well for them, so I mostly just mimicked them. So I guess the question might really be why Leela Zero did it this way.