Closed: peastman closed this issue 4 years ago.
This is an excellent and forward-looking idea, but this problem can't be unique to OpenMM. How are other codes that need to support a variety of GPU (or other hardware-accelerated) languages handling this?
Are libraries like XLA advanced enough to allow the expressivity you need while also supporting the variety of backends you envision? These libraries have a great deal of developer time behind them, which could be advantageous in getting access to new backends for free.
On the other hand, your envisioned reworking may minimize the effort necessary to add support for new platforms, provided the differences are mostly cosmetic; if they aren't, you would have to fully abstract the behavior into classes that could contain radically different implementations.
XLA is a low level compiler back end designed specifically for TensorFlow. It isn't something that could be used for this.
The closest I've seen is ROCm, which is AMD's new project. It consists of the following parts:
In particular, they currently have two supported kernel languages. One is OpenCL. The other, called HIP, is designed to be as close as possible to CUDA. The goal is to make it as easy as possible to convert existing CUDA programs. In fact, they have a tool that will process your source code and convert it for you. And then you can support both platforms with a single code base, either compiling with AMD's compiler for their own GPUs, or running it through nvcc with a set of headers that allow HIP code to be compiled as CUDA.
It's a clever idea, and I can see why they've chosen this approach. It remains to be seen whether it will catch on and many people will use it. And of course, it's still limited to AMD and NVIDIA GPUs. If Intel decided to support it too, that would make it more interesting.
XLA isn't completely TensorFlow-specific. I mentioned it because projects like JAX have grown up around it as a way to translate from a very convenient high-level representation (numpy) to target multiple backends. Projects like jax-md and timemachine have used this to create extremely concise, highly maintainable MD codes that can benefit from additional backend targets added to XLA.
ROCm looks interesting! I'll take a deeper look.
Then there's always Javascript, which seems to have taken over as the "universal language" that can run on everything, even GPUs...
XLA isn't completely TensorFlow-specific. I mentioned it because projects like JAX have grown up around it
JAX is functionally very similar to TensorFlow: a library for linear algebra and tensor operations. Hence, the compiler backend designed for one is also suitable for the other. Neither one is a general purpose compute language in the sense of CUDA or OpenCL.
Is OpenACC still a thing?
Yes, but it's a different sort of system. Its goal is to make it very easy to perform limited classes of calculations on a GPU. If you want some acceleration for minimal effort, OpenACC is a good approach. If you want maximum performance and you're willing to work to get it, then it's not the right choice.
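To make the contrast concrete, here is a minimal sketch (not from this thread) of the OpenACC style: acceleration is requested with a single directive on an existing loop, and compilers without OpenACC support simply ignore the pragma, so the same code still runs serially on the host.

```cpp
#include <cstddef>
#include <vector>

// Sketch of OpenACC's "minimal effort" model: one directive asks the
// compiler to offload the loop. Without an OpenACC compiler the pragma
// is ignored and the loop runs serially, unchanged.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma acc parallel loop
    for (size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}
```

This is the trade-off described above: the directive is easy to add, but it gives the programmer far less control over memory layout and thread organization than hand-written CUDA or OpenCL kernels.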
I think having a domain specific language for physics-based engines is pretty appealing. For me, two things I'd like to see:
I'd really like to one day merge the timemachine back into OpenMM, but I'd need the above to make it doable.
That's a little beyond the scope of what I was contemplating! Though of course the code to generate CUDA or OpenCL from Lepton expressions is already there (in CudaExpressionUtilities and OpenCLExpressionUtilities). But mostly I'm just trying to reduce code duplication, not make writing GPU code easier.
This was the first time I had to translate code from CUDA to OpenCL, so I am speaking from limited experience. That said, I really like the idea, mainly because the syntax changes in our Nose-Hoover code specifically were so marginal. In other words, you would not have to come up with a "new language" to make this happen.
I would still anticipate that the effort for writing OpenMM GPU code would be comparable to what it currently is. Instead of converting CUDA kernels to OpenCL, it would be a port from CUDA to "OpenMM-CUDA" (at least when somebody encounters this for the first time). That should still be easy enough if the keywords are well documented. On the plus side, I really like the maintainability aspect of it and all the gains that you get from avoiding code duplication.
Maybe you could provide a little preprocessor for the CUDA compilation that walks through the CUDA kernels and throws warnings if it encounters anything that would not be properly converted to OpenCL. Something like "I found (blockDim.x*gridDim.x), please replace it by GLOBAL_SIZE".
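Such a check could be a simple pattern scan. The sketch below is purely illustrative; the function name and the rule table (LOCAL_ID, SYNC_THREADS, and so on) are hypothetical, not OpenMM's actual macro set.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical portability check: scan kernel source for CUDA-only
// constructs that have a portable macro equivalent, and report the
// suggested replacement. The rule names here are illustrative.
struct PortabilityRule { std::string pattern, replacement; };

std::vector<std::string> findPortabilityWarnings(const std::string& source) {
    static const std::vector<PortabilityRule> rules = {
        {"blockDim.x*gridDim.x", "GLOBAL_SIZE"},
        {"threadIdx.x",          "LOCAL_ID"},
        {"__syncthreads()",      "SYNC_THREADS"},
    };
    std::vector<std::string> warnings;
    for (const auto& rule : rules)
        if (source.find(rule.pattern) != std::string::npos)
            warnings.push_back("I found (" + rule.pattern +
                               "), please replace it by " + rule.replacement);
    return warnings;
}
```

Running a scan like this over each kernel file before invoking nvcc would catch stray CUDA-only constructs before they silently break the OpenCL build.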
I think it is a great idea!
I'd suggest taking a look at Reikna (http://reikna.publicfields.net/en/latest/) - not sure how active its development is these days, but its aim is(was) exactly what you're suggesting here: to abstract away the syntactic differences between OpenCL and CUDA. Likely a reasonable source of ideas.
Thanks! It looks like they have some very similar ideas. Like this example kernel in their documentation:
KERNEL void multiply_them(
GLOBAL_MEM float *dest,
GLOBAL_MEM float *a,
GLOBAL_MEM float *b)
{
const SIZE_T i = get_local_id(0);
dest[i] = a[i] * b[i];
}
And right on schedule, Intel just announced a beta of oneAPI, their new API for programming GPUs (and other processors). https://software.intel.com/sites/default/files/oneAPIProgrammingGuide_5.pdf. It looks like it's basically a layer on top of SYCL, which in turn is a layer on top of OpenCL. So thank goodness, they aren't trying to push a new language or programming model on us. Once you get below the surface, it's still just OpenCL.
I also think this is an excellent idea. Now that we've got a little info about oneAPI, it looks to me like a unified framework is feasible. One thing that might be worth considering is requiring OpenCL 2.1 (circa 2015) or even 2.2 (circa 2017), to get C++ features and bring it in line with the CUDA code - little differences like not being able to make references cause subtle differences in the OpenCL and CUDA kernels. I know that OpenMM offers support for a wide range of GPUs - does anybody know if bumping the OpenCL requirements would hurt in that regard?
I think the problem there is that Apple froze support for OpenCL at 1.2. Of course, given that they're now planning to deprecate OpenCL support entirely, I suppose in one sense that's less of an issue than it used to be...
I believe NVIDIA is also still stuck at 1.2. A couple of years ago they announced beta support for a subset of OpenCL 2.0 features, but I don't think anything has happened beyond that.
I have the beginnings of an implementation at #2488.
I guess this just became a lot more important now that CUDA will cease support for osx
after CUDA 10.2: see the CUDA 10.2 Release Notes.
CUDA will cease support for osx after CUDA 10.2
They already stopped driver support earlier this year as well for the new OSX, so I guess this was the next logical step
LAMMPS has several variants of GPU offloading. In the GPU package there is a unification layer on top of CUDA and OpenCL called the Geryon library (see https://doi.org/10.1016/j.cpc.2010.12.021).
Recently, together with @eokuznetsov and @vsevak we have added ROCm HIP backend for this library (https://github.com/Vsevak/lammps/tree/gpu_hip_port/lib/gpu). The resulting performance of LAMMPS on AMD Radeon VII is about 20% better with the HIP-backend than with the OpenCL-backend.
Thanks, that's interesting to know. The URL for Geryon given in the paper doesn't work: http://users.nccs.gov/~wb8/geryon/index.htm. Do you know what the correct one is?
Also, I forgot to close this issue after merging #2488!
You can find the library here https://github.com/lammps/lammps/tree/master/lib/gpu/geryon (modified to remove files not necessary for the LAMMPS implementation).
One more thing concerning the subject: with the HIP-backend for LAMMPS we see nearly zero overhead vs the CUDA-backend when nvcc is used for HIP compilation for Nvidia GPUs. That confirms the published results (https://doi.org/10.1109/ISPASS.2018.00034).
@vvsteg: This is neat! Does the documentation live somewhere else now? The link is broken in the README:
NOTE: This Geryon distribution has been modified to remove files not necessary for the LAMMPS implementation. The full distribution is available at http://users.nccs.gov/~wb8/geryon/index.htm
@jchodera: Google gives a couple of links that seem to contain a full version: https://github.com/scicomp/geryon and a bit newer https://framagit.org/adakite/LIGGGHTS-PUBLIC/tree/c64be22f712f309958589148d5a200e2bbcbad2c/lib/gpu/geryon
Or perhaps it is better to contact the author directly https://sites.google.com/site/wmbrown85/
Thanks @vvsteg!
Btw, I have compared the OpenMM performance on Titan V (CUDA/OpenCL) and Radeon VII (OpenCL) using the ApoA1 test in mixed precision:
The specs of these cards are quite close, but the Titan V with CUDA is more than 2 times faster than the Radeon VII with OpenCL (55.2 ns/day vs 25.4 ns/day).
At the same time OpenCL run on Titan V is only about 8% slower than the CUDA run (51.2 ns/day).
Are there any ideas why the OpenCL OpenMM code on AMD GPUs is that slow?
I'm not certain, but I've observed similar results. I've always suspected AMD's compiler wasn't producing very good code. That's supported by this comment from a few days ago, which says some compute benchmarks are much faster when using Metal instead of OpenCL: https://github.com/openmm/openmm/issues/2525#issuecomment-575294700
That's interesting. We have compared our new HIP-backend for LAMMPS with the OpenCL-backend on Radeon VII using a variant of the ApoA1 benchmark. With HIP we observe 30% higher performance (10.7 ns/day vs 8.0 ns/day). Presumably, it implies that the code produced from HIP is better. These numbers have been obtained a year ago with ROCm 2.2. Hope to check ROCm 3.0 soon.
This is an idea I've been thinking about recently. It isn't necessarily something we want to do right away. I just want to record my thoughts and see what other people have to say about it.
Large parts of the CUDA and OpenCL platforms are nearly identical. Converting a feature from one to the other is largely a matter of duplicating the code, then doing a series of find/replace translations. This certainly isn't ideal. Having duplicated code means changes have to be made in two places, and it creates a risk that a bug fixed in one won't get fixed in the other. It also increases the work needed to implement new features, since you need to create two similar but not identical versions of it.
In addition, there are lots of other compute APIs on the horizon. AMD is pushing ROCm. Apple wants everyone to use Metal. Supposedly Intel is working on yet something else for their upcoming discrete GPUs, though I haven't seen any details. And who knows what else might come along in the future? Currently I have no plans to add support for any of these, but it's possible we might want to some day, and we'd be much more likely to do it if it didn't require so much work.
I think there's a lot of room to reduce the amount of duplication. For example, OpenCLArray and CudaArray could share an abstract parent class that defines a common API. Likewise for OpenCLContext and CudaContext. The code used to define kernel arguments is different in OpenCL and CUDA, but it would be easy to create an abstract ComputeKernel class that provides a unified API for both of them. This would allow a large fraction of the C++ code to be shared between the two platforms, written in terms of abstract superclasses.
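A minimal sketch of that layering might look like the following. The class and method names here are hypothetical stand-ins, not OpenMM's actual API, and the "device" buffer is kept on the host purely so the example can run anywhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the proposed abstraction: platform-neutral C++
// code talks only to ArrayInterface, while each platform supplies a
// concrete subclass (OpenCLArray, CudaArray, ...).
class ArrayInterface {
public:
    virtual ~ArrayInterface() = default;
    virtual size_t getSize() const = 0;
    virtual void upload(const void* data) = 0;
    virtual void download(void* data) const = 0;
};

// A real CudaArray/OpenCLArray would wrap a device buffer; this stand-in
// keeps the data in host memory so the example is runnable without a GPU.
class HostArray : public ArrayInterface {
    std::vector<char> buffer;
    size_t size;
public:
    HostArray(size_t elements, size_t elementSize)
        : buffer(elements * elementSize), size(elements) {}
    size_t getSize() const override { return size; }
    void upload(const void* data) override {
        std::copy((const char*) data, (const char*) data + buffer.size(),
                  buffer.begin());
    }
    void download(void* data) const override {
        std::copy(buffer.begin(), buffer.end(), (char*) data);
    }
};
```

Shared code written against ArrayInterface (and, analogously, an abstract ComputeContext and ComputeKernel) would never need to mention a CUDA or OpenCL type directly.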
It might even be possible to share kernel code between the platforms. CUDA and OpenCL code are really very similar. A lot of the differences are minor matters of syntax. We could define a common set of functions and macros that would cover most of them. For example,
KERNEL would be defined as __kernel on OpenCL and as extern "C" __global__ on CUDA. GLOBAL_SIZE would be defined as get_global_size(0) on OpenCL and as (blockDim.x*gridDim.x) on CUDA. This would make it possible to write a single kernel that would compile on both.

Clearly it would be a huge task to convert everything to work this way. But in the long run, it might still be a good idea.
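A sketch of how such a header and a shared kernel could look, using the macro names above. The serial host branch at the end is not part of the proposal; it is added here only so the example can compile and run without a GPU, and the extra GLOBAL_ID macro is an assumed companion to GLOBAL_SIZE.

```cpp
// Map portable macros onto each backend. The CUDA and OpenCL branches
// mirror the definitions described above; the final branch is a plain
// serial fallback included purely so this sketch runs on the host.
#if defined(__CUDACC__)
  #define KERNEL extern "C" __global__
  #define GLOBAL_SIZE (blockDim.x*gridDim.x)
  #define GLOBAL_ID (blockIdx.x*blockDim.x+threadIdx.x)
  #define GLOBAL
#elif defined(__OPENCL_VERSION__)
  #define KERNEL __kernel
  #define GLOBAL_SIZE get_global_size(0)
  #define GLOBAL_ID get_global_id(0)
  #define GLOBAL __global
#else  // serial host fallback, for illustration only
  static int globalId = 0, globalSize = 1;
  #define KERNEL
  #define GLOBAL_SIZE globalSize
  #define GLOBAL_ID globalId
  #define GLOBAL
#endif

// One kernel source that compiles under all three sets of definitions,
// using the usual grid-stride loop so any launch size covers all elements.
KERNEL void multiplyThem(GLOBAL float* dest, GLOBAL const float* a,
                         GLOBAL const float* b, int n) {
    for (int i = GLOBAL_ID; i < n; i += GLOBAL_SIZE)
        dest[i] = a[i] * b[i];
}
```

On the host fallback GLOBAL_ID is 0 and GLOBAL_SIZE is 1, so the loop simply visits every element in order; under CUDA or OpenCL the same source distributes the elements across work items.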