Closed: peastman closed this issue 4 years ago.
This is an excellent and forward-looking idea, but this problem can't be unique to OpenMM. How are other codes that need to support a variety of GPU (or other hardware-accelerated) languages handling this?
Are libraries like XLA advanced enough to allow the expressivity you need while also supporting the variety of backends you envision? These libraries have a great deal of developer time behind them, which could be advantageous in getting access to new backends for free.
On the other hand, your envisioned reworking may minimize the effort necessary to add support for new platforms, provided the differences are mostly cosmetic; if they aren't, you would have to fully abstract the behavior into classes that could contain radically different implementations.
XLA is a low level compiler back end designed specifically for TensorFlow. It isn't something that could be used for this.
The closest I've seen is ROCm, which is AMD's new project. It consists of the following parts:
In particular, they currently have two supported kernel languages. One is OpenCL. The other, called HIP, is designed to be as close as possible to CUDA. The goal is to make it as easy as possible to convert existing CUDA programs. In fact, they have a tool that will process your source code and convert it for you. And then you can support both platforms with a single code base, either compiling with AMD's compiler for their own GPUs, or running it through nvcc with a set of headers that allow HIP code to be compiled as CUDA.
It's a clever idea, and I can see why they've chosen this approach. It remains to be seen whether it will catch on and many people will use it. And of course, it's still limited to AMD and NVIDIA GPUs. If Intel decided to support it too, that would make it more interesting.
XLA isn't completely TensorFlow-specific. I mentioned it because projects like JAX have grown up around it as a way to translate from a very convenient high-level representation (numpy) to target multiple backends. Projects like jax-md and timemachine have used this to create extremely concise, highly maintainable MD codes that can benefit from additional backend targets added to XLA.
ROCm looks interesting! I'll take a deeper look.
Then there's always Javascript, which seems to have taken over as the "universal language" that can run on everything, even GPUs...
XLA isn't completely TensorFlow-specific. I mentioned it because projects like JAX have grown up around it
JAX is functionally very similar to TensorFlow: a library for linear algebra and tensor operations. Hence, the compiler backend designed for one is also suitable for the other. Neither one is a general purpose compute language in the sense of CUDA or OpenCL.
Is OpenACC still a thing?
Yes, but it's a different sort of system. Its goal is to make it very easy to perform limited classes of calculations on a GPU. If you want some acceleration for minimal effort, OpenACC is a good approach. If you want maximum performance and you're willing to work to get it, then it's not the right choice.
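To make the contrast concrete, here is a minimal sketch (not from this thread) of the OpenACC style: acceleration is requested with a single directive on an existing loop, and compilers without OpenACC support simply ignore the pragma, so the same code still runs serially on the host.

```cpp
#include <cstddef>
#include <vector>

// Sketch of OpenACC's "minimal effort" model: one directive asks the
// compiler to offload the loop. Without an OpenACC compiler the pragma
// is ignored and the loop runs serially, unchanged.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma acc parallel loop
    for (size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}
```

This is the trade-off described above: the directive is easy to add, but it gives the programmer far less control over memory layout and thread organization than hand-written CUDA or OpenCL kernels.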
I think having a domain specific language for physics-based engines is pretty appealing. For me, two things I'd like to see:
I'd really like to one day merge the timemachine back into OpenMM, but I'd need the above to make it doable.
That's a little beyond the scope of what I was contemplating! Though of course the code to generate CUDA or OpenCL from Lepton expressions is already there (in CudaExpressionUtilities and OpenCLExpressionUtilities). But mostly I'm just trying to reduce code duplication, not make writing GPU code easier.
This was the first time I had to translate code from CUDA to OpenCL, so I am speaking from limited experience. That said, I really like the idea, mainly because the syntax changes in our Nose-Hoover code specifically were so marginal. In other words, you would not have to come up with a "new language" to make this happen.
I would still anticipate that the effort for writing OpenMM GPU code would be comparable to what it currently is. Instead of converting CUDA kernels to OpenCL, it would be a port from CUDA to "OpenMM-CUDA" (at least when somebody encounters this for the first time). That should still be easy enough if the keywords are well documented. On the plus side, I really like the maintainability aspect of it and all the gains that you get from avoiding code duplication.
Maybe you could provide a little preprocessor for the CUDA compilation that walks through the CUDA kernels and throws warnings if it encounters anything that would not be properly converted to OpenCL. Something like "I found (blockDim.x*gridDim.x), please replace it by GLOBAL_SIZE".
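Such a check could be a simple pattern scan. The sketch below is purely illustrative; the function name and the rule table (LOCAL_ID, SYNC_THREADS, and so on) are hypothetical, not OpenMM's actual macro set.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical portability check: scan kernel source for CUDA-only
// constructs that have a portable macro equivalent, and report the
// suggested replacement. The rule names here are illustrative.
struct PortabilityRule { std::string pattern, replacement; };

std::vector<std::string> findPortabilityWarnings(const std::string& source) {
    static const std::vector<PortabilityRule> rules = {
        {"blockDim.x*gridDim.x", "GLOBAL_SIZE"},
        {"threadIdx.x",          "LOCAL_ID"},
        {"__syncthreads()",      "SYNC_THREADS"},
    };
    std::vector<std::string> warnings;
    for (const auto& rule : rules)
        if (source.find(rule.pattern) != std::string::npos)
            warnings.push_back("I found (" + rule.pattern +
                               "), please replace it by " + rule.replacement);
    return warnings;
}
```

Running a scan like this over each kernel file before invoking nvcc would catch stray CUDA-only constructs before they silently break the OpenCL build.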
I think it is a great idea!
I'd suggest taking a look at Reikna (http://reikna.publicfields.net/en/latest/) - not sure how active its development is these days, but its aim is(was) exactly what you're suggesting here: to abstract away the syntactic differences between OpenCL and CUDA. Likely a reasonable source of ideas.
Thanks! It looks like they have some very similar ideas. Like this example kernel in their documentation:
KERNEL void multiply_them(
GLOBAL_MEM float *dest,
GLOBAL_MEM float *a,
GLOBAL_MEM float *b)
{
const SIZE_T i = get_local_id(0);
dest[i] = a[i] * b[i];
}
And right on schedule, Intel just announced a beta of oneAPI, their new API for programming GPUs (and other processors). https://software.intel.com/sites/default/files/oneAPIProgrammingGuide_5.pdf. It looks like it's basically a layer on top of SYCL, which in turn is a layer on top of OpenCL. So thank goodness, they aren't trying to push a new language or programming model on us. Once you get below the surface, it's still just OpenCL.
I also think this is an excellent idea. Now that we've got a little info about oneAPI, it looks to me like a unified framework is feasible. One thing that might be worth considering is requiring OpenCL 2.1 (circa 2015) or even 2.2 (circa 2017), to get C++ features and bring it in line with the CUDA code - little differences like not being able to make references cause subtle differences in the OpenCL and CUDA kernels. I know that OpenMM offers support for a wide range of GPUs - does anybody know if bumping the OpenCL requirements would hurt in that regard?
I think the problem there is that Apple froze support for OpenCL at 1.2. Of course, given that they're now planning to deprecate OpenCL support entirely, I suppose in one sense that's less of an issue than it used to be...
I believe NVIDIA is also still stuck at 1.2. A couple of years ago they announced beta support for a subset of OpenCL 2.0 features, but I don't think anything has happened beyond that.
I have the beginnings of an implementation at #2488.
I guess this just became a lot more important now that CUDA will cease support for osx
after CUDA 10.2: see the CUDA 10.2 Release Notes.
CUDA will cease support for osx after CUDA 10.2
They already stopped driver support earlier this year as well for the new OSX, so I guess this was the next logical step
LAMMPS has several variants of GPU offloading. In the GPU package there is a unification layer on top of CUDA and OpenCL called the Geryon library (see https://doi.org/10.1016/j.cpc.2010.12.021).
Recently, together with @eokuznetsov and @vsevak we have added ROCm HIP backend for this library (https://github.com/Vsevak/lammps/tree/gpu_hip_port/lib/gpu). The resulting performance of LAMMPS on AMD Radeon VII is about 20% better with the HIP-backend than with the OpenCL-backend.
Thanks, that's interesting to know. The URL for Geryon given in the paper doesn't work: http://users.nccs.gov/~wb8/geryon/index.htm. Do you know what the correct one is?
Also, I forgot to close this issue after merging #2488!
You can find the library here https://github.com/lammps/lammps/tree/master/lib/gpu/geryon (modified to remove files not necessary for the LAMMPS implementation).
One more thing concerning the subject: with the HIP-backend for LAMMPS we see nearly zero overhead vs the CUDA-backend when nvcc is used for HIP compilation for Nvidia GPUs. That confirms the published results (https://doi.org/10.1109/ISPASS.2018.00034).
@vvsteg: This is neat! Does the documentation live somewhere else now? The link is broken in the README:
NOTE: This Geryon distribution has been modified to remove files not necessary for the LAMMPS implementation. The full distribution is available at http://users.nccs.gov/~wb8/geryon/index.htm
@jchodera: Google gives a couple of links that seem to contain a full version: https://github.com/scicomp/geryon and a bit newer https://framagit.org/adakite/LIGGGHTS-PUBLIC/tree/c64be22f712f309958589148d5a200e2bbcbad2c/lib/gpu/geryon
Or perhaps it is better to contact the author directly https://sites.google.com/site/wmbrown85/
Thanks @vvsteg!
Btw, I have compared the OpenMM performance on Titan V (CUDA/OpenCL) and Radeon VII (OpenCL) using the ApoA1 test in mixed precision:
The specs of these cards are quite close, but the Titan V with CUDA is more than 2 times faster than the Radeon VII with OpenCL (55.2 ns/day vs 25.4 ns/day).
At the same time OpenCL run on Titan V is only about 8% slower than the CUDA run (51.2 ns/day).
Are there any ideas why the OpenCL OpenMM code on AMD GPUs is that slow?
I'm not certain, but I've observed similar results. I've always suspected AMD's compiler wasn't producing very good code. That's supported by this comment from a few days ago, which says some compute benchmarks are much faster when using Metal instead of OpenCL: https://github.com/openmm/openmm/issues/2525#issuecomment-575294700
That's interesting. We have compared our new HIP-backend for LAMMPS with the OpenCL-backend on Radeon VII using a variant of the ApoA1 benchmark. With HIP we observe 30% higher performance (10.7 ns/day vs 8.0 ns/day). Presumably, it implies that the code produced from HIP is better. These numbers have been obtained a year ago with ROCm 2.2. Hope to check ROCm 3.0 soon.
This is an idea I've been thinking about recently. It isn't necessarily something we want to do right away. I just want to record my thoughts and see what other people have to say about it.
Large parts of the CUDA and OpenCL platforms are nearly identical. Converting a feature from one to the other is largely a matter of duplicating the code, then doing a series of find/replace translations. This certainly isn't ideal. Having duplicated code means changes have to be made in two places, and it creates a risk that a bug fixed in one won't get fixed in the other. It also increases the work needed to implement new features, since you need to create two similar but not identical versions of it.
In addition, there are lots of other compute APIs on the horizon. AMD is pushing ROCm. Apple wants everyone to use Metal. Supposedly Intel is working on yet something else for their upcoming discrete GPUs, though I haven't seen any details. And who knows what else might come along in the future? Currently I have no plans to add support for any of these, but it's possible we might want to some day, and we'd be much more likely to do it if it didn't require so much work.
I think there's a lot of room to reduce the amount of duplication. For example, OpenCLArray and CudaArray could share an abstract parent class that defines a common API. Likewise for OpenCLContext and CudaContext. The code used to define kernel arguments is different in OpenCL and CUDA, but it would be easy to create an abstract ComputeKernel class that provides a unified API for both of them. This would allow a large fraction of the C++ code to be shared between the two platforms, written in terms of abstract superclasses.
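A minimal sketch of that layering might look like the following. The class and method names here are hypothetical stand-ins, not OpenMM's actual API, and the "device" buffer is kept on the host purely so the example can run anywhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the proposed abstraction: platform-neutral C++
// code talks only to ArrayInterface, while each platform supplies a
// concrete subclass (OpenCLArray, CudaArray, ...).
class ArrayInterface {
public:
    virtual ~ArrayInterface() = default;
    virtual size_t getSize() const = 0;
    virtual void upload(const void* data) = 0;
    virtual void download(void* data) const = 0;
};

// A real CudaArray/OpenCLArray would wrap a device buffer; this stand-in
// keeps the data in host memory so the example is runnable without a GPU.
class HostArray : public ArrayInterface {
    std::vector<char> buffer;
    size_t size;
public:
    HostArray(size_t elements, size_t elementSize)
        : buffer(elements * elementSize), size(elements) {}
    size_t getSize() const override { return size; }
    void upload(const void* data) override {
        std::copy((const char*) data, (const char*) data + buffer.size(),
                  buffer.begin());
    }
    void download(void* data) const override {
        std::copy(buffer.begin(), buffer.end(), (char*) data);
    }
};
```

Shared code written against ArrayInterface (and, analogously, an abstract ComputeContext and ComputeKernel) would never need to mention a CUDA or OpenCL type directly.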
It might even be possible to share kernel code between the platforms. CUDA and OpenCL code are really very similar. A lot of the differences are minor matters of syntax. We could define a common set of functions and macros that would cover most of them. For example,
KERNEL would be defined as __kernel on OpenCL and as extern "C" __global__ on CUDA. GLOBAL_SIZE would be defined as get_global_size(0) on OpenCL and as (blockDim.x*gridDim.x) on CUDA. This would make it possible to write a single kernel that would compile on both.

Clearly it would be a huge task to convert everything to work this way. But in the long run, it might still be a good idea.
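A sketch of how such a header and a shared kernel could look, using the macro names above. The serial host branch at the end is not part of the proposal; it is added here only so the example can compile and run without a GPU, and the extra GLOBAL_ID macro is an assumed companion to GLOBAL_SIZE.

```cpp
// Map portable macros onto each backend. The CUDA and OpenCL branches
// mirror the definitions described above; the final branch is a plain
// serial fallback included purely so this sketch runs on the host.
#if defined(__CUDACC__)
  #define KERNEL extern "C" __global__
  #define GLOBAL_SIZE (blockDim.x*gridDim.x)
  #define GLOBAL_ID (blockIdx.x*blockDim.x+threadIdx.x)
  #define GLOBAL
#elif defined(__OPENCL_VERSION__)
  #define KERNEL __kernel
  #define GLOBAL_SIZE get_global_size(0)
  #define GLOBAL_ID get_global_id(0)
  #define GLOBAL __global
#else  // serial host fallback, for illustration only
  static int globalId = 0, globalSize = 1;
  #define KERNEL
  #define GLOBAL_SIZE globalSize
  #define GLOBAL_ID globalId
  #define GLOBAL
#endif

// One kernel source that compiles under all three sets of definitions,
// using the usual grid-stride loop so any launch size covers all elements.
KERNEL void multiplyThem(GLOBAL float* dest, GLOBAL const float* a,
                         GLOBAL const float* b, int n) {
    for (int i = GLOBAL_ID; i < n; i += GLOBAL_SIZE)
        dest[i] = a[i] * b[i];
}
```

On the host fallback GLOBAL_ID is 0 and GLOBAL_SIZE is 1, so the loop simply visits every element in order; under CUDA or OpenCL the same source distributes the elements across work items.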