glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

Convert from HIP to hipper #1063

Open joaander opened 3 years ago

joaander commented 3 years ago

Description

Replace all HIP calls with hipper calls. Continues #427.

Motivation and context

HIP is complex, out of our control, gets in the way, and often breaks things. There is no need to use it for CUDA builds. hipper is a thinner translation layer, developed by @mphoward, that works around these issues by using CUDA directly for CUDA builds and falling back to HIP only for AMD builds.
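For readers unfamiliar with the approach, the core idea is a thin header that maps a common API straight to the CUDA runtime when building for CUDA and falls back to HIP only when building for AMD. A minimal sketch of that pattern (illustrative `gpu*` names, not hipper's actual API):

```cpp
// Minimal sketch of the translation-layer pattern with illustrative gpu*
// names (not hipper's actual API). CUDA builds call the CUDA runtime
// directly, with no HIP layer in between; AMD builds fall back to HIP.
#include <cstddef>

#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
using gpuError_t = hipError_t;
inline gpuError_t gpuMalloc(void** ptr, std::size_t bytes) { return hipMalloc(ptr, bytes); }
inline gpuError_t gpuFree(void* ptr) { return hipFree(ptr); }
inline gpuError_t gpuDeviceSynchronize() { return hipDeviceSynchronize(); }
#else
#include <cuda_runtime.h>
using gpuError_t = cudaError_t;
inline gpuError_t gpuMalloc(void** ptr, std::size_t bytes) { return cudaMalloc(ptr, bytes); }
inline gpuError_t gpuFree(void* ptr) { return cudaFree(ptr); }
inline gpuError_t gpuDeviceSynchronize() { return cudaDeviceSynchronize(); }
#endif
```

Because every wrapper is a trivial inline forward, a CUDA build compiles down to plain CUDA runtime calls with nothing from ROCm/HIP on the include path.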

mphoward commented 2 years ago

When we do this, we should make sure to look at the CMake setup too. Right now, there is a lot of complicated code needed to trick CMake into using HIP as CUDA. It would be nice if we could only take this path when HIP is being used to do the compilation.

joaander commented 2 years ago

> When we do this, we should make sure to look at the CMake setup too. Right now, there is a lot of complicated code needed to trick CMake into using HIP as CUDA. It would be nice if we could only take this path when HIP is being used to do the compilation.

Yes. We may also need to refactor the HIP/CUDA CMake code considerably for this issue and #1101. This code was written for very early ROCm/HIP tools and has not been maintained over the years.

I don't have access to an AMD GPU system for testing yet. OLCF's test system is not yet open to INCITE projects, and NCSA Delta (which will have only one AMD GPU node) has been delayed again. If these are still not available when I start working on this, I will make the conversion while testing only on CUDA, then make the changes needed for AMD support later.

mphoward commented 2 years ago

> I will make the conversion while testing only on CUDA, then make the changes needed for AMD support later.

I support this. I also do not have an AMD system to test on, and ROCm/HIP has been unstable, at least in the past.

joaander commented 1 year ago

Revisiting this: It will be a significant effort to port HOOMD to hipper and may require updates to hipper itself. I need to look more into this before proceeding.

However, the alternative is to continue using only HIP. Current versions of HIP are no longer header-only and require a build and install step. I find AMD's compilation documentation severely lacking, and I cannot expect the majority of HOOMD users to follow it. Additionally, HIP has not been updated on conda-forge for several years. This alternative would therefore require that we learn how to build and install HIP, document it for our users, and maintain a conda-forge package.

mphoward commented 1 year ago

I admittedly have not been trying to keep hipper up to date because (1) the features we are actually using are pretty minimal and (2) I don’t have any AMD GPUs for testing. We could be more active in this if we want to pursue using it throughout HOOMD.

If a conversion is going to be made, have you given any thought to Intel oneAPI? I haven’t tried it so I’m not sure what use/performance is actually like. That is probably even more work, though, I would imagine.

Using only HIP is a little more palatable now that the CMake build system is fixed, but I agree that the documentation is generally very poor.

joaander commented 1 year ago

I also have not tried oneAPI, but I have spoken with people who have. It is a vastly different programming model from CUDA/HIP. In addition to rewrites of all kernels, it would require a complete overhaul of the memory management system, as oneAPI requires the use of its provided memory management classes. I got conflicting answers on whether there is any possibility of oneAPI/CUDA interoperability. One knowledgeable individual indicated that it is not possible at all, which would force us to do a complete port or none at all.

oneAPI also has no support for zero-copy interoperability with Python at this time - which is one of the most popular features of v3.

For Intel GPU support, there is a third-party package that implements HIP on Intel. The large DOE centers have an interest in supporting projects like that.

If oneAPI gains traction in the long run and replaces HIP in the community, we will need to consider a port then. At present, a oneAPI port would require a massive time investment and would remove functionality.

Switching to hipper in the meantime will be time-consuming, but not unduly so. If I convert to hipper, I can remove the outdated HIP and hipCUB submodules. It is only a matter of time before one of those runs into compatibility issues with a new CUDA version.

mphoward commented 1 year ago

OK! That all makes sense. oneAPI would be an enormous amount of work then, and it’s unclear how much traction it will have.

I favor the hipper approach, then, since it will allow true CUDA builds without any dependencies we have to maintain. It also means we don't have to teach users how to compile HIP.

jglaser commented 1 year ago

Please let me know if you need me to test any code on AMD.

hdelan commented 1 year ago

While SYCL (oneAPI) was originally designed with its own memory management model (the buffer/accessor model), SYCL 2020 has full support for unified shared memory (USM), meaning memory management can be identical in SYCL to what it is in CUDA/HIP (using `malloc_device`, `memcpy`, etc.).
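To make the comparison concrete, here is a minimal USM sketch (assuming a SYCL 2020 implementation such as DPC++; not HOOMD-blue code) that has the same shape as a cudaMalloc/cudaMemcpy/kernel-launch sequence:

```cpp
// Minimal SYCL 2020 USM sketch: allocate device memory, copy in, run a
// kernel, copy out -- structurally the same as the CUDA runtime version.
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

int main() {
    sycl::queue q;  // default device
    const std::size_t n = 1024;
    std::vector<float> host(n, 1.0f);

    float* device = sycl::malloc_device<float>(n, q);
    q.memcpy(device, host.data(), n * sizeof(float)).wait();
    q.parallel_for(sycl::range<1>{n},
                   [=](sycl::id<1> i) { device[i] *= 2.0f; }).wait();
    q.memcpy(host.data(), device, n * sizeof(float)).wait();
    sycl::free(device, q);
}
```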

There are tools to automatically port CUDA code to SYCL, see https://www.intel.com/content/www/us/en/developer/articles/technical/syclomatic-new-cuda-to-sycl-code-migration-tool.html

SYCL/CUDA interoperability is fully supported through the use of `host_task`. See here for more: https://github.com/codeplaysoftware/SYCL-For-CUDA-Examples/tree/master/examples/cuda_interop
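A rough sketch of the mechanism (assuming DPC++ and its `sycl::backend::ext_oneapi_cuda` enumerator; the native handle types are implementation-defined, and `existing_cuda_routine` is a hypothetical placeholder):

```cpp
// Rough sketch of driving existing CUDA code from SYCL via a host task.
// Assumes DPC++ with the CUDA backend; existing_cuda_routine stands in
// for real CUDA library code (cuBLAS, custom kernels, ...).
#include <sycl/sycl.hpp>

void submit_native_work(sycl::queue& q) {
    q.submit([&](sycl::handler& h) {
        h.host_task([=](sycl::interop_handle ih) {
            // Obtain the underlying CUDA stream for this queue.
            auto stream =
                ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
            // existing_cuda_routine(stream);  // hypothetical call
        });
    });
    q.wait();
}
```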

Performance of SYCL vs. native APIs is extremely competitive; perhaps @zjin-lcf can comment on the latest oneAPI vs. HIP performance. SYCL also gives the advantage that single-source SYCL code can be compiled to run on CUDA, HIP, and Intel platforms (including OpenCL CPU platforms).

joaander commented 1 year ago

I spoke with Teja Alaghari in September 2022 and discussed the possibility of Intel developers providing a prototype port. I have neither seen nor heard of any progress on this. I am open to pull requests that add SYCL as an alternate code path to begin exploring the possibilities. If anyone does so, please base work on the trunk-major branch.

However, I am not interested in fully converting HOOMD-blue to SYCL at this time. The complete rewrite would require a massive amount of effort in porting and testing, while:

* there are currently no national HPC centers with Intel GPUs,
* the longevity and stability of SYCL are unknown,
* there is no zero-copy interface to interact with SYCL memory buffers in Python,
* SYCL is not available on the conda-forge build system, and
* users on currently supported platforms would need to install additional dependencies to build and/or use HOOMD-blue.

In other words, I do not have the free time available to invest in such a port. Even if I did, doing so would remove popular features and require users to make drastic changes in order to continue using HOOMD-blue.

zjin-lcf commented 1 year ago

@joaander

I'd like to raise your comments as a feature request. Can you please explain "no zero-copy interface to interact with SYCL memory buffers in Python"? Thank you for your comments about SYCL.

joaander commented 1 year ago

CuPy (https://cupy.dev/) provides the `__cuda_array_interface__`, which allows Python C extensions to directly access GPU memory buffers without copying the data. We use this to provide users with direct access to particle and force data (e.g. https://hoomd-blue.readthedocs.io/en/v3.8.1/module-hoomd-data.html#hoomd.data.LocalSnapshotGPU) so that they can write Python extensions that customize their simulations with minimal overhead. This is popular because users prefer writing Python code to writing a compiled C++ extension.
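For illustration, a minimal sketch of the consuming side (assuming pybind11; `device_pointer` is a hypothetical helper, not HOOMD-blue code):

```cpp
// Hypothetical helper (not HOOMD-blue code): read the raw device pointer
// from any object exposing __cuda_array_interface__ (e.g. a CuPy array)
// without copying the underlying GPU data.
#include <cstdint>
#include <utility>
#include <pybind11/pybind11.h>

namespace py = pybind11;

void* device_pointer(const py::object& array) {
    auto iface = array.attr("__cuda_array_interface__").cast<py::dict>();
    // Per the interface spec, "data" is a (pointer, read_only) pair.
    auto data = iface["data"].cast<std::pair<std::uintptr_t, bool>>();
    return reinterpret_cast<void*>(data.first);
}
```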

hdelan commented 1 year ago

There are some Python projects that sit on the SYCL runtime and plugins; see https://github.com/IntelPython/dpctl and https://intelpython.github.io/dpnp/index.html . Both seem to be available on conda-forge. I am not sure if they have the zero-copy feature that you use, but I can investigate.

Note that GROMACS has chosen to use SYCL over HIP as the API to target AMD GPUs.

joaander commented 1 year ago

I see that dpctl and dpnp are available via the intel and anaconda channels. https://anaconda.org/search?q=dpnp https://anaconda.org/search?q=dpctl

Even though these are accessed with the conda package manager, these channels are not the same as the community-driven conda-forge project where I distribute HOOMD-blue: https://conda-forge.org/ . The conda-forge ecosystem supports CUDA, but not HIP and not SYCL: https://conda-forge.org/docs/maintainer/knowledge_base.html#cuda-builds.

The GROMACS and NAMD developers are free to do as they choose. They both have a much larger group of developers than HOOMD-blue, so I presume they are more willing and able to continually rewrite their code. Until a plurality of HPC systems offer Intel GPUs and all HOOMD-blue features can be supported via SYCL, there is no reason to spend our limited effort on a complete port.