Closed peastman closed 1 week ago
@ex-rzr I'm trying to include the HIP platform in OpenMM pip packages. I'm running into problems installing the SDK.
On Linux (manylinux_2_28, which is based on AlmaLinux 8), I try to install with the command

```shell
amdgpu-install --usecase=hiplibsdk
```

It fails with the error

```
Error: Unable to find a match: kernel-devel-6.5.0-1025-azure
```
Any idea how I can fix that and get it to install? It doesn't need to be able to run programs with HIP. I just need the libraries and headers to build against.
On Windows I have a more basic problem: I can't even download the SDK! The website requires you to click through a license agreement before you can download it. That prevents downloading it in any automated context, such as a GitHub Actions runner.
> On Windows I have a more basic problem: I can't even download the SDK! The website requires you to click through a license agreement before you can download it. That prevents downloading it in any automated context, such as a GitHub Actions runner.
When I click the "Accept" button, it starts downloading. The link is https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-Win10-Win11-For-HIP.exe Can this link be used directly in your case?
> Any idea how I can fix that and get it to install? It doesn't need to be able to run programs with HIP. I just need the libraries and headers to build against.
I don't know, but I'll investigate and ask around.
Does it work with `--no-dkms`?

```shell
amdgpu-install --no-dkms --usecase=hiplibsdk
```

I guess it should be `amdgpu-install -y --no-dkms --usecase=hiplibsdk`.
> Does it work with `--no-dkms`? `amdgpu-install --no-dkms --usecase=hiplibsdk`
That worked. Thanks!
The problem now is that it's running out of disk space. It says the install needs 24 GB (!) of space, which is almost 13 GB more than is available. If we install with the package manager rather than `amdgpu-install`, is it possible to make it install only the parts we need? It looks like most of the space is taken up by things OpenMM doesn't use. The really big ones are `hipblaslt-devel` (over 5 GB just for that), `rocblas`, `rocsparse`, and `rocsolver`.
Here are (modified) commands from my old `build.sh` script for building OpenMM-HIP conda packages:

```shell
# EPEL repository is required for perl-File-BaseDir and perl-URI-Encode
yum -y install epel-release
# Install all required ROCm packages
yum -y install https://repo.radeon.com/amdgpu-install/6.2.2/el/8.10/amdgpu-install-6.2.60202-1.el8.noarch.rpm
yum -y install rocm-device-libs hip-devel hip-runtime-amd
```
I'm not sure if `epel-release` is still required; I don't see `perl-File-BaseDir` and `perl-URI-Encode` in your build logs. The log has "Dependencies resolved.", so I assume it's not needed. `rocfft-devel` and `hipfft-devel` are removed as we don't need them.

Thanks! I'll try that.
I have noticed that the Windows build can't find HIP. You use `LIST(APPEND CMAKE_MODULE_PATH "C:/Program Files/AMD/ROCm/6.1/cmake")`, while when I build OpenMM-HIP locally I use `-D CMAKE_PREFIX_PATH="C:\Program Files\AMD\ROCm\6.1"` (`CMAKE_PREFIX_PATH`, no `/cmake`, backslashes in the path). Could that be the issue?

Perhaps calling cmake with `--verbose --trace` can tell us more.
I'd also tried that, though specified in a slightly different way. Right now I'm working on getting a VM set up so I can test locally and figure out what works.
Everything is building successfully now. Thanks so much!
I have a related question. Do you know whether HIP maintains binary compatibility across releases? It affects how we package the HIP platform, and I can't find any documentation about it.
OpenCL maintains binary compatibility. If you compile against an old version, it also works with newer versions. That means we only need to compile the OpenCL platform once, and we can include it in the main wheel.
CUDA does not. If you compile against CUDA 11, it doesn't work with CUDA 12. That forces us to split it off into a separate wheel and build multiple versions of it. When you install, you have to specify what version you want with `pip install openmm[cuda12]`.
What about HIP? Can we include it in the main wheel, or do we need to split it off into a separate versioned wheel that you install with `pip install openmm[hip6]`?
> Everything is building successfully now. Thanks so much!
Congratulations! I'm glad I could help.
> CUDA does not. If you compile against CUDA 11, it doesn't work with CUDA 12.
I couldn't find an official statement, but based on my previous experience I think that HIP is close to CUDA in this regard: within a major release there is backward and forward compatibility.
OpenMM HIP does not depend on device code because of JIT compilation; also, the hashes used for kernel caching depend on the HIP runtime version, just in case: https://github.com/openmm/openmm/blob/8.2.0beta/platforms/hip/src/HipContext.cpp#L519 So this can't be a problem.
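The versioned cache keys mentioned above can be sketched like this. This is a loose Python analogy under my own assumptions, not the actual C++ logic in HipContext.cpp:

```python
import hashlib

def kernel_cache_key(kernel_source, hip_runtime_version):
    """Mix the HIP runtime version into the cache hash, so a kernel
    cached under one ROCm release is never reused with another."""
    h = hashlib.sha256()
    h.update(hip_runtime_version.encode())
    h.update(kernel_source.encode())
    return h.hexdigest()

src = 'extern "C" __global__ void f() {}'
# Same source, different runtime versions -> different cache entries
print(kernel_cache_key(src, "6.1") != kernel_cache_key(src, "5.7"))  # True
```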
BUT the HIP platform libraries are linked to `.so.6` sonames:

```
$ ldd libOpenMMHIP.so
...
libhiprtc.so.6 => /opt/rocm/lib/libhiprtc.so.6 (0x00007ac9c3bc2000)
...
libamdhip64.so.6 => /opt/rocm/lib/libamdhip64.so.6 (0x00007ac9c18ea000)
...
```
So it's not compatible with the old major release 5, as the ROCm libraries have a different SONAME (and the same will be true for future releases, even if there are no breaking changes in the API of the runtime functions we use in the project).
It seems that having separate versions per major ROCm release is the only way to handle it. At least that's how I understand the situation.
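As an illustration, the ROCm major release a library was linked against can be read off the soname itself. A sketch, assuming the `libname.so.N` pattern shown in the ldd output above:

```python
import re

def rocm_major_from_soname(soname):
    """Extract the major version from a soname like 'libamdhip64.so.6'."""
    m = re.search(r"\.so\.(\d+)", soname)
    if m is None:
        raise ValueError("no version suffix in %r" % soname)
    return int(m.group(1))

print(rocm_major_from_soname("libamdhip64.so.6"))  # 6
print(rocm_major_from_soname("libhiprtc.so.6"))    # 6
```

A versioned wheel (e.g. one per major release, as discussed above) could use a check like this at build time to name itself after the ROCm release it was built against.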
Thanks! That's what I'll do.
This is to include the HIP platform in PyPI packages.