Closed peastman closed 1 week ago
@ex-rzr I'm trying to include the HIP platform in OpenMM pip packages. I'm running into problems installing the SDK.
On Linux (manylinux_2_28, which is based on AlmaLinux 8), I try to install with the command

```shell
amdgpu-install --usecase=hiplibsdk
```

It fails with the error

```
Error: Unable to find a match: kernel-devel-6.5.0-1025-azure
```
Any idea how I can fix that and get it to install? It doesn't need to be able to run programs with HIP. I just need the libraries and headers to build against.
On Windows I have a more basic problem: I can't even download the SDK! The website requires you to click through a license agreement before you can download it. That prevents downloading it in any automated context, such as a GitHub Actions runner.
> On Windows I have a more basic problem: I can't even download the SDK! The website requires you to click through a license agreement before you can download it. That prevents downloading it in any automated context, such as a GitHub Actions runner.
When I click the "Accept" button, it starts downloading. The link is https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-Win10-Win11-For-HIP.exe Can this link be used directly in your case?
> Any idea how I can fix that and get it to install? It doesn't need to be able to run programs with HIP. I just need the libraries and headers to build against.
I don't know, but I'll investigate and ask around.
Does it work with `--no-dkms`?

```shell
amdgpu-install --no-dkms --usecase=hiplibsdk
```

I guess it should be `amdgpu-install -y --no-dkms --usecase=hiplibsdk`.
> Does it work with `--no-dkms`? `amdgpu-install --no-dkms --usecase=hiplibsdk`
That worked. Thanks!
The problem now is that it's running out of disk space. It says the install needs 24 GB (!) of space, which is almost 13 GB more than is available. If we install with the package manager rather than `amdgpu-install`, is it possible to make it install only the parts we need? It looks like most of the space is taken up by things OpenMM doesn't use. The really big ones are `hipblaslt-devel` (over 5 GB just for that), `rocblas`, `rocsparse`, and `rocsolver`.
Here are (modified) commands from my old `build.sh` script for building OpenMM-HIP conda packages:

```shell
# EPEL repository is required for perl-File-BaseDir and perl-URI-Encode
yum -y install epel-release
# Install all required ROCm packages
yum -y install https://repo.radeon.com/amdgpu-install/6.2.2/el/8.10/amdgpu-install-6.2.60202-1.el8.noarch.rpm
yum -y install rocm-device-libs hip-devel hip-runtime-amd
```
I'm not sure if `epel-release` is still required; I don't see `perl-File-BaseDir` and `perl-URI-Encode` in your build logs. The log has "Dependencies resolved.", so I assume it's not needed. `rocfft-devel` and `hipfft-devel` are removed as we don't need them.

Thanks! I'll try that.
I have noticed that the Windows build can't find HIP. You use `LIST(APPEND CMAKE_MODULE_PATH "C:/Program Files/AMD/ROCm/6.1/cmake")`, while when I build OpenMM-HIP locally I use `-D CMAKE_PREFIX_PATH="C:\Program Files\AMD\ROCm\6.1"` (`CMAKE_PREFIX_PATH`, no `/cmake`, backslashes in the path). Could that be the issue?

Perhaps calling cmake with `--verbose --trace` can tell us more.
I'd also tried that, though specified in a slightly different way. Right now I'm working on getting a VM set up so I can test locally and figure out what works.
Everything is building successfully now. Thanks so much!
I have a related question. Do you know whether HIP maintains binary compatibility across releases? It affects how we package the HIP platform, and I can't find any documentation about it.
OpenCL maintains binary compatibility. If you compile against an old version, it also works with newer versions. That means we only need to compile the OpenCL platform once, and we can include it in the main wheel.
CUDA does not. If you compile against CUDA 11, it doesn't work with CUDA 12. That forces us to split it off into a separate wheel and build multiple versions of it. When you install, you have to specify what version you want with `pip install openmm[cuda12]`.
What about HIP? Can we include it in the main wheel, or do we need to split it off into a separate versioned wheel that you install with `pip install openmm[hip6]`?
> Everything is building successfully now. Thanks so much!
Congratulations! I'm glad I could help.
> CUDA does not. If you compile against CUDA 11, it doesn't work with CUDA 12.
I couldn't find an official statement, but based on my previous experience I think that HIP is close to CUDA in this regard: within a major release there is backward and forward compatibility.
OpenMM HIP does not depend on device code because of JIT compilation; also, the hashes used for kernel caching depend on the HIP runtime version, just in case: https://github.com/openmm/openmm/blob/8.2.0beta/platforms/hip/src/HipContext.cpp#L519 So this can't be a problem.
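The versioned cache keys mentioned above can be sketched like this. This is a loose Python analogy under my own assumptions, not the actual C++ logic in HipContext.cpp:

```python
import hashlib

def kernel_cache_key(kernel_source, hip_runtime_version):
    """Mix the HIP runtime version into the cache hash, so a kernel
    cached under one ROCm release is never reused with another."""
    h = hashlib.sha256()
    h.update(hip_runtime_version.encode())
    h.update(kernel_source.encode())
    return h.hexdigest()

src = 'extern "C" __global__ void f() {}'
# Same source, different runtime versions -> different cache entries
print(kernel_cache_key(src, "6.1") != kernel_cache_key(src, "5.7"))  # True
```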
BUT the HIP platform libraries are linked to `.so.6` sonames:

```
$ ldd libOpenMMHIP.so
...
libhiprtc.so.6 => /opt/rocm/lib/libhiprtc.so.6 (0x00007ac9c3bc2000)
...
libamdhip64.so.6 => /opt/rocm/lib/libamdhip64.so.6 (0x00007ac9c18ea000)
...
```
So it's not compatible with the old major release 5, as the ROCm libraries have a different SONAME (and the same will be true for future releases, even if there are no breaking changes in the API of the runtime functions we use in the project).
It seems that having separate versions per major ROCm release is the only way to handle it. At least that's how I understand the situation.
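As an illustration, the ROCm major release a library was linked against can be read off the soname itself. A sketch, assuming the `libname.so.N` pattern shown in the ldd output above:

```python
import re

def rocm_major_from_soname(soname):
    """Extract the major version from a soname like 'libamdhip64.so.6'."""
    m = re.search(r"\.so\.(\d+)", soname)
    if m is None:
        raise ValueError("no version suffix in %r" % soname)
    return int(m.group(1))

print(rocm_major_from_soname("libamdhip64.so.6"))  # 6
print(rocm_major_from_soname("libhiprtc.so.6"))    # 6
```

A versioned wheel (e.g. one per major release, as discussed above) could use a check like this at build time to name itself after the ROCm release it was built against.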
Thanks! That's what I'll do.
This is to include the HIP platform in PyPI packages.